Major problems after upgrading 2 (of 4) Red Hat hosts to 4.4.10

I have a Hyperconverged cluster with 4 hosts.
Gluster is replicated across 2 hosts, and a 3rd host is an arbiter node. The 4th host is compute only.
I updated the compute-only node, as well as the arbiter node, early this morning. I didn't touch either of the actual storage nodes. That said, I forgot to upgrade the engine.
oVirt Manager thinks that all but 1 of the hosts in the cluster are unhealthy. However, all 4 hosts are online. oVirt Manager (the Engine) also keeps deactivating at least 1, if not 2, of the 3 (total) bricks behind each volume.
Even though the Engine thinks that only 1 host is healthy, VMs are clearly running on some of the other hosts. However, during troubleshooting, some of the customer VMs were turned off, and oVirt is refusing to start those VMs, because it only recognizes 1 of the hosts as healthy -- and that host's resources are maxed out.
This afternoon, I went ahead and upgraded (and rebooted) the Engine VM, so it is now up to date. Unfortunately, that didn't resolve the issue. So I took one of the "unhealthy" hosts which didn't have any VMs on it (the compute-only server hosting no gluster data), and I used oVirt to "reinstall" the oVirt software. That didn't resolve the issue for that host either.
How can I troubleshoot this? I need:
- To figure out why oVirt keeps trying to deactivate volumes; from the command line, `gluster peer status` shows all nodes connected, and all volumes appear to be healthy
- More importantly, to get the VMs that are currently down back online. Is there a way to somehow force oVirt to launch the VMs on the "unhealthy" nodes?
What logs should I be looking at? Any help would be greatly appreciated.
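For reference, the usual first stops for this kind of problem are the engine and host-side logs. A sketch assuming default oVirt 4.4 paths (adjust if your install differs):

```sh
# On the engine VM: why the engine marks hosts unhealthy / bricks down
tail -f /var/log/ovirt-engine/engine.log

# On each host: the agent the engine talks to, plus the gluster daemon
tail -f /var/log/vdsm/vdsm.log
tail -f /var/log/glusterfs/glusterd.log
ls /var/log/glusterfs/bricks/        # per-brick logs live here
```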

Here are some of the logs I see in the oVirt Event Manager on the compute-only host: [Screenshot from 2022-01-22 16-10-13.png]

Check the situation with the gluster storage. Also, you can check whether the compute-only node is added to the gluster trusted storage pool: `gluster pool list` (it should show localhost + 2 more).
Also, consider upgrading the engine. Most probably the engine cannot connect to vdsm and vdsm-gluster, and this makes the whole situation worse.
Best Regards,
Strahil Nikolov
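Concretely, those checks might look like this (a sketch; the volume name `data` is illustrative):

```sh
# On the compute-only node: is it part of the trusted storage pool (TSP)?
gluster pool list        # expect localhost plus the other pool members

# On a storage node: are all peers connected?
gluster peer status

# Per-volume health (volume name is illustrative):
gluster volume status data
gluster volume heal data info
```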

Thank you very much. Over the course of troubleshooting, I did wind up updating (and rebooting) the engine.
The issue "magically" resolved itself a few hours after I sent this email: the two Gluster nodes became operational again, and thus I wound up having enough compute power to bring everything back online. That said, the compute-only node is still not operational.
I'll take a look at the `gluster pool list` command. Thank you for that.

The oVirt settings on the Gluster side have server quorum enabled. If quorum drops below 50% + 1 servers, all bricks will shut down. If the compute-only node is part of the TSP (Gluster's cluster), it will also be counted toward quorum, and in most cases you don't want that. If that's the case, just remove it (from the main nodes) from the TSP.
Best Regards,
Strahil Nikolov
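To verify the quorum settings and, if appropriate, drop the compute-only node, something like the following should work (a sketch; the volume name `data` and the hostname are illustrative, and `peer detach` must only be run against a node that holds no bricks):

```sh
# Inspect the server-quorum configuration (queried via any volume):
gluster volume get data cluster.server-quorum-type
gluster volume get data cluster.server-quorum-ratio

# From one of the storage nodes, remove the brick-less compute node
# from the TSP so it no longer counts toward server quorum:
gluster peer detach compute1.example.com
```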

*cluster.server-quorum-ratio is set to 51%, which in your case means that you can afford only 1 server down (for both a 3-node and a 4-node TSP).
Best Regards,
Strahil Nikolov
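Spelling that arithmetic out (a sketch; this assumes glusterd keeps bricks up only while the number of active servers is at least the integer ceiling of ratio × total / 100):

```sh
# Worked quorum math for cluster.server-quorum-ratio = 51%
for total in 3 4; do
    need=$(( (51 * total + 99) / 100 ))   # integer ceiling of 51*total/100
    echo "TSP of $total nodes: $need must stay up -> tolerates $(( total - need )) down"
done
# TSP of 3 nodes: 2 must stay up -> tolerates 1 down
# TSP of 4 nodes: 3 must stay up -> tolerates 1 down
```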