I have a Hyperconverged cluster with 4 hosts.
Gluster is replicated across 2 hosts, and a 3rd host is an arbiter node.
The 4th host is compute only.
I updated the compute-only node and the arbiter node early this morning. I didn't touch either of the actual storage nodes. That said, I forgot to upgrade the engine.
oVirt Manager now thinks that all but one of the hosts in the cluster are unhealthy, even though all four hosts are online. The Engine also keeps deactivating at least one, and sometimes two, of the three bricks behind each volume.
Even though the Engine thinks only one host is healthy, VMs are clearly running on some of the other hosts. However, during troubleshooting some of the customer VMs were shut down, and oVirt is refusing to start them again because it only recognizes one host as healthy, and that host's resources are maxed out.
This afternoon I went ahead and upgraded (and rebooted) the Engine VM, so it is now up to date. Unfortunately, that didn't resolve the issue. So I took one of the "unhealthy" hosts that didn't have any VMs on it (the compute-only server, which hosts no Gluster data) and used oVirt to "Reinstall" the host. That didn't resolve the issue for that host either.
How can I troubleshoot this? I need:
- To figure out why oVirt keeps deactivating the bricks. From the command line, `gluster peer status` shows all nodes connected, and all volumes appear to be healthy (the exact checks I ran are below).
- More importantly, to get the VMs that are currently down back online. Is there a way to somehow force oVirt to launch them on the "unhealthy" hosts?
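For reference, these are the checks I've been running from the command line on one of the storage nodes (VOLNAME is just a placeholder for our actual volume names):

```
# Peers: all nodes show up as connected
gluster peer status

# Brick status: everything reports online here, even though the Engine keeps marking bricks down
gluster volume status

# Self-heal info: no entries pending heal (run per volume)
gluster volume heal VOLNAME info
```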
What logs should I be looking at? Any help would be greatly appreciated.