Hello everyone,
Any help with the following problem would be greatly appreciated.
In my lab, the day before yesterday, we had power issues: a UPS went offline, followed by
a power outage of the NFS/DNS server I have set up to serve oVirt with ISOs and to act as
a DNS server (our other DNS servers run as VMs within the oVirt environment). We also
found a broadcast storm on the switch that the oVirt nodes are connected to (caused by a
faulty NIC on the aforementioned UPS), and later on we had to re-establish several of the
virtual connections as well. As a result, one of the hosts became NonResponsive, two
machines became unresponsive and three VMs shut down.
The oVirt environment, version 4.3.5.2, is a replica 2 + arbiter 1 environment and runs
GlusterFS with the recommended volumes of data, engine and vmstore.
So far, whenever there was some kind of problem, oVirt was usually able to resolve it on
its own.
This time, however, after we recovered from the above state, the data and vmstore volumes
healed successfully, but the engine volume became stuck in the healing process (Up,
unsynced entries, needs healing). From the web GUI I see that the HostedEngine VM is
paused due to a storage I/O error, while the output of the virsh list --all command shows
that the HostedEngine is running. How is that possible?
I tried to manually trigger the healing process for the volume, but it had no effect, with
gluster volume heal engine
The command
gluster volume heal engine info
shows the following
[root@ov-no3 ~]# gluster volume heal engine info
Brick ov-no1.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0
Brick ov-no2.ariadne-t.local:/gluster_bricks/engine/engine
/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
Status: Connected
Number of entries: 1
Brick ov-no3.ariadne-t.local:/gluster_bricks/engine/engine
/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
Status: Connected
Number of entries: 1
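To make the output above easier to read at a glance, here is a minimal sketch that groups the pending entries by brick. It is only an illustration: the function name summarize_heal_info is hypothetical, and it assumes the plain-text layout shown above (a "Brick" line, then zero or more entry paths, then "Status:" and "Number of entries:" lines). It does not call gluster itself; it only parses text that has already been captured.

```python
# Sketch only: parses captured `gluster volume heal <vol> info` text.
# Assumes the layout shown in this post; other Gluster versions may differ.

SAMPLE = """\
[root@ov-no3 ~]# gluster volume heal engine info
Brick ov-no1.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0
Brick ov-no2.ariadne-t.local:/gluster_bricks/engine/engine
/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
Status: Connected
Number of entries: 1
Brick ov-no3.ariadne-t.local:/gluster_bricks/engine/engine
/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
Status: Connected
Number of entries: 1
"""


def summarize_heal_info(text):
    """Return {brick: [pending entry paths]} (hypothetical helper)."""
    pending = {}
    brick = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # A new brick section starts here.
            brick = line[len("Brick "):]
            pending[brick] = []
        elif brick and line and not line.startswith(("Status:", "Number of entries:")):
            # Any other non-empty line inside a brick section is a pending entry.
            pending[brick].append(line)
    return pending


summary = summarize_heal_info(SAMPLE)
# Bricks that still report at least one entry needing heal.
stuck_bricks = [brick for brick, entries in summary.items() if entries]

for brick, entries in summary.items():
    print(f"{brick}: {len(entries)} pending")
```

With the output pasted above, this shows that the same single image file is listed as pending on the ov-no2 and ov-no3 bricks, while ov-no1 reports nothing.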
This morning I came upon this Reddit post
https://www.reddit.com/r/gluster/comments/fl3yb7/entries_stuck_in_heal_pe... where it
seems that after a graceful reboot of one of the oVirt hosts, Gluster came back online
after completing the appropriate healing processes. However, from what I have read, when
there are unsynced entries in Gluster a host cannot be put into maintenance mode so that
it can be rebooted, correct?
Should I try to restart the glusterd service?
Could someone tell me what I should do?
Thank you all for your time and help,
Maria Souvalioti