Hi All,
We have a 3 node HCI cluster with Gluster 2+1 volumes (2 data bricks + 1 arbiter).
The first node had a hardware memory failure which caused file corruption on the engine LV,
and the server would only boot into maintenance mode.
For some reason glusterd wouldn't start, one of the volumes became inaccessible, and its
storage domain went offline. This caused multiple VMs to go into a paused or shut-down
state.
We put the host into maintenance mode and then shut it down in an attempt to let Gluster
continue across the remaining 2 nodes (one being the arbiter). Unfortunately this
didn't work.
The solution was to do the following on the failed node (rough commands are sketched below the list):
1. Remove the contents of /var/lib/glusterd except for glusterd.info
2. Start glusterd
3. Peer probe one of the other 2 peers
4. Restart glusterd
5. Cross fingers and toes
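For reference, this is roughly what I ran on the failed node. I'm going from memory, so treat it
as a sketch rather than an exact transcript; the peer hostname (node2) and volume name (VOLNAME)
are placeholders for our actual names:

    # with glusterd stopped, clear the config but keep the node's UUID file
    systemctl stop glusterd
    cd /var/lib/glusterd
    find . -mindepth 1 ! -name 'glusterd.info' -delete

    # start glusterd and re-probe one of the two surviving peers
    systemctl start glusterd
    gluster peer probe node2

    # restart so the peer/volume definitions synced from node2 are picked up
    systemctl restart glusterd

    # sanity checks
    gluster peer status
    gluster volume status
    gluster volume heal VOLNAME info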
Although this was a successful outcome, I would like to know why losing 1 Gluster peer
caused the outage of a single storage domain, and therefore outages of the VMs with disks on
that storage domain.
Kind Regards
Simon...