Hi All,
We have a 3 node HCI cluster with Gluster 2+1 volumes (2 data bricks + 1 arbiter).
The first node had a hardware memory failure which caused file corruption on the engine LV,
and the server would only boot into maintenance mode.
For some reason glusterd wouldn't start, one of the volumes became inaccessible, and its
storage domain went offline. This caused multiple VMs to go into a paused or shut-down
state.
We put the host into maintenance mode and then shut it down in an attempt to let Gluster
continue across the remaining 2 nodes (one being the arbiter). Unfortunately this
didn't work.
The solution was to do the following on the failed node (rough commands are sketched below the list):
1. Remove the contents of /var/lib/glusterd except for glusterd.info
2. Start glusterd
3. Peer probe one of the other 2 peers
4. Restart glusterd
5. Cross fingers and toes
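For reference, this is roughly what I ran on the failed node. I'm going from memory, so treat it
as a sketch rather than an exact transcript; the peer hostname (node2) and volume name (VOLNAME)
are placeholders for our actual names:

    # with glusterd stopped, clear the config but keep the node's UUID file
    systemctl stop glusterd
    cd /var/lib/glusterd
    find . -mindepth 1 ! -name 'glusterd.info' -delete

    # start glusterd and re-probe one of the two surviving peers
    systemctl start glusterd
    gluster peer probe node2

    # restart so the peer/volume definitions synced from node2 are picked up
    systemctl restart glusterd

    # sanity checks
    gluster peer status
    gluster volume status
    gluster volume heal VOLNAME info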
Although this was a successful outcome, I would like to know why losing 1 Gluster peer
caused the outage of a single storage domain, and therefore outages of the VMs with disks on
that storage domain.
Kind Regards
Simon...