What you are missing is the fact that gluster requires more than one set of bricks to recover from a dead host. I.e. in your setup, you'd need 6 hosts: 4x replicas and 2x arbiters, with at least one set (2x replicas and 1x arbiter) operational at a bare minimum.
Automated commands to fix the volume do not exist otherwise. (It's a Gluster limitation.) This can be fixed manually however.
Standard Disclaimer: Back up your data first! Fixing this issue requires manual intervention. Reader assumes all responsibility for any action resulting from the instructions below. Etc.
If it's just a dead brick (i.e. the host is still functional), all you really need to do is replace the underlying storage:
1. Take the gluster volume offline.
2. Remove the bad storage device, and attach the replacement.
3. rsync / scp / etc. the data from a known good brick (be sure to include hidden files and to preserve file times, ownership, SELinux labels, etc.).
4. Restart the gluster volume.
Gluster *might* still need to heal everything after all of that, but it should start the volume and get it running again.
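
For reference, the four steps above might look roughly like the sketch below when run as root on the host with the replaced disk. The volume name and brick path are taken from the output quoted further down; the choice of office-wx-hv2-lab-gfs as the known good brick, the rsync-over-ssh transport, and the helper names (run, replace_brick_data) are assumptions for illustration only. By default it only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
# VOLUME and BRICK come from the volume info quoted in this thread.
# GOOD_HOST is an assumption: any host holding a known good data brick
# (not the arbiter, which stores no file contents).
VOLUME="${VOLUME:-GV2Data}"
BRICK="${BRICK:-/mnt/LogGFSData/brick}"
GOOD_HOST="${GOOD_HOST:-office-wx-hv2-lab-gfs}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

replace_brick_data() {
    # 1. Take the volume offline (--mode=script skips the y/n prompt).
    run gluster --mode=script volume stop "$VOLUME"
    # 2. Swapping the disk and mounting the replacement is a manual step.
    # 3. Copy everything from the good brick: -a keeps times and ownership,
    #    -H hard links, -A ACLs, -X extended attributes (SELinux labels
    #    included). The trailing slashes copy the directory contents,
    #    hidden files and directories included.
    run rsync -aHAX --numeric-ids "root@${GOOD_HOST}:${BRICK}/" "${BRICK}/"
    # 4. Bring the volume back and list what still needs healing.
    run gluster volume start "$VOLUME"
    run gluster volume heal "$VOLUME" info
}
```

Calling replace_brick_data with the defaults prints the four commands for review; setting DRY_RUN=0 would actually execute them, so double-check the host and paths first.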
If the host itself is dead (and the underlying storage is still functional), you can just move the underlying storage over to the new host:
1. Take the gluster volume offline.
2. Attach the old storage.
3. Fix up the ids on the volume file.
4. Restart the gluster volume.
If both the host and underlying storage are dead, you'll need to do both tasks:
1. Take the gluster volume offline.
2. Attach the new storage.
3. rsync / scp / etc. the data from a known good brick (be sure to include hidden files and to preserve file times, ownership, SELinux labels, etc.).
4. Fix up the ids on the volume file.
5. Restart the gluster volume.
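
Which ids need fixing depends on the setup, and the steps above do not spell them out. One id that commonly has to match on a freshly prepared brick is the trusted.glusterfs.volume-id extended attribute on the brick root, which should correspond to the Volume ID reported by gluster volume info. Assuming that is at least part of what is meant here, the sketch below derives the attribute value from the Volume ID quoted further down and prints (does not run) the getfattr / setfattr commands. The function names are made up for illustration.

```shell
#!/bin/sh
# Assumption: the id in question is the trusted.glusterfs.volume-id
# extended attribute on the brick root directory.

# Turn a Volume ID (a UUID with dashes) into the 0x-prefixed hex string
# that getfattr/setfattr use for binary attribute values.
volume_id_to_xattr() {
    printf '0x%s\n' "$(printf '%s' "$1" | tr -d '-' | tr 'A-F' 'a-f')"
}

# Print the commands to inspect and set the attribute.
# They are only printed here; running them requires root.
print_volume_id_fix() {
    vol_id="$1"
    brick="$2"
    echo "getfattr -n trusted.glusterfs.volume-id -e hex $brick"
    echo "setfattr -n trusted.glusterfs.volume-id -v $(volume_id_to_xattr "$vol_id") $brick"
}

# Volume ID and brick path as shown in the quoted volume info.
print_volume_id_fix c1946fc2-ed94-4b9f-9da3-f0f1ee90f303 /mnt/LogGFSData/brick
```

If the rsync in step 3 was run as root with extended attributes preserved, the attribute may already be in place, in which case the getfattr line is enough to confirm it.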
Keep one thing in mind, however: if the gluster host you are replacing is the one oVirt uses to connect to the volume (i.e. it's the host named in the volume config in the Admin portal), the new host will need to retain the old hostname / IP, or you'll need to update oVirt's config. Otherwise the VM hosts will wind up in Unassigned / Non-functional status.
- Patrick Hibbs
On Sun, 2022-07-17 at 22:15 +0300, Gilboa Davara wrote:
Hello all,
I'm attempting to replace a dead host in a replica 2 + arbiter gluster setup and replace it with a new host.
I've already set up a new host (same hostname..localdomain) and got into the cluster.
$ gluster peer status
Number of Peers: 2
Hostname: office-wx-hv3-lab-gfs
Uuid: 4e13f796-b818-4e07-8523-d84eb0faa4f9
State: Peer in Cluster (Connected)
Hostname: office-wx-hv1-lab-gfs.localdomain <------ This is a new host.
Uuid: eee17c74-0d93-4f92-b81d-87f6b9c2204d
State: Peer in Cluster (Connected)
$ gluster volume info GV2Data
Volume Name: GV2Data
Type: Replicate
Volume ID: c1946fc2-ed94-4b9f-9da3-f0f1ee90f303
Status: Stopped
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: office-wx-hv1-lab-gfs:/mnt/LogGFSData/brick <------ This is the dead host.
Brick2: office-wx-hv2-lab-gfs:/mnt/LogGFSData/brick
Brick3: office-wx-hv3-lab-gfs:/mnt/LogGFSData/brick (arbiter)
...
Looking at the docs, it seems that I need to remove the dead brick.
$ gluster volume remove-brick GV2Data office-wx-hv1-lab-gfs:/mnt/LogGFSData/brick start
Running remove-brick with cluster.force-migration enabled can result in data corruption. It is safer to disable this option so that files that receive writes during migration are not migrated.
Files that are not migrated can then be manually copied after the remove-brick commit operation.
Do you want to continue with your current cluster.force-migration settings? (y/n) y
volume remove-brick start: failed: Removing bricks from replicate configuration is not allowed without reducing replica count explicitly
So I guess I need to drop from replica 2 + arbiter to replica 1 + arbiter (?).
$ gluster volume remove-brick GV2Data replica 1 office-wx-hv1-lab-gfs:/mnt/LogGFSData/brick start
Running remove-brick with cluster.force-migration enabled can result in data corruption. It is safer to disable this option so that files that receive writes during migration are not migrated.
Files that are not migrated can then be manually copied after the remove-brick commit operation.
Do you want to continue with your current cluster.force-migration settings? (y/n) y
volume remove-brick start: failed: need 2(xN) bricks for reducing replica count of the volume from 3 to 1
... What am I missing?
- Gilboa