Adding gluster pool list:

UUID                                  Hostname             State
2c86fa95-67a2-492d-abf0-54da625417f8  vmm12.mydomain.com   Connected
ab099e72-0f56-4d33-a16b-ba67d67bdf9d  vmm13.mydomain.com   Connected
c35ad74d-1f83-4032-a459-079a27175ee4  vmm14.mydomain.com   Connected
aeb7712a-e74e-4492-b6af-9c266d69bfd3  vmm17.mydomain.com   Connected
4476d434-d6ff-480f-b3f1-d976f642df9c  vmm16.mydomain.com   Connected
22ec0c0a-a5fc-431c-9f32-8b17fcd80298  vmm15.mydomain.com   Connected
caf84e9f-3e03-4e6f-b0f8-4c5ecec4bef6  vmm18.mydomain.com   Connected
18385970-aba6-4fd1-85a6-1b13f663e60b  vmm10.mydomain.com   Disconnected   // server that went bad
b152fd82-8213-451f-93c6-353e96aa3be9  vmm102.mydomain.com  Connected      // vmm10, but with a different name
228a9282-c04e-4229-96a6-67cb47629892  localhost            Connected
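
For reference, the stale vmm10.mydomain.com entry above is what would eventually have to be dropped with 'gluster peer detach' once no volume references its bricks any more (the command refuses while bricks still point at that host). A rough sketch only, using the hostname from the listing above:

   gluster peer detach vmm10.mydomain.com
   # if the host is truly gone and detach still refuses, 'force' skips that check
   gluster peer detach vmm10.mydomain.com force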
On Tue, Jun 11, 2019 at 11:24 AM Adrian Quintero <adrianquintero(a)gmail.com>
wrote:
Strahil,
Looking at your suggestions I think I need to provide a bit more info on
my current setup.
1. I have 9 hosts in total.

2. I have 5 storage domains:
   - hosted_storage (Data Master)
   - vmstore1 (Data)
   - data1 (Data)
   - data2 (Data)
   - ISO (NFS)  // had to create this one because oVirt 4.3.3.1 would not
     let me upload disk images to a data domain without an ISO (I think
     this is due to a bug)

3. Each volume is of the type "Distributed Replicate" and each one is
   composed of 9 bricks.
   I started with 3 bricks per volume due to the initial Hyperconverged
   setup, then I expanded the cluster and the gluster cluster by 3 hosts at
   a time until I got to a total of 9 hosts.
Disks, bricks and sizes used per volume:

   /dev/sdb  engine    100GB
   /dev/sdb  vmstore1  2600GB
   /dev/sdc  data1     2600GB
   /dev/sdd  data2     2600GB
   /dev/sde  --------  400GB SSD, used for caching purposes

From the above layout a few questions came up:
1. Using the web UI, how can I create a 100GB brick and a 2600GB brick to
   replace the bad bricks for "engine" and "vmstore1" on the same block
   device (sdb)? And what about /dev/sde (the caching disk)? When I tried
   creating a new brick through the UI I saw that I could use /dev/sde for
   caching, but only for 1 brick (i.e. vmstore1), so if I try to create
   another brick, how would I specify that the same /dev/sde device should
   be used for caching?
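
   For what it's worth, outside of the UI this layout would normally be
   handled at the LVM layer: both bricks live as thin LVs in one volume
   group on /dev/sdb, and /dev/sde is attached as an lvmcache cache pool on
   that VG's thin pool, so the one SSD ends up caching both bricks. A rough
   hand-made sketch only, assuming XFS bricks; the sizes and VG/LV names are
   made up and not necessarily what the oVirt hyperconverged installer uses:

      pvcreate /dev/sdb /dev/sde
      vgcreate gluster_vg_sdb /dev/sdb /dev/sde
      # one thin pool on sdb only (size is an assumption, adjust to the real device)
      lvcreate -L 2600G -T gluster_vg_sdb/pool_sdb /dev/sdb
      # the SSD becomes a cache pool attached to that thin pool (leave room for cache metadata)
      lvcreate --type cache-pool -L 350G -n cache_sde gluster_vg_sdb /dev/sde
      lvconvert --type cache --cachepool gluster_vg_sdb/cache_sde gluster_vg_sdb/pool_sdb
      # two thin LVs carved out of the same pool, one per brick
      lvcreate -V 100G  -T gluster_vg_sdb/pool_sdb -n gluster_lv_engine
      lvcreate -V 2600G -T gluster_vg_sdb/pool_sdb -n gluster_lv_vmstore1
      mkfs.xfs -i size=512 /dev/gluster_vg_sdb/gluster_lv_engine
      mkfs.xfs -i size=512 /dev/gluster_vg_sdb/gluster_lv_vmstore1
      mkdir -p /gluster_bricks/engine /gluster_bricks/vmstore1
      mount /dev/gluster_vg_sdb/gluster_lv_engine   /gluster_bricks/engine
      mount /dev/gluster_vg_sdb/gluster_lv_vmstore1 /gluster_bricks/vmstore1

   The resulting mount points could then be used as the brick paths in a
   replace-brick or reset-brick.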
1. If I want to remove a brick (it being a replica 3), I go to Storage >
   Volumes > select the volume > Bricks; once in there I can select the 3
   servers that compose the replicated brick set and click Remove. This
   gives a pop-up window with the following info:

   Are you sure you want to remove the following Brick(s)?
   - vmm11:/gluster_bricks/vmstore1/vmstore1
   - vmm12.virt.iad3p:/gluster_bricks/vmstore1/vmstore1
   - 192.168.0.100:/gluster-bricks/vmstore1/vmstore1
   - Migrate Data from the bricks?

   If I proceed with this, it means I will have to do it for all 4 volumes,
   which is just not very efficient. If that is the only way, then I am
   hesitant to put this into a real production environment, as there is no
   way I can take that kind of a hit for 500+ VMs :) and I also won't have
   that much spare storage or extra volumes to play with in a real
   scenario. (A CLI sketch of this remove-brick is at the end of this mail.)
2. After modifying /etc/vdsm/vdsm.id yesterday, by following
   https://stijn.tintel.eu/blog/2013/03/02/ovirt-problem-duplicate-uuids, I
   was able to add the server back to the cluster using a new FQDN and a
   new IP, and I tested replacing one of the bricks. This is my mistake: as
   mentioned in #3 above, I used /dev/sdb entirely for 1 brick, because
   through the UI I could not split the block device so it could be used
   for 2 bricks (one for the engine and one for vmstore1). So in the
   "gluster vol info" you might see vmm102.mydomain.com, but in reality it
   is myhost1.mydomain.com. (The vdsm.id step is also sketched at the end
   of this mail.)
3. I am also attaching gluster_peer_status.txt, and in the last 2 entries
   of that file you will see an entry for vmm10.mydomain.com (the old/bad
   entry) and vmm102.mydomain.com (the new entry; the same server as vmm10,
   but renamed to vmm102). Also please find the gluster_vol_info.txt file.
4. I am ready to redeploy this environment if needed, but I am also ready
   to test any other suggestion. If I can get a good understanding of how
   to recover from this, I will be ready to move to production.
5. Wondering if you'd be willing to have a look at my setup through a
   shared screen?
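
CLI sketch for point 1 above (removing a whole replica set with data
migration), as far as I understand the equivalent of that pop-up; the brick
paths are the ones shown in the pop-up, and this kicks off the same data
migration, so it only avoids the UI, not the cost:

   gluster volume remove-brick vmstore1 \
       vmm11:/gluster_bricks/vmstore1/vmstore1 \
       vmm12.virt.iad3p:/gluster_bricks/vmstore1/vmstore1 \
       192.168.0.100:/gluster-bricks/vmstore1/vmstore1 start
   # watch the rebalance until every node reports 'completed'
   gluster volume remove-brick vmstore1 \
       vmm11:/gluster_bricks/vmstore1/vmstore1 \
       vmm12.virt.iad3p:/gluster_bricks/vmstore1/vmstore1 \
       192.168.0.100:/gluster-bricks/vmstore1/vmstore1 status
   # only then commit the removal
   gluster volume remove-brick vmstore1 \
       vmm11:/gluster_bricks/vmstore1/vmstore1 \
       vmm12.virt.iad3p:/gluster_bricks/vmstore1/vmstore1 \
       192.168.0.100:/gluster-bricks/vmstore1/vmstore1 commit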
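
And the vdsm.id step from point 2, i.e. generating a fresh host UUID so oVirt
stops treating the reinstalled box as a duplicate; this is only a sketch of
what the linked blog post describes, so double-check against the post before
running it:

   uuidgen > /etc/vdsm/vdsm.id
   # restart vdsm so the new UUID is picked up
   systemctl restart vdsmd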
Thanks,
Adrian
On Mon, Jun 10, 2019 at 11:41 PM Strahil <hunter86_bg(a)yahoo.com> wrote:
> Hi Adrian,
>
> You have several options:
> A) If you have space on another gluster volume (or volumes) or on
> NFS-based storage, you can migrate all VMs live. Once you do that, the
> simple way will be to stop and remove the storage domain (from the UI) and
> the gluster volume that corresponds to the problematic brick. Once gone, you
> can remove the entry in oVirt for the old host and add the newly built
> one. Then you can recreate your volume and migrate the data back.
>
> B) If you don't have space, you have to use a riskier approach
> (usually it shouldn't be risky, but I had a bad experience with gluster v3):
> - New server has same IP and hostname:
> Use command line and run the 'gluster volume reset-brick VOLNAME
> HOSTNAME:BRICKPATH HOSTNAME:BRICKPATH commit'
> Replace VOLNAME with your volume name.
> A more practical example would be:
> 'gluster volume reset-brick data ovirt3:/gluster_bricks/data/brick
> ovirt3:/gluster_bricks/data/brick commit'
>
> If it refuses, then you have to cleanup '/gluster_bricks/data' (which
> should be empty).
> Also check whether the new peer has been probed via 'gluster peer
> status'. Check that the firewall is allowing gluster communication (you can
> compare it to the firewall on another gluster host).
>
> The automatic healing will kick in within 10 minutes (if it succeeds) and
> will stress the other 2 replicas, so pick your time properly.
> Note: I'm not recommending you to use the 'force' option in the previous
> command ... for now :)
>
> - The new server has a different IP/hostname:
> Instead of 'reset-brick' you can use 'replace-brick':
> It should be like this:
> gluster volume replace-brick data old-server:/path/to/brick
> new-server:/new/path/to/brick commit force
>
> In both cases check the status via:
> gluster volume info VOLNAME
>
> If your cluster is in production, I really recommend the first
> option, as it is less risky and the chance of unplanned downtime will be
> minimal.
>
> The 'reset-brick' error in your previous e-mail shows that one of the servers
> is not connected. Check peer status on all servers; if there are fewer peers
> than there should be, check for network and/or firewall issues.
> On the new node check if glusterd is enabled and running.
>
> In order to debug - you should provide more info like 'gluster volume
> info' and the peer status from each node.
>
> Best Regards,
> Strahil Nikolov
>
> On Jun 10, 2019 20:10, Adrian Quintero <adrianquintero(a)gmail.com> wrote:
>
> >
> > Can you let me know how to fix the gluster and the missing brick?
> > I tried removing it by going to "Storage > Volumes > vmstore > Bricks"
> > and selecting the brick.
> > However it is showing as an unknown status (which is expected because
> > the server was completely wiped), so if I try to "remove", "replace
> > brick" or "reset brick" it won't work.
> > If I do remove brick: Incorrect bricks selected for removal in
> > Distributed Replicate volume. Either all the selected bricks should be
> > from the same sub volume or one brick each for every sub volume!
> > If I try "replace brick" I can't, because I don't have another server
> > with extra bricks/disks.
> > And if I try "reset brick": Error while executing action Start Gluster
> > Volume Reset Brick: Volume reset brick commit force failed: rc=-1 out=()
> > err=['Host myhost1_mydomain_com not connected']
> >
> > Are you suggesting to try and fix the gluster using command line?
> >
> > Note that I can't "peer detach" the server, so if I force the removal
> > of the bricks, would I need to force a downgrade to replica 2 instead
> > of 3? What would happen to oVirt, as it only supports replica 3?
> >
> > thanks again.
> >
> > On Mon, Jun 10, 2019 at 12:52 PM Strahil <hunter86_bg(a)yahoo.com> wrote:
>
> >>
> >> Hi Adrian,
> >> Did you fix the issue with the gluster and the missing brick?
> >> If yes, try to set the 'old' host in maintenance an
>
>
--
Adrian Quintero