[Users] Gluster not aligned with double maintenance... problem

Hello,
One engine and two hosts, all with updated F19 (despite their names) and the oVirt updates-testing repo enabled. So I have 3.3.0.1-1 and vdsm-4.12.1-4; the kernel is 3.11.2-201.fc19.x86_64 (I have problems booting with the latest 3.11.4-201.fc19.x86_64).
The storage domain is configured with gluster as shipped in F19 (3.4.1-1.fc19.x86_64, recompiled to bind to ports 50152+) and distributed replicated bricks.

I perform this sequence of operations:
- power off all VMs (to start clean)
- put both hosts in maintenance
- shut down both hosts
- start up one host
- activate that host in the webadmin GUI; after a delay of about 2-3 minutes it comes up, with its own gluster copy active
- power on a VM and write 3 GB on it

[g.cecchi@c6s ~]$ sudo time dd if=/dev/zero bs=1024k count=3096 of=/testfile
3096+0 records in
3096+0 records out
3246391296 bytes (3.2 GB) copied, 42.3414 s, 76.7 MB/s
0.01user 7.99system 0:42.34elapsed 18%CPU (0avgtext+0avgdata 7360maxresident)k
0inputs+6352984outputs (0major+493minor)pagefaults 0swaps

Originally the gluster fs had 13 GB used, so now it has 16 GB (see /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata below):

[root@f18ovn01 ~]# df -h
Filesystem                                       Size  Used Avail Use% Mounted on
/dev/mapper/fedora-root                           15G  5.6G  8.1G  41% /
devtmpfs                                          24G     0   24G   0% /dev
tmpfs                                             24G  8.0K   24G   1% /dev/shm
tmpfs                                             24G  656K   24G   1% /run
tmpfs                                             24G     0   24G   0% /sys/fs/cgroup
tmpfs                                             24G     0   24G   0% /tmp
/dev/mapper/3600508b1001037414d4b3039383300021   477M  103M  345M  23% /boot
/dev/mapper/fedora-ISO_GLUSTER                    10G   33M   10G   1% /gluster/ISO_GLUSTER
/dev/mapper/fedora-DATA_GLUSTER                   30G   16G   15G  52% /gluster/DATA_GLUSTER
f18ovn01.mydomain:gvdata                          30G   16G   15G  52% /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata

- power on the second host
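(Once both hosts are up, this is the kind of quick check I use to compare the brick usage on the two nodes; just a rough sketch, using the hostnames and brick path of my setup:)

# compare the DATA brick usage on both nodes, run from any machine with ssh access
for h in f18ovn01.mydomain f18ovn03.mydomain; do
    echo "== $h =="
    ssh root@"$h" df -h /gluster/DATA_GLUSTER
done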
From a gluster point of view it seems OK:

[root@f18ovn03 glusterfs]# gluster volume status
Status of volume: gviso
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick f18ovn01.mydomain:/gluster/ISO_GLUSTER/brick1     50153   Y       1314
Brick f18ovn03.mydomain:/gluster/ISO_GLUSTER/brick1     50153   Y       1275
NFS Server on localhost                                 2049    Y       1288
Self-heal Daemon on localhost                           N/A     Y       1295
NFS Server on 192.168.3.1                               2049    Y       1328
Self-heal Daemon on 192.168.3.1                         N/A     Y       1335

There are no active volume tasks

Status of volume: gvdata
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick f18ovn01.mydomain:/gluster/DATA_GLUSTER/brick1    50152   Y       1313
Brick f18ovn03.mydomain:/gluster/DATA_GLUSTER/brick1    50152   Y       1280
NFS Server on localhost                                 2049    Y       1288
Self-heal Daemon on localhost                           N/A     Y       1295
NFS Server on 192.168.3.1                               2049    Y       1328
Self-heal Daemon on 192.168.3.1                         N/A     Y       1335

There are no active volume tasks

But actually I don't see any network sync, and the fs in fact remains at 13 GB:

[root@f18ovn03 ~]# df -h
Filesystem                                       Size  Used Avail Use% Mounted on
/dev/mapper/fedora-root                           15G  4.7G  8.9G  35% /
devtmpfs                                          16G     0   16G   0% /dev
tmpfs                                             16G     0   16G   0% /dev/shm
tmpfs                                             16G  560K   16G   1% /run
tmpfs                                             16G     0   16G   0% /sys/fs/cgroup
tmpfs                                             16G     0   16G   0% /tmp
/dev/mapper/3600508b1001037424d4b3035343800031   477M  103M  345M  23% /boot
/dev/mapper/fedora-DATA_GLUSTER                   30G   13G   18G  44% /gluster/DATA_GLUSTER
/dev/mapper/fedora-ISO_GLUSTER                    10G   33M   10G   1% /gluster/ISO_GLUSTER

I wait some minutes but nothing changes.

- I activate this second host from the webadmin GUI and it comes up in the Up state

But it is actually not synced from a storage point of view, so in my opinion it should not come up. Now I see on it:

[root@f18ovn03 bricks]# df -h
Filesystem                                       Size  Used Avail Use% Mounted on
/dev/mapper/fedora-root                           15G  4.7G  8.9G  35% /
devtmpfs                                          16G     0   16G   0% /dev
tmpfs                                             16G     0   16G   0% /dev/shm
tmpfs                                             16G  588K   16G   1% /run
tmpfs                                             16G     0   16G   0% /sys/fs/cgroup
tmpfs                                             16G     0   16G   0% /tmp
/dev/mapper/3600508b1001037424d4b3035343800031   477M  103M  345M  23% /boot
/dev/mapper/fedora-DATA_GLUSTER                   30G   13G   18G  44% /gluster/DATA_GLUSTER
/dev/mapper/fedora-ISO_GLUSTER                    10G   33M   10G   1% /gluster/ISO_GLUSTER
f18ovn01.mydomain:gvdata                          30G   13G   18G  44% /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata

So the /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata view is also incorrect (13 GB instead of 16 GB).

I think oVirt should have some sort of detection for this and avoid the activation, or at least raise a warning, because in case of f18ovn01 maintenance this host could not safely become SPM. For example oVirt could check the heal info, along the lines of the sketch below.
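(A rough shell sketch of what I mean, not real oVirt code, only an illustration that sums the heal counters and could gate the activation:)

#!/bin/bash
# sketch: warn if a replicated volume still has entries waiting to be healed
VOL=gvdata    # the data volume of my setup
pending=$(gluster volume heal "$VOL" info | awk '/^Number of entries:/ {s += $NF} END {print s+0}')
if [ "$pending" -gt 0 ]; then
    echo "WARNING: $VOL still has $pending entries to heal; activation should wait"
    exit 1
fi
echo "$VOL looks in sync"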
Normally I see no heal entries for healthy volumes. E.g. on a gluster volume (gviso) in this same cluster, planned to be used for ISOs and still without data, I get:

[root@f18ovn03 bricks]# gluster volume heal gviso info
Gathering Heal info on volume gviso has been successful

Brick f18ovn01.mydomain:/gluster/ISO_GLUSTER/brick1
Number of entries: 0

Brick f18ovn03.mydomain:/gluster/ISO_GLUSTER/brick1
Number of entries: 0

Instead, on this gvdata volume I now get:

[root@f18ovn03 bricks]# gluster volume heal gvdata info
Gathering Heal info on volume gvdata has been successful

Brick f18ovn01.mydomain:/gluster/DATA_GLUSTER/brick1
Number of entries: 6
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/images/15f9ca1c-c435-4892-9eb7-0c84583b2a7d/a123801a-0a4d-4a47-a426-99d8480d2e49
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/images/a5e4f67b-50b5-4740-9990-39deb8812445/53408cb0-bcd4-40de-bc69-89d59b7b5bc2
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/leases
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/metadata

Brick f18ovn03.mydomain:/gluster/DATA_GLUSTER/brick1
Number of entries: 5
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
<gfid:59a4d113-5881-4147-95c4-b5dee9872ad3>
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md
<gfid:db9985f8-117e-4340-98b9-a62bd963f6bf>
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/metadata

What do you think about this? What is the recommended gluster command to fix f18ovn03? Let me know which kind of oVirt/gluster logs could help.

Thanks in advance,
Gianluca

On Thu, Oct 17, 2013 at 5:45 PM, Gianluca Cecchi wrote:
Hello, One engine and two hosts, all with updated F19 (despite their names) and the oVirt updates-testing repo enabled. So I have 3.3.0.1-1 and vdsm-4.12.1-4; the kernel is 3.11.2-201.fc19.x86_64 (problems booting with the latest 3.11.4-201.fc19.x86_64). The storage domain is configured with gluster as shipped in F19 (3.4.1-1.fc19.x86_64, recompiled to bind to ports 50152+) and distributed replicated bricks.
I perform this sequence of operations:
- power off all VMs (to start clean)
- put both hosts in maintenance
- shut down both hosts
- start up one host
- activate that host in the webadmin GUI; after a delay of about 2-3 minutes it comes up, with its own gluster copy active
- power on a VM and write 3 GB on it
[snip]
- power on the second host
I missed the iptables rules: the file contained only

# Ports for gluster volume bricks (default 100 ports)
-A INPUT -p tcp -m tcp --dport 24009:24108 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 49152:49251 -j ACCEPT

so I had some gluster communication problems too. For the moment, until libvirt for Fedora 19 gets patched as upstream, I updated it to

# Ports for gluster volume bricks (default 100 ports)
-A INPUT -p tcp -m tcp --dport 24009:24108 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 49152:49251 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 50152:50251 -j ACCEPT

so that I can test both gluster and live migration. I'm going to retest completely and see how it behaves.

Gianluca
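PS: this is roughly how I check from the other host that the brick ports are no longer filtered (just a sketch; the ports are the ones reported by "gluster volume status" on my setup):

# run on f18ovn03: verify the brick ports published by f18ovn01 are reachable
for p in 50152 50153; do
    if timeout 2 bash -c "</dev/tcp/f18ovn01.mydomain/$p" 2>/dev/null; then
        echo "port $p reachable"
    else
        echo "port $p NOT reachable"
    fi
done
# and confirm the new rules are actually loaded
iptables -S INPUT | grep 5015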