I am new to this forum, so I apologize beforehand if I don’t present the content correctly or leave out information you need.
Background:
By no means am I an expert with oVirt and GlusterFS. That said, I have been using, managing, and building out oVirt (single hosts) and oVirt with Gluster Hyperconverged
environments for 5 years or more.
I started building out oVirt environments with oVirt Engine version 3.6.7.5-1.el6 and earlier, and I am now running the latest oVirt with Gluster Hyperconverged.
Current hardware and software layout:
For the last 8 months I have been using an oVirt with Gluster Hyperconverged deployment to host about 100 VMs in total.
My hardware layout in one environment is 5 Dell R410s: 3 of them configured with Gluster Hyperconverged and the other 2 just added as extra hosts. Below is a detailed list!
Manufacturer: Dell Inc. PowerEdge R410
CPU Model Name: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPU Cores per Socket: 4
CPU Type: Intel Westmere IBRS SSBD Family
Dell PERC H700
4 x 4 TB Seagate SAS drives, 7.2k RPM
2 x 1 Gb links – NIC 1 for the frontend and NIC 2 for the Gluster backend
My software layout is:
OS Version: RHEL - 7 - 7.1908.0.el7.centos
OS Description: CentOS Linux 7 (Core)
Kernel Version: 3.10.0 - 1062.9.1.el7.x86_64
KVM Version: 2.12.0 - 33.1.el7_7.4
LIBVIRT Version: libvirt-4.5.0-23.el7_7.3
VDSM Version: vdsm-4.30.38-1.el7
SPICE Version: 0.14.0 - 7.el7
GlusterFS Version: glusterfs-6.6-1.el7
CEPH Version: librbd1-10.2.5-4.el7
Open vSwitch Version: openvswitch-2.11.0-4.el7
Kernel Features: PTI: 1, IBRS: 0, RETP: 1, SSBD: 3
VNC Encryption: Disabled
My network layout is:
3 HP 3800-48G-4SFP+ switches (J9576A) running in a full mesh
Issue/timeline:
• All 3 of the HP 3800s were rebooted at the same time and were down for 5 to 10 seconds before they came back up (meaning pingable and responsive).
• A little more than 85% (36 or so) of the VMs I had running went into a paused state due to an unknown storage error.
• The Gluster volume heal count went all the way up to 2300 on vmstore (the OS data location).
• After the heal completed on vmstore (it took about an hour), 85% of the VMs failed to launch with an error (see below).
VM broadsort is down with error. Exit message: Bad volume specification
{'protocol': 'gluster',
 'address': {'function': '0x0', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'slot': '0x06'},
 'serial': 'b1bf3f56-a453-4383-a350-288bee06445b',
 'index': 0,
 'iface': 'virtio',
 'apparentsize': '274877906944',
 'specParams': {},
 'cache': 'none',
 'imageID': 'b1bf3f56-a453-4383-a350-288bee06445b',
 'truesize': '106767498240',
 'type': 'disk',
 'domainID': 'a7119613-a5ba-4a97-802b-0a985c647381',
 'reqsize': '0',
 'format': 'raw',
 'poolID': '699fd2d6-c461-11e9-8b83-00163e18a045',
 'device': 'disk',
 'path': 'vmstore/a7119613-a5ba-4a97-802b-0a985c647381/images/b1bf3f56-a453-4383-a350-288bee06445b/25b0ab77-8f4c-42a1-9416-27db4cd25b39',
 'propagateErrors': 'off',
 'name': 'vda',
 'bootOrder': '1',
 'volumeID': '25b0ab77-8f4c-42a1-9416-27db4cd25b39',
 'diskType': 'network',
 'alias': 'ua-b1bf3f56-a453-4383-a350-288bee06445b',
 'hosts': [{'name': 'glust01.mydomain.local', 'port': '0'}],
 'discard': False}.
Every one of the VMs had this same error, and I had to find backups and old images to bring them back online. I deleted some of the VMs that I had current images of in order to get them back up.
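In case it helps with diagnosis, below is a rough sketch of how one could check from a host whether an image referenced in that error is actually readable at the Gluster level, and whether the volume still has pending heals. This is only a sketch and assumes qemu-img was built with gluster:// support and the gluster CLI is installed; the hostname, volume, and image path are simply copied from the error above.

#!/usr/bin/env python
# Sketch only: check that the image VDSM complained about is readable over
# gluster://, and whether the vmstore volume has entries left to heal.
# Assumes a gluster-enabled qemu-img and the gluster CLI on the host.
import subprocess

GLUSTER_HOST = "glust01.mydomain.local"   # host listed in the error
VOLUME = "vmstore"
IMAGE_PATH = ("a7119613-a5ba-4a97-802b-0a985c647381/images/"
              "b1bf3f56-a453-4383-a350-288bee06445b/"
              "25b0ab77-8f4c-42a1-9416-27db4cd25b39")

# 1. Can qemu-img read the image header over the gluster:// protocol?
uri = "gluster://{0}/{1}/{2}".format(GLUSTER_HOST, VOLUME, IMAGE_PATH)
print(subprocess.check_output(["qemu-img", "info", uri]).decode())

# 2. Does the volume still report entries waiting to be healed?
heal = subprocess.check_output(
    ["gluster", "volume", "heal", VOLUME, "info", "summary"]).decode()
for line in heal.splitlines():
    if line.startswith("Brick") or "entries" in line:
        print(line)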
You shouldn’t have to be afraid to reboot 1, 2, 3, or even all of your switches at once because of human error, a power outage, or a simple update, and then worry about your VMs getting corrupted. That this happened concerns me greatly and makes me think I didn’t set up oVirt with Gluster Hyperconverged correctly. Have I missed something in the documentation or network layout that would prevent this from happening?
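For reference, a quick way to dump the quorum-related options on the volume (again just a sketch, assuming the gluster CLI is available on one of the hosts; as I understand it, the quorum settings influence whether writes continue or get blocked when bricks drop off the network) would be something like:

#!/usr/bin/env python
# Sketch only: list the quorum-related options set on the vmstore volume.
# Assumes the gluster CLI is installed on the host running this.
import subprocess

VOLUME = "vmstore"
options = subprocess.check_output(
    ["gluster", "volume", "get", VOLUME, "all"]).decode()
for line in options.splitlines():
    if "quorum" in line:
        print(line)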
I want to thank you for your time and it is greatly appreciated!