I am new to this forum, so I apologize beforehand if I don't present the content you
are looking for or miss content you need.
Background:
By no means am I an expert with oVirt and GlusterFS. That said, I have been using,
managing, and building out oVirt (single hosts) and oVirt with Gluster Hyperconverged
environments for 5 years or more.
I started building out oVirt environments with oVirt Engine Version 3.6.7.5-1.el6 and
earlier, and now I'm using the latest oVirt with Gluster Hyperconverged.
Current hardware and software layout:
For the last 8 months I have been using an oVirt with Gluster Hyperconverged deployment
to host about 100 VMs in total.
My hardware layout in one environment is five Dell R410s: three of them configured with
Gluster Hyperconverged, and the other two are just added hosts. Below is a detailed list!
Manufacturer: Dell Inc. PowerEdge R410
CPU Model Name: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPU Cores per Socket: 4
CPU Type: Intel Westmere IBRS SSBD Family
Dell PERC H700
4 SAS Seagate 4 TB drives 7.2k
2 one-gig links – NIC 1 for the frontend and NIC 2 for the Gluster backend
My software layout is:
OS Version: RHEL - 7 - 7.1908.0.el7.centos
OS Description: CentOS Linux 7 (Core)
Kernel Version: 3.10.0 - 1062.9.1.el7.x86_64
KVM Version: 2.12.0 - 33.1.el7_7.4
LIBVIRT Version: libvirt-4.5.0-23.el7_7.3
VDSM Version: vdsm-4.30.38-1.el7
SPICE Version: 0.14.0 - 7.el7
GlusterFS Version: glusterfs-6.6-1.el7
CEPH Version: librbd1-10.2.5-4.el7
Open vSwitch Version: openvswitch-2.11.0-4.el7
Kernel Features: PTI: 1, IBRS: 0, RETP: 1, SSBD: 3
VNC Encryption: Disabled
My network layout is:
3 HP 3800-48G-4SFP+ Switch (J9576A) running FULL MESH
Issue/timeline:
• All 3 of the HP 3800s were rebooted at the same time and were down for 5 to 10 seconds
before they came back up (meaning pingable and responsive).
• A little more than 85% (36 or so) of the VMs I had running went into a paused state
due to an unknown storage error.
• The Gluster volume heal count went all the way up to 2300 on vmstore (the OS data location).
• After the heal completed on vmstore (it took about an hour), 85% of the VMs failed to
launch with an error (see below).
VM broadsort is down with error. Exit message: Bad volume specification
{'protocol': 'gluster', 'address': {'function':
'0x0', 'bus': '0x00', 'domain': '0x0000',
'type': 'pci', 'slot': '0x06'}, 'serial':
'b1bf3f56-a453-4383-a350-288bee06445b', 'index': 0, 'iface':
'virtio', 'apparentsize': '274877906944', 'specParams':
{}, 'cache': 'none', 'imageID':
'b1bf3f56-a453-4383-a350-288bee06445b', 'truesize':
'106767498240', 'type': 'disk', 'domainID':
'a7119613-a5ba-4a97-802b-0a985c647381', 'reqsize': '0',
'format': 'raw', 'poolID':
'699fd2d6-c461-11e9-8b83-00163e18a045', 'device': 'disk',
'path':
'vmstore/a7119613-a5ba-4a97-802b-0a985c647381/images/b1bf3f56-a453-4383-a350-288bee06445b/25b0ab77-8f4c-42a1-9416-27db4cd25b39',
'propagateErrors': 'off', 'name': 'vda',
'bootOrder': '1', 'volumeID':
'25b0ab77-8f4c-42a1-9416-27db4cd25b39', 'diskType': 'network',
'alias': 'ua-b1bf3f56-a453-4383-a350-288bee06445b', 'hosts':
[{'name': 'glust01.mydomain.local', 'port': '0'}],
'discard': False}.
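For reference, these are the kinds of commands I was running on one of the hyperconverged nodes to watch the heal and check the volume state (vmstore is my volume name; the quorum options shown are the standard Gluster ones, and I am not certain how mine are set, which may be part of my problem):

```shell
# Watch self-heal progress on the vmstore volume
# (this is the counter that climbed to ~2300 after the switch reboot)
gluster volume heal vmstore info summary

# Confirm all bricks are online and connected
gluster volume status vmstore

# Inspect the quorum settings -- on a replica-3 volume,
# cluster.quorum-type is normally 'auto' so that writes stop
# (rather than risk split-brain) when a majority of bricks
# becomes unreachable, e.g. during a network outage
gluster volume get vmstore cluster.quorum-type
gluster volume get vmstore cluster.server-quorum-type
```

These are diagnostic commands only, not a fix; output will obviously differ per deployment.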
Every one of the VMs had this same error, and I had to find backups and old images to bring
them back online. I deleted some of the corrupted VMs that I had current images of in order
to get them back up.
You shouldn't have to be afraid to reboot 1, 2, 3, or even all of your switches at once
because of a human error, a power outage, or a simple update, and then worry about your
VMs getting corrupted; that concerns me greatly. I am now thinking that I didn't set up
oVirt with Gluster Hyperconverged correctly, given this issue. Have I missed something in
the documentation or in my network layout/setup that would prevent this from happening
again? I have searched the web for a few days now trying to find threads related to my
situation, with no luck.
I want to thank you for your time and it is greatly appreciated!