[Users] "Volume Group does not exist". Blame device-mapper ?

20 Jan 2014

Hi,

I am running oVirt 3.3. No big issue since the recent snapshot joke, and 
all in all it is running fine.

All my VMs are stored on an iSCSI SAN. The VMs usually use only one 
or two disks (1: system, 2: data), and that works fine.

On Friday, I created a new LUN. From inside a VM, I connected to it with 
iscsiadm and successfully logged in to the LUN (session, automatic attach 
on boot, read, write): nice.

Then, after detaching it and shutting down the VM, I tried for the first 
time to use the "direct attach" feature to attach the disk directly from 
oVirt, logging in to the session via oVirt.
The connection went fine and I saw the disk appear in my VM as /dev/sda or 
whatever. I was able to mount it, read and write.

Then disaster struck: many nodes suddenly became unresponsive, quickly 
migrating their VMs to the remaining nodes.
Fortunately, the migrations ran fine and I lost no VM and had no downtime, 
but I had to reboot every affected node (other actions failed).

On the failing nodes, /var/log/messages showed the log you can read at 
the end of this message.
I first get device-mapper warnings, then the host becomes unable to work 
with the logical volumes.

The 3 volumes are my three main storage domains, perfectly up and 
running, where I store my oVirt VMs.

My thoughts:
- I'm not sure device-mapper is to blame. I frequently see device-mapper 
complaining without anything getting worse (not specifically with oVirt).
- I have not changed my network settings for months (bonding, linking...). 
The only new factor is the use of the direct attach LUN.
- This morning I was able to reproduce the bug just by trying this 
attachment again and booting the VM. No mounting of the LUN, just booting 
the VM and waiting, and this is enough to crash oVirt.
- When the disaster happens, usually only three of the nodes get struck: 
the only ones that run VMs. Obviously, after migration, different nodes 
are hosting the VMs, and those new nodes are the ones that get struck 
next.

This is quite reproducible.

And frightening.

The log:

Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36: 
multipath: error getting device
Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error adding 
target to table
Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36: 
multipath: error getting device
Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error adding 
target to table
Jan 20 10:20:47 serv-vm-adm11 vdsm TaskManager.Task ERROR 
Task=`847653e6-8b23-4429-ab25-257538b35293`::Unexpected 
error#012Traceback (most recent call last):#012  File 
"/usr/share/vdsm/storage/task.py", line 857, in _run#012    return 
fn(*args, **kargs)#012  File "/usr/share/vdsm/logUtils.py", line 45, in 
wrapper#012    res = f(*args, **kwargs)#012  File 
"/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize#012 
volUUID, bs=1))#012  File "/usr/share/vdsm/storage/volume.py", line 333, 
in getVSize#012    mysd = sdCache.produce(sdUUID=sdUUID)#012  File 
"/usr/share/vdsm/storage/sdc.py", line 98, in produce#012 
domain.getRealDomain()#012  File "/usr/share/vdsm/storage/sdc.py", line 
52, in getRealDomain#012    return 
self._cache._realProduce(self._sdUUID)#012  File 
"/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce#012 
domain = self._findDomain(sdUUID)#012  File 
"/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain#012    dom = 
findMethod(sdUUID)#012  File "/usr/share/vdsm/storage/blockSD.py", line 
1288, in findDomain#012    return 
BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))#012  File 
"/usr/share/vdsm/storage/blockSD.py", line 414, in __init__#012 
lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))#012 
File "/usr/share/vdsm/storage/lvm.py", line 976, in 
checkVGBlockSizes#012    raise se.VolumeGroupDoesNotExist("vg_uuid: %s" 
% vgUUID)#012VolumeGroupDoesNotExist: Volume Group does not exist: 
('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
Jan 20 10:20:47 serv-vm-adm11 <11>vdsm vm.Vm ERROR 
vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume 
80bac371-6899-4fbe-a8e1-272037186bfb (domain: 
1429ffe2-4137-416c-bb38-63fd73f4bcc1 image: 
a5995c25-cdc9-4499-b9b4-08394a38165c) for the drive vda
Jan 20 10:20:48 serv-vm-adm11 vdsm TaskManager.Task ERROR 
Task=`886e07bd-637b-4286-8a44-08dce5c8b207`::Unexpected 
error#012Traceback (most recent call last):#012  File 
"/usr/share/vdsm/storage/task.py", line 857, in _run#012    return 
fn(*args, **kargs)#012  File "/usr/share/vdsm/logUtils.py", line 45, in 
wrapper#012    res = f(*args, **kwargs)#012  File 
"/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize#012 
volUUID, bs=1))#012  File "/usr/share/vdsm/storage/volume.py", line 333, 
in getVSize#012    mysd = sdCache.produce(sdUUID=sdUUID)#012  File 
"/usr/share/vdsm/storage/sdc.py", line 98, in produce#012 
domain.getRealDomain()#012  File "/usr/share/vdsm/storage/sdc.py", line 
52, in getRealDomain#012    return 
self._cache._realProduce(self._sdUUID)#012  File 
"/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce#012 
domain = self._findDomain(sdUUID)#012  File 
"/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain#012    dom = 
findMethod(sdUUID)#012  File "/usr/share/vdsm/storage/blockSD.py", line 
1288, in findDomain#012    return 
BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))#012  File 
"/usr/share/vdsm/storage/blockSD.py", line 414, in __init__#012 
lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))#012 
File "/usr/share/vdsm/storage/lvm.py", line 976, in 
checkVGBlockSizes#012    raise se.VolumeGroupDoesNotExist("vg_uuid: %s" 
% vgUUID)#012VolumeGroupDoesNotExist: Volume Group does not exist: 
('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
Jan 20 10:20:48 serv-vm-adm11 <11>vdsm vm.Vm ERROR 
vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume 
ea9c8f12-4eb6-42de-b6d6-6296555d0ac0 (domain: 
1429ffe2-4137-416c-bb38-63fd73f4bcc1 image: 
f42e0c9d-ad1b-4337-b82c-92914153ff44) for the drive vdb
Jan 20 10:21:03 serv-vm-adm11 vdsm TaskManager.Task ERROR 
Task=`27bb14f9-0cd1-4316-95b0-736d162d5681`::Unexpected 
error#012Traceback (most recent call last):#012  File 
"/usr/share/vdsm/storage/task.py", line 857, in _run#012    return 
fn(*args, **kargs)#012  File "/usr/share/vdsm/logUtils.py", line 45, in 
wrapper#012    res = f(*args, **kwargs)#012  File 
"/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize#012 
volUUID, bs=1))#012  File "/usr/share/vdsm/storage/volume.py", line 333, 
in getVSize#012    mysd = sdCache.produce(sdUUID=sdUUID)#012  File 
"/usr/share/vdsm/storage/sdc.py", line 98, in produce#012 
domain.getRealDomain()#012  File "/usr/share/vdsm/storage/sdc.py", line 
52, in getRealDomain#012    return 
self._cache._realProduce(self._sdUUID)#012  File 
"/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce#012 
domain = self._findDomain(sdUUID)#012  File 
"/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain#012    dom = 
findMethod(sdUUID)#012  File "/usr/share/vdsm/storage/blockSD.py", line 
1288, in findDomain#012    return 
BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))#012  File 
"/usr/share/vdsm/storage/blockSD.py", line 414, in __init__#012 
lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))#012 
File "/usr/share/vdsm/storage/lvm.py", line 976, in 
checkVGBlockSizes#012    raise se.VolumeGroupDoesNotExist("vg_uuid: %s" 
% vgUUID)#012VolumeGroupDoesNotExist: Volume Group does not exist: 
('vg_uuid: 83d39199-d4e4-474c-b232-7088c76a2811',)
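
The tracebacks above are hard to read because syslog has folded each one 
onto a single line, writing every embedded newline as "#012" (octal 12 is 
the line feed character). Below is a minimal sketch, in Python, that 
unfolds such a message and lists the volume group UUIDs reported as 
missing. The function names unfold_syslog and missing_vg_uuids are 
hypothetical helpers written for this message, not part of vdsm, and the 
sketch assumes the mail-wrapped lines can simply be re-joined with spaces.

```python
import re
from typing import List

# Matches the UUID that follows "vg_uuid: " in the VolumeGroupDoesNotExist text.
VG_RE = re.compile(r"vg_uuid: ([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})")


def unfold_syslog(message: str) -> str:
    """Turn syslog's "#012" escapes back into real newlines."""
    return message.replace("#012", "\n")


def missing_vg_uuids(log_text: str) -> List[str]:
    """Return the distinct VG UUIDs named in the log, in first-seen order."""
    # Re-join lines that the mail client wrapped (assumption: wrapped at spaces).
    joined = " ".join(line.strip() for line in log_text.splitlines())
    seen: List[str] = []
    for uuid in VG_RE.findall(joined):
        if uuid not in seen:
            seen.append(uuid)
    return seen


if __name__ == "__main__":
    # Fragment copied from the log above.
    sample = (
        'File "/usr/share/vdsm/storage/lvm.py", line 976, in '
        'checkVGBlockSizes#012    raise se.VolumeGroupDoesNotExist("vg_uuid: %s" '
        "% vgUUID)#012VolumeGroupDoesNotExist: Volume Group does not exist: "
        "('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)"
    )
    print(unfold_syslog(sample))
    print(missing_vg_uuids(sample))
```

Run against the whole excerpt, this should make it easier to see which 
volume groups vdsm could no longer find on the host.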

-- 
Nicolas Ecarnot
