[Users] "Volume Group does not exist". Blame device-mapper ?

Hi,

oVirt 3.3, no big issue since the recent snapshot joke, but all in all running fine.

All my VMs are stored in an iSCSI SAN. The VMs usually use only one or two disks (1: system, 2: data) and that works fine.

On Friday, I created a new LUN. Inside a VM, I connected to it via iscsiadm and successfully logged in to the LUN (session, automatic attach on boot, read, write): nice.

Then, after detaching it and shutting down the VM, and for the first time, I tried to use the "direct attach" feature to attach the disk directly from oVirt, logging in to the session via oVirt. The connection worked and I saw the disk appear in my VM as /dev/sda or whatever. I was able to mount it, read and write.

Then disaster struck: many nodes suddenly became unresponsive, quickly migrating their VMs to the remaining nodes. Fortunately, the migrations ran fine and I lost no VMs and had no downtime, but I had to reboot every affected node (other actions failed).

On the failing nodes, /var/log/messages showed the log you can read at the end of this message. I first get device-mapper warnings, then the host is unable to work with the logical volumes. The three volume groups are the three main storage domains, perfectly up and running, where I store my oVirt VMs.

My reflections:
- I'm not sure device-mapper is to blame. I frequently see device-mapper complaining and nothing gets worse (not oVirt specifically).
- I have not changed my network settings for months (bonding, linking...). The only new factor is the use of a direct-attach LUN.
- This morning I was able to reproduce the bug just by trying this attachment again and booting the VM. No mounting of the LUN; just booting the VM and waiting is enough to crash oVirt.
- When the disaster happens, usually only three nodes get struck, the only ones running VMs. Obviously, after migration, different nodes are hosting the VMs, and those new nodes are the ones that then get struck.

This is quite reproducible. And frightening.
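(For context, the in-guest attachment described above is typically done with commands along these lines; the portal address and IQN here are placeholders, not values taken from this thread.)

# discover targets on the SAN portal
iscsiadm -m discovery -t sendtargets -p 192.0.2.10:3260
# log in to the target
iscsiadm -m node -T iqn.2014-01.com.example:lun1 -p 192.0.2.10:3260 --login
# make the session come back automatically at boot
iscsiadm -m node -T iqn.2014-01.com.example:lun1 -p 192.0.2.10:3260 --op update -n node.startup -v automatic
# later, log out again before switching to oVirt direct attach
iscsiadm -m node -T iqn.2014-01.com.example:lun1 -p 192.0.2.10:3260 --logout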
The log:

Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36: multipath: error getting device
Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error adding target to table
Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36: multipath: error getting device
Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error adding target to table
Jan 20 10:20:47 serv-vm-adm11 vdsm TaskManager.Task ERROR Task=`847653e6-8b23-4429-ab25-257538b35293`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize#012 volUUID, bs=1))#012 File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize#012 mysd = sdCache.produce(sdUUID=sdUUID)#012 File "/usr/share/vdsm/storage/sdc.py", line 98, in produce#012 domain.getRealDomain()#012 File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain#012 return self._cache._realProduce(self._sdUUID)#012 File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce#012 domain = self._findDomain(sdUUID)#012 File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain#012 dom = findMethod(sdUUID)#012 File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain#012 return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))#012 File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__#012 lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))#012 File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes#012 raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)#012VolumeGroupDoesNotExist: Volume Group does not exist: ('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
Jan 20 10:20:47 serv-vm-adm11 <11>vdsm vm.Vm ERROR vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume 80bac371-6899-4fbe-a8e1-272037186bfb (domain: 1429ffe2-4137-416c-bb38-63fd73f4bcc1 image: a5995c25-cdc9-4499-b9b4-08394a38165c) for the drive vda
Jan 20 10:20:48 serv-vm-adm11 vdsm TaskManager.Task ERROR Task=`886e07bd-637b-4286-8a44-08dce5c8b207`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize#012 volUUID, bs=1))#012 File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize#012 mysd = sdCache.produce(sdUUID=sdUUID)#012 File "/usr/share/vdsm/storage/sdc.py", line 98, in produce#012 domain.getRealDomain()#012 File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain#012 return self._cache._realProduce(self._sdUUID)#012 File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce#012 domain = self._findDomain(sdUUID)#012 File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain#012 dom = findMethod(sdUUID)#012 File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain#012 return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))#012 File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__#012 lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))#012 File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes#012 raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)#012VolumeGroupDoesNotExist: Volume Group does not exist: ('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
Jan 20 10:20:48 serv-vm-adm11 <11>vdsm vm.Vm ERROR vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume ea9c8f12-4eb6-42de-b6d6-6296555d0ac0 (domain: 1429ffe2-4137-416c-bb38-63fd73f4bcc1 image: f42e0c9d-ad1b-4337-b82c-92914153ff44) for the drive vdb
Jan 20 10:21:03 serv-vm-adm11 vdsm TaskManager.Task ERROR Task=`27bb14f9-0cd1-4316-95b0-736d162d5681`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize#012 volUUID, bs=1))#012 File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize#012 mysd = sdCache.produce(sdUUID=sdUUID)#012 File "/usr/share/vdsm/storage/sdc.py", line 98, in produce#012 domain.getRealDomain()#012 File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain#012 return self._cache._realProduce(self._sdUUID)#012 File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce#012 domain = self._findDomain(sdUUID)#012 File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain#012 dom = findMethod(sdUUID)#012 File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain#012 return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))#012 File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__#012 lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))#012 File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes#012 raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)#012VolumeGroupDoesNotExist: Volume Group does not exist: ('vg_uuid: 83d39199-d4e4-474c-b232-7088c76a2811',)

-- Nicolas Ecarnot
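(For readers hitting the same VolumeGroupDoesNotExist errors, a generic first check, not something suggested in this thread, is whether the host still sees the multipath devices and the storage-domain volume groups at all; standard multipath/LVM commands are enough for that, and the VG name below is the domain UUID taken from the log above.)

# list multipath devices and their paths; the storage domain LUNs should appear here
multipath -ll
# list physical volumes, volume groups and logical volumes as LVM sees them
pvs
vgs
lvs
# a block storage domain's VG is typically named after the domain UUID, so it can be checked directly
vgs 1429ffe2-4137-416c-bb38-63fd73f4bcc1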

On 01/20/2014 12:06 PM, Nicolas Ecarnot wrote:
[...]
was this diagnosed/resolved?

On 26/01/2014 23:23, Itamar Heim wrote:
[...]
was this diagnosed/resolved?
- Diagnosed: I discovered no deeper way to diagnose this issue.
- Resolved: I neither found nor received any further way to solve it.
-- Nicolas Ecarnot

Hi Nicolas,

Can you please attach the VDSM logs of the problematic nodes and of the valid nodes, the engine log, and also the sanlock log?

You wrote that many nodes suddenly began to become unresponsive. Do you mean that the hosts switched to non-responsive status in the engine? I'm asking because non-responsive status indicates that the engine could not communicate with the hosts. It could be related to sanlock, since if a host encounters a problem writing to the master domain, sanlock restarts VDSM, which makes the host non-responsive.

Regards,
Maor
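(As a practical note, the logs asked for above live in the usual oVirt 3.3 default locations; the paths below are those common defaults rather than something confirmed in this thread, and the engine log sits on the engine machine, not on the hosts.)

# on each host (problematic and healthy), grab the vdsm, sanlock and system logs
tar czf host-$(hostname)-logs.tgz /var/log/vdsm/vdsm.log* /var/log/sanlock.log /var/log/messages
# on the engine machine
tar czf engine-logs.tgz /var/log/ovirt-engine/engine.log*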
On 01/27/2014 09:26 AM, Nicolas Ecarnot wrote:
[...]

On 29/01/2014 13:29, Maor Lipchuk wrote:
[...]
It will be hard work to provide these logs, but I will try ASAP.

But to answer your question: the engine saw the failing nodes as unresponsive, but I was always fully able to ping them and log in to them via SSH.

Is there some place I could read further documentation about sanlock?

Nicolas Ecarnot

On 01/29/2014 02:35 PM, Nicolas Ecarnot wrote:
[...]
Non-responsive for the engine is about whether vdsm is up/responsive. Run locally:
# vdsClient -s 0 getVdsCaps
to check that vdsm is OK.
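(A minimal sketch of that check on a CentOS 6 era host, assuming the vdsm-cli package is installed; only the getVdsCaps call above comes from the thread, the rest is the standard tooling.)

# is the vdsm daemon running at all?
service vdsmd status
# query vdsm over its local socket; a healthy host returns its capabilities
vdsClient -s 0 getVdsCaps
# list the VMs vdsm thinks it is running
vdsClient -s 0 list table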

On 29/01/2014 13:36, Itamar Heim wrote:
[...]
Non-responsive for the engine is about whether vdsm is up/responsive. Run locally:
# vdsClient -s 0 getVdsCaps
to check that vdsm is OK.
When I find the time for it, I'll reproduce the crash, run this command, and let you know.

I must admit this was scary.

-- Nicolas Ecarnot

Hi,

Please see inline responses.

Regards,
Maor

On 01/29/2014 02:35 PM, Nicolas Ecarnot wrote:
[...]
It will be hard work to provide these logs, but I will try ASAP. But to answer your question: the engine saw the failing nodes as unresponsive, but I was always fully able to ping them and log in to them via SSH.

Sorry, I was not that clear. As Itamar wrote before, vdsClient -s 0 getVdsCaps should indicate whether the host is responsive or not. If the VDSM service is down or the host cannot be reached, then the host will be non-responsive as well.
Is there some place I could read further documentation about sanlock?

You can check man sanlock, or https://fedorahosted.org/sanlock/ and http://www.ovirt.org/SANLock
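(In addition to the docs, sanlock itself can be queried on a host; this is a generic sketch using the sanlock client tool, not commands taken from this thread.)

# show the lockspaces and resources sanlock currently holds on this host
sanlock client status
# dump sanlock's internal log buffer, useful when /var/log/sanlock.log is not enough
sanlock client log_dump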

Hi Nicolas,

Are you still able to reproduce this issue? Are you using Fedora or CentOS?

If providing the logs is problematic for you, could you try to ping me on IRC (fsimonce on #ovirt OFTC) so that we can work on the issue together?

Thanks,
-- Federico

----- Original Message -----
From: "Nicolas Ecarnot" <nicolas@ecarnot.net> To: "users" <users@ovirt.org> Sent: Monday, January 20, 2014 11:06:21 AM Subject: [Users] "Volume Group does not exist". Blame device-mapper ?
[...]

On 14/02/2014 15:39, Federico Simoncelli wrote:
[...]
Hi Federico,

Since I haven't changed anything related to the SAN or the network, I'm pretty sure I'll be able to reproduce the bug. We are using CentOS. I can provide the logs, no problem.

This week our oVirt setup will be heavily used, so this is not the best time to play with it. I'm very thankful you took the time to answer, but may I delay my answer about this bug to next week?

-- Nicolas Ecarnot

----- Original Message -----
From: "Nicolas Ecarnot" <nicolas@ecarnot.net> To: "Federico Simoncelli" <fsimonce@redhat.com> Cc: "users" <users@ovirt.org> Sent: Monday, February 17, 2014 10:14:56 AM Subject: Re: [Users] "Volume Group does not exist". Blame device-mapper ?
[...]
Ok, no problem. Feel free to contact me on IRC when you start testing.
-- Federico
Participants (4):
- Federico Simoncelli
- Itamar Heim
- Maor Lipchuk
- Nicolas Ecarnot