On 26/01/2014 23:23, Itamar Heim wrote:
On 01/20/2014 12:06 PM, Nicolas Ecarnot wrote:
> Hi,
>
> oVirt 3.3: no big issues since the recent snapshot incident, and all
> in all running fine.
>
> All my VMs are stored on an iSCSI SAN. Each VM usually uses only one
> or two disks (1: system, 2: data), and that works fine.
>
> On Friday, I created a new LUN. Inside a VM, I connected to it via
> iscsiadm and successfully logged in to the LUN (session, automatic
> attach on boot, read, write): nice.
>
> Then, after detaching it and shutting down the VM, I tried for the
> first time the "direct attach" feature: attaching the disk directly
> from oVirt, with oVirt logging in to the iSCSI session.
> The connection went fine and the disk appeared in my VM as /dev/sda or
> whatever. I was able to mount it, read, and write.
>
> Then disaster struck: many nodes suddenly became unresponsive,
> quickly migrating their VMs to the remaining nodes.
> Fortunately, the migrations ran fine: I lost no VM and had no
> downtime, but I had to reboot every affected node (other recovery
> actions failed).
>
> On the failing nodes, /var/log/messages showed the log you can read at
> the end of this message.
> I first get device-mapper warnings, then the host becomes unable to
> work with the logical volumes.
>
> The three volumes are the three main storage domains, perfectly up
> and running, where I store my oVirt VMs.
>
> My observations:
> - I'm not sure device-mapper is to blame. I frequently see
> device-mapper complaining without anything getting worse (and not
> specifically about oVirt).
> - I have not changed my network settings (bonding, linking...) for
> months. The only new factor is the use of the direct-attach LUN.
> - This morning I was able to reproduce the bug just by retrying this
> attachment and booting the VM. No mounting of the LUN: just booting
> the VM and waiting is enough to crash oVirt.
> - When the disaster happens, only the nodes that are running VMs
> (usually three of them) get hit. Obviously, after migration,
> different nodes host the VMs, and those new nodes are the ones that
> then get hit.
>
> This is quite reproducible.
>
> And frightening.
>
>
> The log:
>
> Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36: multipath: error getting device
> Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error adding target to table
> Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36: multipath: error getting device
> Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error adding target to table
> Jan 20 10:20:47 serv-vm-adm11 vdsm TaskManager.Task ERROR
> Task=`847653e6-8b23-4429-ab25-257538b35293`::Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize
>     volUUID, bs=1))
>   File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize
>     mysd = sdCache.produce(sdUUID=sdUUID)
>   File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
>     domain.getRealDomain()
>   File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
>     return self._cache._realProduce(self._sdUUID)
>   File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
>     domain = self._findDomain(sdUUID)
>   File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
>     dom = findMethod(sdUUID)
>   File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain
>     return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
>   File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__
>     lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
>   File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes
>     raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)
> VolumeGroupDoesNotExist: Volume Group does not exist:
> ('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
> Jan 20 10:20:47 serv-vm-adm11 vdsm vm.Vm ERROR
> vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume
> 80bac371-6899-4fbe-a8e1-272037186bfb (domain:
> 1429ffe2-4137-416c-bb38-63fd73f4bcc1 image:
> a5995c25-cdc9-4499-b9b4-08394a38165c) for the drive vda
> Jan 20 10:20:48 serv-vm-adm11 vdsm TaskManager.Task ERROR
> Task=`886e07bd-637b-4286-8a44-08dce5c8b207`::Unexpected error
> [same traceback as above]
> VolumeGroupDoesNotExist: Volume Group does not exist:
> ('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
> Jan 20 10:20:48 serv-vm-adm11 vdsm vm.Vm ERROR
> vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume
> ea9c8f12-4eb6-42de-b6d6-6296555d0ac0 (domain:
> 1429ffe2-4137-416c-bb38-63fd73f4bcc1 image:
> f42e0c9d-ad1b-4337-b82c-92914153ff44) for the drive vdb
> Jan 20 10:21:03 serv-vm-adm11 vdsm TaskManager.Task ERROR
> Task=`27bb14f9-0cd1-4316-95b0-736d162d5681`::Unexpected error
> [same traceback as above]
> VolumeGroupDoesNotExist: Volume Group does not exist:
> ('vg_uuid: 83d39199-d4e4-474c-b232-7088c76a2811',)
>
>
>
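A note on reading these logs: rsyslog flattens multi-line messages, replacing each embedded newline with the escape `#012`, which is why the vdsm tracebacks above arrive as single wrapped lines. A sed substitution restores the line breaks (this assumes GNU sed, which accepts `\n` in the replacement text); the sample line below is a shortened fragment of the log above, used only to demonstrate the filter.

```shell
#!/bin/sh
# rsyslog encodes embedded newlines as "#012"; undo that with sed.
sample='Unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line 857, in _run'

# Restore the original line breaks of the flattened traceback fragment.
printf '%s\n' "$sample" | sed 's/#012/\n/g'

# On a host, the same filter applies to the whole file, e.g.:
#   sed 's/#012/\n/g' /var/log/messages | less
```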
was this diagnosed/resolved?
- Diagnosed: no, I found no way to dig any deeper into this issue.
- Resolved: no, I neither found nor received a way to solve it.
--
Nicolas Ecarnot