On 01/29/2014 02:35 PM, Nicolas Ecarnot wrote:
On 29/01/2014 13:29, Maor Lipchuk wrote:
> Hi Nicolas,
>
> Can you please attach the VDSM logs of the problematic nodes and of the
> healthy nodes, the engine log, and also the sanlock log.
>
> You wrote that many nodes suddenly became
> unresponsive.
> Do you mean that the hosts switched to non-responsive status in the
> engine?
> I'm asking because a non-responsive status indicates that the engine
> could not communicate with the hosts. It could be related to sanlock:
> if a host has a problem writing to the master domain, sanlock restarts
> VDSM, which makes the host non-responsive.
Non-responsive for the engine means VDSM is not up/responsive.
Run locally:
# vdsClient -s 0 getVdsCaps
to check that VDSM is OK.
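The checks suggested in this thread can be combined into a small script run on a suspect host: ask VDSM for its capabilities (the same kind of call the engine relies on) and look for renewal errors in sanlock's log, which typically precede sanlock restarting VDSM. This is a sketch, not an official tool; the log path is the usual default, and everything is guarded so the script is a harmless no-op on machines without these components.

```shell
#!/bin/sh
# Quick health checks on a host the engine reports as non-responsive.
# Guarded so this is a no-op where VDSM/sanlock are not installed.

vdsm_status="unknown"
# Is VDSM answering locally? (this is what the engine effectively polls)
if command -v vdsClient >/dev/null 2>&1; then
    if vdsClient -s 0 getVdsCaps >/dev/null 2>&1; then
        vdsm_status="responsive"
    else
        vdsm_status="not responsive"
    fi
fi
echo "vdsm: $vdsm_status"

# Renewal/IO errors in sanlock's log usually precede VDSM being restarted.
sanlock_log=/var/log/sanlock.log
if [ -r "$sanlock_log" ]; then
    tail -n 50 "$sanlock_log" | grep -iE 'error|fail' \
        || echo "sanlock: no recent errors"
else
    echo "sanlock: no log at $sanlock_log"
fi
```

Running this on both a failing and a healthy node, and comparing the output, narrows down whether the problem is VDSM itself or the storage lease underneath it.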
>
> regards,
> Maor
It will be hard work to provide these logs, but I will try as soon as possible.
But to answer your question: the engine saw the failing nodes as
unresponsive, but I was always able to ping them and log in to them via SSH.
Is there somewhere I could read further documentation about sanlock?
Nicolas Ecarnot
>
> On 01/27/2014 09:26 AM, Nicolas Ecarnot wrote:
>> On 26/01/2014 23:23, Itamar Heim wrote:
>>> On 01/20/2014 12:06 PM, Nicolas Ecarnot wrote:
>>>> Hi,
>>>>
>>>> oVirt 3.3, no big issues since the recent snapshot joke; all in all
>>>> running fine.
>>>>
>>>> All my VMs are stored in an iSCSI SAN. The VMs usually use only one
>>>> or two disks (1: system, 2: data) and that is OK.
>>>>
>>>> On Friday, I created a new LUN. Inside a VM, I connected to it via
>>>> iscsiadm and successfully logged in to the LUN (session, automatic
>>>> attach on boot, read, write): nice.
>>>>
>>>> Then, after detaching it and shutting down the VM, and for the first
>>>> time, I tried to use the "direct attach" feature to attach the disk
>>>> directly from oVirt, logging in the session via oVirt.
>>>> It connected fine and I saw the disk appear in my VM as /dev/sda or
>>>> whatever. I was able to mount it, read and write.
>>>>
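The in-guest attach sequence described above (discover the target, log in, make the login automatic at boot) can be sketched with iscsiadm. The portal address below is a placeholder from the RFC 5737 documentation range, and the commands are guarded so the sketch is a no-op on machines without open-iscsi installed.

```shell
#!/bin/sh
# Sketch of the in-guest iSCSI attach steps; PORTAL is a placeholder.
PORTAL="192.0.2.10:3260"

if command -v iscsiadm >/dev/null 2>&1; then
    # 1. Discover the targets offered by the SAN portal
    iscsiadm -m discovery -t sendtargets -p "$PORTAL"
    # 2. Log in to the target(s) found on that portal (opens the session)
    iscsiadm -m node -p "$PORTAL" --login
    # 3. Re-attach automatically at boot
    iscsiadm -m node -p "$PORTAL" --op update -n node.startup -v automatic
else
    echo "iscsiadm not installed; commands shown for reference only"
fi
```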
>>>> Then disaster struck: many nodes suddenly became
>>>> unresponsive, quickly migrating their VMs to the remaining nodes.
>>>> Fortunately, the migrations ran fine and I lost no VMs and had no
>>>> downtime, but I had to reboot every affected node (other actions
>>>> failed).
>>>>
>>>> On the failing nodes, /var/log/messages showed the log you can read
>>>> at the end of this message.
>>>> I first get device-mapper warnings, then the host becomes unable to
>>>> work with the logical volumes.
>>>>
>>>> The three volumes are the three main storage domains, perfectly up
>>>> and running, where I store my oVirt VMs.
>>>>
>>>> My observations:
>>>> - I'm not sure device-mapper is to blame. I frequently see
>>>> device-mapper complaining while nothing gets worse (and not
>>>> specifically with oVirt).
>>>> - I have not changed my network settings for months (bonding,
>>>> linking...). The only new factor is the use of a direct-attach LUN.
>>>> - This morning I was able to reproduce the bug, just by trying this
>>>> attachment again and booting the VM. No mounting of the LUN, just
>>>> booting the VM and waiting: that is enough to crash oVirt.
>>>> - When the disaster happens, usually only three nodes get hit: the
>>>> only ones that run VMs. Obviously, after migration, different nodes
>>>> are hosting the VMs, and those new nodes are the ones that then get
>>>> hit.
>>>>
>>>> This is quite reproducible.
>>>>
>>>> And frightening.
>>>>
>>>>
>>>> The log:
>>>>
>>>> Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36:
>>>> multipath: error getting device
>>>> Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error
>>>> adding target to table
>>>> Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36:
>>>> multipath: error getting device
>>>> Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error
>>>> adding target to table
>>>> Jan 20 10:20:47 serv-vm-adm11 vdsm TaskManager.Task ERROR
>>>> Task=`847653e6-8b23-4429-ab25-257538b35293`::Unexpected error
>>>> Traceback (most recent call last):
>>>>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>>>>     return fn(*args, **kargs)
>>>>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>>>>     res = f(*args, **kwargs)
>>>>   File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize
>>>>     volUUID, bs=1))
>>>>   File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize
>>>>     mysd = sdCache.produce(sdUUID=sdUUID)
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
>>>>     domain.getRealDomain()
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
>>>>     return self._cache._realProduce(self._sdUUID)
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
>>>>     domain = self._findDomain(sdUUID)
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
>>>>     dom = findMethod(sdUUID)
>>>>   File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain
>>>>     return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
>>>>   File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__
>>>>     lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
>>>>   File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes
>>>>     raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)
>>>> VolumeGroupDoesNotExist: Volume Group does not exist:
>>>> ('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
>>>> Jan 20 10:20:47 serv-vm-adm11 vdsm vm.Vm ERROR
>>>> vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume
>>>> 80bac371-6899-4fbe-a8e1-272037186bfb (domain:
>>>> 1429ffe2-4137-416c-bb38-63fd73f4bcc1 image:
>>>> a5995c25-cdc9-4499-b9b4-08394a38165c) for the drive vda
>>>> Jan 20 10:20:48 serv-vm-adm11 vdsm TaskManager.Task ERROR
>>>> Task=`886e07bd-637b-4286-8a44-08dce5c8b207`::Unexpected error
>>>> Traceback (most recent call last):
>>>>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>>>>     return fn(*args, **kargs)
>>>>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>>>>     res = f(*args, **kwargs)
>>>>   File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize
>>>>     volUUID, bs=1))
>>>>   File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize
>>>>     mysd = sdCache.produce(sdUUID=sdUUID)
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
>>>>     domain.getRealDomain()
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
>>>>     return self._cache._realProduce(self._sdUUID)
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
>>>>     domain = self._findDomain(sdUUID)
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
>>>>     dom = findMethod(sdUUID)
>>>>   File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain
>>>>     return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
>>>>   File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__
>>>>     lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
>>>>   File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes
>>>>     raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)
>>>> VolumeGroupDoesNotExist: Volume Group does not exist:
>>>> ('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
>>>> Jan 20 10:20:48 serv-vm-adm11 vdsm vm.Vm ERROR
>>>> vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume
>>>> ea9c8f12-4eb6-42de-b6d6-6296555d0ac0 (domain:
>>>> 1429ffe2-4137-416c-bb38-63fd73f4bcc1 image:
>>>> f42e0c9d-ad1b-4337-b82c-92914153ff44) for the drive vdb
>>>> Jan 20 10:21:03 serv-vm-adm11 vdsm TaskManager.Task ERROR
>>>> Task=`27bb14f9-0cd1-4316-95b0-736d162d5681`::Unexpected error
>>>> Traceback (most recent call last):
>>>>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>>>>     return fn(*args, **kargs)
>>>>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>>>>     res = f(*args, **kwargs)
>>>>   File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize
>>>>     volUUID, bs=1))
>>>>   File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize
>>>>     mysd = sdCache.produce(sdUUID=sdUUID)
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
>>>>     domain.getRealDomain()
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
>>>>     return self._cache._realProduce(self._sdUUID)
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
>>>>     domain = self._findDomain(sdUUID)
>>>>   File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
>>>>     dom = findMethod(sdUUID)
>>>>   File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain
>>>>     return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
>>>>   File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__
>>>>     lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
>>>>   File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes
>>>>     raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)
>>>> VolumeGroupDoesNotExist: Volume Group does not exist:
>>>> ('vg_uuid: 83d39199-d4e4-474c-b232-7088c76a2811',)
>>>>
>>>>
>>>>
>>>
>>> was this diagnosed/resolved?
>>
>> - Diagnosed: I found no deeper way to diagnose this issue.
>> - Resolved: I neither found nor received any way to solve it.
>>
>