From: "Boyan Tabakov" <blade(a)alslayer.net>
To: "Nir Soffer" <nsoffer(a)redhat.com>
Cc: users(a)ovirt.org
Sent: Wednesday, March 5, 2014 3:38:25 PM
Subject: Re: [Users] SD Disk's Logical Volume not visible/activated on some nodes
Hello Nir,
On Wed Mar 5 14:37:17 2014, Nir Soffer wrote:
> ----- Original Message -----
>> From: "Boyan Tabakov" <blade(a)alslayer.net>
>> To: "Nir Soffer" <nsoffer(a)redhat.com>
>> Cc: users(a)ovirt.org
>> Sent: Tuesday, March 4, 2014 3:53:24 PM
>> Subject: Re: [Users] SD Disk's Logical Volume not visible/activated on
>> some nodes
>>
>> On Tue Mar 4 14:46:33 2014, Nir Soffer wrote:
>>> ----- Original Message -----
>>>> From: "Nir Soffer" <nsoffer(a)redhat.com>
>>>> To: "Boyan Tabakov" <blade(a)alslayer.net>
>>>> Cc: users(a)ovirt.org, "Zdenek Kabelac" <zkabelac(a)redhat.com>
>>>> Sent: Monday, March 3, 2014 9:39:47 PM
>>>> Subject: Re: [Users] SD Disk's Logical Volume not visible/activated on
>>>> some nodes
>>>>
>>>> Hi Zdenek, can you look into this strange incident?
>>>>
>>>> When a user creates a disk on one host (creating a new lv), the lv is
>>>> not seen on another host in the cluster.
>>>>
>>>> Calling multipath -r causes the new lv to appear on the other host.
>>>>
>>>> Finally, lvs tells us that vg_mda_free is zero - maybe unrelated, but
>>>> unusual.
>>>>
>>>> ----- Original Message -----
>>>>> From: "Boyan Tabakov" <blade(a)alslayer.net>
>>>>> To: "Nir Soffer" <nsoffer(a)redhat.com>
>>>>> Cc: users(a)ovirt.org
>>>>> Sent: Monday, March 3, 2014 9:51:05 AM
>>>>> Subject: Re: [Users] SD Disk's Logical Volume not visible/activated on
>>>>> some nodes
>>>>>>>>>>> Consequently, when creating/booting a VM with the said disk
>>>>>>>>>>> attached, the VM fails to start on host2, because host2 can't
>>>>>>>>>>> see the LV. Similarly, if the VM is started on host1, it fails
>>>>>>>>>>> to migrate to host2. An extract from the host2 log is at the
>>>>>>>>>>> end. The LV in question is 6b35673e-7062-4716-a6c8-d5bf72fe3280.
>>>>>>>>>>>
>>>>>>>>>>> As far as I could quickly track in the vdsm code, there is only
>>>>>>>>>>> a call to lvs and not to lvscan or lvchange, so the LVM state on
>>>>>>>>>>> host2 doesn't fully refresh.
>>>>>>
>>>>>> lvs should see any change on the shared storage.
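
For reference, one way to double-check that from the other host while
bypassing any local metadata cache (use_lvmetad is a standard lvm.conf
option; whether the cache is involved here is only a guess at this point):

    lvs --config 'global {use_lvmetad=0}' -o vg_name,lv_name,tags

If the new lv shows up with the cache bypassed but not with a plain lvs,
that would point at stale cached metadata rather than at the storage itself.
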
>>>>>>
>>>>>>>>>>> The only workaround so far has been to restart VDSM on host2,
>>>>>>>>>>> which makes it refresh all LVM data properly.
>>>>>>
>>>>>> When vdsm starts, it calls multipath -r, which ensures that we see
>>>>>> all physical volumes.
>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> When is host2 supposed to pick up any newly created LVs in the
>>>>>>>>>>> SD VG? Any suggestions where the problem might be?
>>>>>>>>>>
>>>>>>>>>> When you create a new lv on the shared storage, the new lv should
>>>>>>>>>> be visible on the other host. Let's start by verifying that you
>>>>>>>>>> do see the new lv after a disk was created.
>>>>>>>>>>
>>>>>>>>>> Try this:
>>>>>>>>>>
>>>>>>>>>> 1. Create a new disk, and check the disk uuid in the engine UI.
>>>>>>>>>> 2. On another machine, run this command:
>>>>>>>>>>
>>>>>>>>>> lvs -o vg_name,lv_name,tags
>>>>>>>>>>
>>>>>>>>>> You can identify the new lv using tags, which should contain the
>>>>>>>>>> new disk uuid.
>>>>>>>>>>
>>>>>>>>>> If you don't see the new lv from the other host, please provide
>>>>>>>>>> /var/log/messages and /var/log/sanlock.log.
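
As an aside, a quick way to filter that output for the new disk, assuming
the disk uuid from the engine shows up in the IU_ tag (as in the output
further down; <disk-uuid> is just a placeholder):

    lvs -o vg_name,lv_name,tags | grep <disk-uuid>
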
>>>>>>>>>
>>>>>>>>> Just tried that. The disk is not visible on the non-SPM node.
>>>>>>>>
>>>>>>>> This means that storage is not accessible from this host.
>>>>>>>
>>>>>>> Generally, the storage seems accessible. For example, if I restart
>>>>>>> vdsmd, all volumes get picked up correctly (they become visible in
>>>>>>> lvs output and VMs can be started with them).
>>>>>>
>>>>>> Let's repeat this test, but now, if you do not see the new lv, please
>>>>>> run:
>>>>>>
>>>>>> multipath -r
>>>>>>
>>>>>> And report the results.
>>>>>>
>>>>>
>>>>> Running multipath -r helped and the disk was properly picked up by the
>>>>> second host.
>>>>>
>>>>> Is running multipath -r safe while the host is not in maintenance mode?
>>>>
>>>> It should be safe; vdsm uses it in some cases.
>>>>
>>>>> If yes, as a temporary workaround I can patch vdsmd to run multipath -r
>>>>> when e.g. monitoring the storage domain.
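
Purely as an illustration of that stopgap idea - this is not vdsm code,
just a standalone sketch that could be run (or cron'ed) on the affected
host until a proper fix is in place:

    # rescan multipath maps every 5 minutes so newly created LVs on the
    # shared VG become visible without restarting vdsmd
    while true; do
        /usr/sbin/multipath -r >/dev/null 2>&1
        sleep 300
    done
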
>>>>
>>>> I suggested running multipath as a debugging aid; normally this is not
>>>> needed.
>>>>
>>>> You should see the lv on the shared storage without running multipath.
>>>>
>>>> Zdenek, can you explain this?
>>>>
>>>>>>> One warning that I keep seeing in vdsm logs on both nodes is this:
>>>>>>>
>>>>>>> Thread-1617881::WARNING::2014-02-24
>>>>>>> 16:57:50,627::sp::1553::Storage.StoragePool::(getInfo) VG
>>>>>>> 3307f6fa-dd58-43db-ab23-b1fb299006c7's metadata size exceeded
>>>>>>> critical size: mdasize=134217728 mdafree=0
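
One way to cross-check those numbers directly with LVM (vg_mda_size and
vg_mda_free are standard vgs/lvs fields; whether they line up exactly with
what vdsm reports here is an assumption):

    vgs -o vg_name,vg_mda_size,vg_mda_free,vg_mda_count 3307f6fa-dd58-43db-ab23-b1fb299006c7
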
>>>>>>
>>>>>> Can you share the output of the command below?
>>>>>>
>>>>>> lvs -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>>>>
>>>>> Here's the output for both hosts.
>>>>>
>>>>> host1:
>>>>> [root@host1 ~]# lvs -o uuid,name,attr,size,vg_free,vg_extent_size,vg_extent_count,vg_free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count
>>>>> LV UUID                                 LV                                    Attr      LSize VFree   Ext     #Ext Free LV Tags                                                                               VMdaSize VMdaFree #LV #PV
>>>>> jGEpVm-oPW8-XyxI-l2yi-YF4X-qteQ-dm8SqL 3d362bf2-20f4-438d-9ba9-486bd2e8cedf -wi-ao--- 2.00g 114.62g 128.00m 1596 917 IU_0227da98-34b2-4b0c-b083-d42e7b760036,MD_5,PU_f4231952-76c5-4764-9c8b-ac73492ac465 128.00m 0 13 2
>>>>
>>>> This looks wrong - your vg_mda_free is zero - as vdsm complains.
>
> Patch http://gerrit.ovirt.org/25408 should solve this issue.
>
> It may also solve the other issue with the missing lv - I could
> not reproduce it yet.
>
> Can you try to apply this patch and report the results?
>
> Thanks,
> Nir
This patch helped, indeed! I tried it on the non-SPM node (as that's
the node that I can currently easily put in maintenance) and the node
started picking up newly created volumes correctly. I also set
use_lvmetad to 0 in the main lvm.conf, because without it, manually
running e.g. lvs was still using the metadata daemon.
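
For reference, the change amounts to the following in /etc/lvm/lvm.conf
(that path is the usual default; adjust if this setup keeps it elsewhere):

    global {
        # don't use the lvmetad metadata cache; always scan the shared
        # storage directly
        use_lvmetad = 0
    }
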
I can't confirm yet that this helps with the metadata volume warning,
as that warning appears only on the SPM. I'll be able to put the SPM
node in maintenance soon and will report later.
This issue on Fedora makes me think - is Fedora still a fully supported
platform?