[Users] SD Disk's Logical Volume not visible/activated on some nodes

Boyan Tabakov blade at alslayer.net
Tue Feb 25 09:53:45 UTC 2014


Hello,

On 22.2.2014, 22:19, Nir Soffer wrote:
> ----- Original Message -----
>> From: "Boyan Tabakov" <blade at alslayer.net>
>> To: "Nir Soffer" <nsoffer at redhat.com>
>> Cc: users at ovirt.org
>> Sent: Wednesday, February 19, 2014 7:18:36 PM
>> Subject: Re: [Users] SD Disk's Logical Volume not visible/activated on some nodes
>>
>> Hello,
>>
>> On 19.2.2014, 17:09, Nir Soffer wrote:
>>> ----- Original Message -----
>>>> From: "Boyan Tabakov" <blade at alslayer.net>
>>>> To: users at ovirt.org
>>>> Sent: Tuesday, February 18, 2014 3:34:49 PM
>>>> Subject: [Users] SD Disk's Logical Volume not visible/activated on some
>>>> nodes
>>>>
>>>> Hello,
>>>>
>>>> I have ovirt 3.3 installed on on FC 19 hosts with vdsm 4.13.3-2.fc19.
>>>
>>> Which version of ovirt 3.3 is this? (3.3.2? 3.3.3?)
>>
>> ovirt-engine is 3.3.2-1.fc19
>>
>>>> One of the hosts (host1) is engine + node + SPM and the other host2 is
>>>> just a node. I have an iSCSI storage domain configured and accessible
>>>> from both nodes.
>>>>
>>>> When creating a new disk in the SD, the underlying logical volume gets
>>>> properly created (seen in vgdisplay output on host1), but doesn't seem
>>>> to be automatically picked by host2.
>>>
>>> How do you know it is not seen on host2?
>>
>> It's not present in the output of vgdisplay -v nor vgs.
>>
>>>
>>>> Consequently, when creating/booting
>>>> a VM with the said disk attached, the VM fails to start on host2,
>>>> because host2 can't see the LV. Similarly, if the VM is started on
>>>> host1, it fails to migrate to host2. Extract from host2 log is in the
>>>> end. The LV in question is 6b35673e-7062-4716-a6c8-d5bf72fe3280.
>>>>
>>>> As far as I could track quickly the vdsm code, there is only call to lvs
>>>> and not to lvscan or lvchange so the host2 LVM doesn't fully refresh.
>>>> The only workaround so far has been to restart VDSM on host2, which
>>>> makes it refresh all LVM data properly.
>>>>
>>>> When is host2 supposed to pick up any newly created LVs in the SD VG?
>>>> Any suggestions where the problem might be?
>>>
>>> When you create a new lv on the shared storage, the new lv should be
>>> visible on the other host. Lets start by verifying that you do see
>>> the new lv after a disk was created.
>>>
>>> Try this:
>>>
>>> 1. Create a new disk, and check the disk uuid in the engine ui
>>> 2. On another machine, run this command:
>>>
>>> lvs -o vg_name,lv_name,tags
>>>
>>> You can identify the new lv using tags, which should contain the new disk
>>> uuid.
>>>
>>> If you don't see the new lv from the other host, please provide
>>> /var/log/messages
>>> and /var/log/sanlock.log.
>>
>> Just tried that. The disk is not visible on the non-SPM node.
> 
> This means that storage is not accessible from this host.

Generally, the storage seems accessible ok. For example, if I restart
the vdsmd, all volumes get picked up correctly (become visible in lvs
output and VMs can be started with them).

> 
>>
>> On the SPM node (where the LV is visible in vgs output):
>>
>> Feb 19 19:10:43 host1 vdsm root WARNING File:
>> /rhev/data-center/61f15cc0-8bba-482d-8a81-cd636a581b58/3307f6fa-dd58-43db-ab23-b1fb299006c7/images/4d15543c-4c45-4c23-bbe3-f10b9084472a/3e0ce8cb-3740-49d7-908e-d025875ac9a2
>> already removed
>> Feb 19 19:10:45 host1 multipathd: dm-65: remove map (uevent)
>> Feb 19 19:10:45 host1 multipathd: dm-65: devmap not registered, can't remove
>> Feb 19 19:10:45 host1 multipathd: dm-65: remove map (uevent)
>> Feb 19 19:10:54 host1 kernel: [1652684.864746] dd: sending ioctl
>> 80306d02 to a partition!
>> Feb 19 19:10:54 host1 kernel: [1652684.963931] dd: sending ioctl
>> 80306d02 to a partition!
>>
>> No recent entries in sanlock.log on the SPM node.
>>
>> On the non-SPM node (the one that doesn't show the LV in vgs output),
>> there are no relevant entries in /var/log/messages.
> 
> Strange - sanlock errors are logged to /var/log/messages. It would be helpful if
> you attach this log - we may find something in it.

No entries appear in /var/log/messages, other than the quoted above
(Sorry, I didn't clarify that that was from /var/log/messages on host1).

> 
>> Here's the full
>> sanlock.log for that host:
>>
>> 2014-01-30 16:28:09+0200 1324 [2335]: sanlock daemon started 2.8 host
>> 18bd0a27-c280-4007-98f2-d2e7e73cd8b5.xenon.futu
>> 2014-01-30 16:59:51+0200 5 [609]: sanlock daemon started 2.8 host
>> 4a7627e2-296a-4e48-a7e2-f6bcecac07ab.xenon.futu
>> 2014-01-31 09:51:43+0200 60717 [614]: s1 lockspace
>> 3307f6fa-dd58-43db-ab23-b1fb299006c7:2:/dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids:0
>> 2014-01-31 16:03:51+0200 83045 [613]: s1:r1 resource
>> 3307f6fa-dd58-43db-ab23-b1fb299006c7:SDM:/dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/leases:1048576
>> for 8,16,30268
>> 2014-01-31 16:18:01+0200 83896 [614]: s1:r2 resource
>> 3307f6fa-dd58-43db-ab23-b1fb299006c7:SDM:/dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/leases:1048576
>> for 8,16,30268
>> 2014-02-06 05:24:10+0200 563065 [31453]: 3307f6fa aio timeout 0
>> 0x7fc37c0008c0:0x7fc37c0008d0:0x7fc391f5f000 ioto 10 to_count 1
>> 2014-02-06 05:24:10+0200 563065 [31453]: s1 delta_renew read rv -202
>> offset 0 /dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids
> 
> Sanlock cannot write to the ids lockspace

Which line shows that sanlock can't write? The messages are not very
"human readable".
> 
>> 2014-02-06 05:24:10+0200 563065 [31453]: s1 renewal error -202
>> delta_length 10 last_success 563034
>> 2014-02-06 05:24:21+0200 563076 [31453]: 3307f6fa aio timeout 0
>> 0x7fc37c000910:0x7fc37c000920:0x7fc391d5c000 ioto 10 to_count 2
>> 2014-02-06 05:24:21+0200 563076 [31453]: s1 delta_renew read rv -202
>> offset 0 /dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids
>> 2014-02-06 05:24:21+0200 563076 [31453]: s1 renewal error -202
>> delta_length 11 last_success 563034
>> 2014-02-06 05:24:32+0200 563087 [31453]: 3307f6fa aio timeout 0
>> 0x7fc37c000960:0x7fc37c000970:0x7fc391c5a000 ioto 10 to_count 3
>> 2014-02-06 05:24:32+0200 563087 [31453]: s1 delta_renew read rv -202
>> offset 0 /dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids
>> 2014-02-06 05:24:32+0200 563087 [31453]: s1 renewal error -202
>> delta_length 11 last_success 563034
>> 2014-02-06 05:24:40+0200 563094 [609]: s1 check_our_lease warning 60
>> last_success 563034
>> 2014-02-06 05:24:41+0200 563095 [609]: s1 check_our_lease warning 61
>> last_success 563034
>> 2014-02-06 05:24:42+0200 563096 [609]: s1 check_our_lease warning 62
>> last_success 563034
>> 2014-02-06 05:24:43+0200 563097 [609]: s1 check_our_lease warning 63
>> last_success 563034
>> 2014-02-06 05:24:43+0200 563098 [31453]: 3307f6fa aio timeout 0
>> 0x7fc37c0009b0:0x7fc37c0009c0:0x7fc391b58000 ioto 10 to_count 4
>> 2014-02-06 05:24:43+0200 563098 [31453]: s1 delta_renew read rv -202
>> offset 0 /dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids
>> 2014-02-06 05:24:43+0200 563098 [31453]: s1 renewal error -202
>> delta_length 11 last_success 563034
>> 2014-02-06 05:24:44+0200 563098 [609]: s1 check_our_lease warning 64
>> last_success 563034
>> 2014-02-06 05:24:45+0200 563099 [609]: s1 check_our_lease warning 65
>> last_success 563034
>> 2014-02-06 05:24:46+0200 563100 [609]: s1 check_our_lease warning 66
>> last_success 563034
>> 2014-02-06 05:24:47+0200 563101 [609]: s1 check_our_lease warning 67
>> last_success 563034
>> 2014-02-06 05:24:48+0200 563102 [609]: s1 check_our_lease warning 68
>> last_success 563034
>> 2014-02-06 05:24:49+0200 563103 [609]: s1 check_our_lease warning 69
>> last_success 563034
>> 2014-02-06 05:24:50+0200 563104 [609]: s1 check_our_lease warning 70
>> last_success 563034
>> 2014-02-06 05:24:51+0200 563105 [609]: s1 check_our_lease warning 71
>> last_success 563034
>> 2014-02-06 05:24:52+0200 563106 [609]: s1 check_our_lease warning 72
>> last_success 563034
>> 2014-02-06 05:24:53+0200 563107 [609]: s1 check_our_lease warning 73
>> last_success 563034
>> 2014-02-06 05:24:54+0200 563108 [609]: s1 check_our_lease warning 74
>> last_success 563034
>> 2014-02-06 05:24:54+0200 563109 [31453]: s1 delta_renew read rv -2
>> offset 0 /dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids
>> 2014-02-06 05:24:54+0200 563109 [31453]: s1 renewal error -2
>> delta_length 11 last_success 563034
>> 2014-02-06 05:24:55+0200 563109 [609]: s1 check_our_lease warning 75
>> last_success 563034
>> 2014-02-06 05:24:56+0200 563110 [609]: s1 check_our_lease warning 76
>> last_success 563034
>> 2014-02-06 05:24:57+0200 563111 [609]: s1 check_our_lease warning 77
>> last_success 563034
>> 2014-02-06 05:24:58+0200 563112 [609]: s1 check_our_lease warning 78
>> last_success 563034
>> 2014-02-06 05:24:59+0200 563113 [609]: s1 check_our_lease warning 79
>> last_success 563034
>> 2014-02-06 05:25:00+0200 563114 [609]: s1 check_our_lease failed 80
>> 2014-02-06 05:25:00+0200 563114 [609]: s1 all pids clear
>> 2014-02-06 05:25:05+0200 563119 [31453]: s1 delta_renew read rv -2
>> offset 0 /dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids
>> 2014-02-06 05:25:05+0200 563119 [31453]: s1 renewal error -2
>> delta_length 10 last_success 563034
>> 2014-02-06 05:25:15+0200 563129 [31453]: 3307f6fa close_task_aio 0
>> 0x7fc37c0008c0 busy
>> 2014-02-06 05:25:15+0200 563129 [31453]: 3307f6fa close_task_aio 1
>> 0x7fc37c000910 busy
>> 2014-02-06 05:25:15+0200 563129 [31453]: 3307f6fa close_task_aio 2
>> 0x7fc37c000960 busy
>> 2014-02-06 05:25:15+0200 563129 [31453]: 3307f6fa close_task_aio 3
>> 0x7fc37c0009b0 busy
>> 2014-02-06 05:25:25+0200 563139 [31453]: 3307f6fa close_task_aio 0
>> 0x7fc37c0008c0 busy
>> 2014-02-06 05:25:25+0200 563139 [31453]: 3307f6fa close_task_aio 1
>> 0x7fc37c000910 busy
>> 2014-02-06 05:25:25+0200 563139 [31453]: 3307f6fa close_task_aio 2
>> 0x7fc37c000960 busy
>> 2014-02-06 05:25:25+0200 563139 [31453]: 3307f6fa close_task_aio 3
>> 0x7fc37c0009b0 busy
>> 2014-02-06 05:25:35+0200 563149 [31453]: 3307f6fa close_task_aio 0
>> 0x7fc37c0008c0 busy
>> 2014-02-06 05:25:35+0200 563149 [31453]: 3307f6fa close_task_aio 1
>> 0x7fc37c000910 busy
>> 2014-02-06 05:25:35+0200 563149 [31453]: 3307f6fa close_task_aio 2
>> 0x7fc37c000960 busy
>> 2014-02-06 05:25:35+0200 563149 [31453]: 3307f6fa close_task_aio 3
>> 0x7fc37c0009b0 busy
>> 2014-02-06 05:25:45+0200 563159 [31453]: 3307f6fa close_task_aio 0
>> 0x7fc37c0008c0 busy
>> 2014-02-06 05:25:45+0200 563159 [31453]: 3307f6fa close_task_aio 1
>> 0x7fc37c000910 busy
>> 2014-02-06 05:25:45+0200 563159 [31453]: 3307f6fa close_task_aio 2
>> 0x7fc37c000960 busy
>> 2014-02-06 05:25:45+0200 563159 [31453]: 3307f6fa close_task_aio 3
>> 0x7fc37c0009b0 busy
>> 2014-02-06 05:25:55+0200 563169 [31453]: 3307f6fa close_task_aio 0
>> 0x7fc37c0008c0 busy
>> 2014-02-06 05:25:55+0200 563169 [31453]: 3307f6fa close_task_aio 1
>> 0x7fc37c000910 busy
>> 2014-02-06 05:25:55+0200 563169 [31453]: 3307f6fa close_task_aio 2
>> 0x7fc37c000960 busy
>> 2014-02-06 05:25:55+0200 563169 [31453]: 3307f6fa close_task_aio 3
>> 0x7fc37c0009b0 busy
>> 2014-02-06 05:25:59+0200 563173 [31453]: 3307f6fa aio collect
>> 0x7fc37c0008c0:0x7fc37c0008d0:0x7fc391f5f000 result 1048576:0 close free
>> 2014-02-06 05:25:59+0200 563173 [31453]: 3307f6fa aio collect
>> 0x7fc37c000910:0x7fc37c000920:0x7fc391d5c000 result 1048576:0 close free
>> 2014-02-06 05:25:59+0200 563173 [31453]: 3307f6fa aio collect
>> 0x7fc37c000960:0x7fc37c000970:0x7fc391c5a000 result 1048576:0 close free
>> 2014-02-06 05:25:59+0200 563173 [31453]: 3307f6fa aio collect
>> 0x7fc37c0009b0:0x7fc37c0009c0:0x7fc391b58000 result 1048576:0 close free
>> 2014-02-06 05:26:09+0200 563183 [614]: s2 lockspace
>> 3307f6fa-dd58-43db-ab23-b1fb299006c7:2:/dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids:0
> 
> Sanlock can access the storage again - did you change something around this time?

Unfortunately, I can't recall. The node was generally used during
between 6th and 18th, but all running VMs used directly attached iSCSI
disks, instead of disks from the Storage Domain. I don't think any
configuration changes were done to wither the host, nor storage at that
time.

> 
>> 2014-02-18 14:22:16+0200 1632150 [614]: s3 lockspace
>> 3307f6fa-dd58-43db-ab23-b1fb299006c7:2:/dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids:0
>> 2014-02-18 14:38:16+0200 1633111 [614]: s4 lockspace
>> 3307f6fa-dd58-43db-ab23-b1fb299006c7:2:/dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids:0
>> 2014-02-18 16:13:07+0200 1638801 [613]: s5 lockspace
>> 3307f6fa-dd58-43db-ab23-b1fb299006c7:2:/dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids:0
>> 2014-02-18 19:28:09+0200 1650504 [614]: s6 lockspace
>> 3307f6fa-dd58-43db-ab23-b1fb299006c7:2:/dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids:0
>>
>> Last entry is from yesterday, while I just created a new disk.
> 
> What was the status of this host in the engine from 2014-02-06 05:24:10+0200 to 2014-02-18 14:22:16?
> 
> vdsm.log and engine.log for this time frame will make it more clear.
> 
> Nir
> 

Host was up and running. The vdsm and engine logs are quite large, as we
were running some VM migrations between the hosts. Any pointers at what
to look for? For example, I noticed many entries in engine.log like this:

2014-02-24 17:02:26,635 INFO
[org.ovirt.engine.core.vdsbroker.VdsManager]
(DefaultQuartzScheduler_Worker-46) OnVdsDuringFailureTimer of vds host2
entered. Attempts after -580

The "attempt after" number continues to decrease by one every half hour.

There was probably some issue on 02.06, I can see this in engine log:

2014-02-06 05:25:03,078 WARN
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(pool-6-thread-48) domain
3307f6fa-dd58-43db-ab23-b1fb299006c7:OtaverkkoDefault in problem. vds: host2
2014-02-06 05:26:34,126 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(pool-6-thread-48) Domain
3307f6fa-dd58-43db-ab23-b1fb299006c7:OtaverkkoDefault recovered from
problem. vds: host2
2014-02-06 05:26:34,135 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(pool-6-thread-48) Domain
3307f6fa-dd58-43db-ab23-b1fb299006c7:OtaverkkoDefault has recovered from
problem. No active host in the DC is reporting it as problematic, so
clearing the domain recovery timer.

However, that was probably temporary.

At the moment, if I force host2 to become SPM, then the issue repeats -
this time host1 doesn't pick up newly created disks (and again, no
messages in /var/log/messages or sanlock). So, I don't think there is
issue with connection to storage on one host.

What exactly is the mechanism by which the non-SPM nodes should pick up
newly created disks? Is the engine sending them command to refresh, or
are they supposed to monitor the metadata of the Storage Domain volume
group?

One warning that I keep seeing in vdsm logs on both nodes is this:

Thread-1617881::WARNING::2014-02-24
16:57:50,627::sp::1553::Storage.StoragePool::(getInfo) VG
3307f6fa-dd58-43db-ab23-b1fb299006c7's metadata size exceeded
 critical size: mdasize=134217728 mdafree=0

This was happening even when the storage domain was newly created and
there were only one or two disks on it! I can't imagine how that could
be "too many".

Thanks,
Boyan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 268 bytes
Desc: OpenPGP digital signature
URL: <http://lists.ovirt.org/pipermail/users/attachments/20140225/b1b30097/attachment-0001.sig>


More information about the Users mailing list