----- Original Message -----
From: "Boyan Tabakov" <blade(a)alslayer.net>
To: "Nir Soffer" <nsoffer(a)redhat.com>
Cc: users(a)ovirt.org
Sent: Tuesday, February 25, 2014 11:53:45 AM
Subject: Re: [Users] SD Disk's Logical Volume not visible/activated on some nodes
Hello,
On 22.2.2014, 22:19, Nir Soffer wrote:
> ----- Original Message -----
>> From: "Boyan Tabakov" <blade(a)alslayer.net>
>> To: "Nir Soffer" <nsoffer(a)redhat.com>
>> Cc: users(a)ovirt.org
>> Sent: Wednesday, February 19, 2014 7:18:36 PM
>> Subject: Re: [Users] SD Disk's Logical Volume not visible/activated on
>> some nodes
>>
>> Hello,
>>
>> On 19.2.2014, 17:09, Nir Soffer wrote:
>>> ----- Original Message -----
>>>> From: "Boyan Tabakov" <blade(a)alslayer.net>
>>>> To: users(a)ovirt.org
>>>> Sent: Tuesday, February 18, 2014 3:34:49 PM
>>>> Subject: [Users] SD Disk's Logical Volume not visible/activated on
>>>> some nodes
>>>
>>>> Consequently, when creating/booting
>>>> a VM with the said disk attached, the VM fails to start on host2,
>>>> because host2 can't see the LV. Similarly, if the VM is started on
>>>> host1, it fails to migrate to host2. Extract from host2 log is in the
>>>> end. The LV in question is 6b35673e-7062-4716-a6c8-d5bf72fe3280.
>>>>
>>>> As far as I could quickly track in the vdsm code, there is only a call
>>>> to lvs and not to lvscan or lvchange, so the host2 LVM doesn't fully
>>>> refresh.
lvs should see any change on the shared storage.
>>>> The only workaround so far has been to restart VDSM on host2, which
>>>> makes it refresh all LVM data properly.
When vdsm starts, it calls multipath -r, which ensures that we see all physical volumes.
>>>>
>>>> When is host2 supposed to pick up any newly created LVs in the SD VG?
>>>> Any suggestions where the problem might be?
>>>
>>> When you create a new lv on the shared storage, the new lv should be
>>> visible on the other host. Lets start by verifying that you do see
>>> the new lv after a disk was created.
>>>
>>> Try this:
>>>
>>> 1. Create a new disk, and check the disk uuid in the engine ui
>>> 2. On another machine, run this command:
>>>
>>> lvs -o vg_name,lv_name,tags
>>>
>>> You can identify the new lv using tags, which should contain the new disk
>>> uuid.
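For example, assuming the tags really do contain the disk uuid, filtering the
output should pick out the new lv (a sketch; replace <disk-uuid> with the uuid
shown in the engine ui):

lvs -o vg_name,lv_name,tags | grep <disk-uuid>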
>>>
>>> If you don't see the new lv from the other host, please provide
>>> /var/log/messages
>>> and /var/log/sanlock.log.
>>
>> Just tried that. The disk is not visible on the non-SPM node.
>
> This means that storage is not accessible from this host.
Generally, the storage seems to be accessible. For example, if I restart
vdsmd, all volumes get picked up correctly (they become visible in lvs
output and VMs can be started with them).
Let's repeat this test, but now, if you do not see the new lv, please
run:
multipath -r
And report the results.
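Roughly, the sequence would look like this (a sketch; multipath -ll is only
there to confirm that the multipath maps for the storage domain LUN look sane
after the rescan):

multipath -r
multipath -ll
lvs -o vg_name,lv_name,tags

If the new lv shows up only after multipath -r, that would point at the
multipath/device layer rather than at LVM metadata.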
>> Here's the full
>> sanlock.log for that host:
...
>> 0x7fc37c0008c0:0x7fc37c0008d0:0x7fc391f5f000 ioto 10 to_count 1
>> 2014-02-06 05:24:10+0200 563065 [31453]: s1 delta_renew read rv -202
>> offset 0 /dev/3307f6fa-dd58-43db-ab23-b1fb299006c7/ids
>
> Sanlock cannot write to the ids lockspace
Which line shows that sanlock can't write? The messages are not very
"human readable".
The one above my comment, timestamped 2014-02-06 05:24:10+0200 (the delta_renew read that returned rv -202).
I suggest raising the sanlock debug level to get more detailed output in the sanlock log.
Edit /etc/sysconfig/sanlock and add:
# -L 7: use debug level logging to sanlock log file
SANLOCKOPTS="$SANLOCKOPTS -L 7"
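Note that sanlock only picks up the new options after a restart (for example
service sanlock restart, or systemctl restart sanlock on newer hosts);
ideally do this with the host in maintenance, since restarting sanlock
affects any lockspaces it currently holds.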
>>
>> Last entry is from yesterday, while I just created a new disk.
>
> What was the status of this host in the engine from 2014-02-06
> 05:24:10+0200 to 2014-02-18 14:22:16?
>
> vdsm.log and engine.log for this time frame will make it more clear.
Host was up and running. The vdsm and engine logs are quite large, as we
were running some VM migrations between the hosts. Any pointers on what
to look for? For example, I noticed many entries in engine.log like this:
It will be hard to make any progress without the logs.
One warning that I keep seeing in vdsm logs on both nodes is this:
Thread-1617881::WARNING::2014-02-24
16:57:50,627::sp::1553::Storage.StoragePool::(getInfo) VG
3307f6fa-dd58-43db-ab23-b1fb299006c7's metadata size exceeded
critical size: mdasize=134217728 mdafree=0
Can you share the output of the command below?
lvs -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
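If the vg metadata really is full (mdafree=0), it may also be worth looking at
the metadata areas per physical volume, for example (a sketch):

pvs -o pv_name,vg_name,pv_mda_size,pv_mda_free,pv_mda_count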
I suggest that you open a bug and attach engine.log, /var/log/messages,
vdsm.log and sanlock.log.
Please also give detailed info on the host OS, vdsm version, etc.
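Something like the following should collect the relevant version details
(a sketch; adjust the package names to whatever your distribution uses):

cat /etc/redhat-release
uname -r
rpm -q vdsm lvm2 sanlock device-mapper-multipath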
Nir