[ovirt-users] VMs becoming non-responsive sporadically
nicolas at devels.es
Sat Apr 30 21:29:39 UTC 2016
On 2016-04-30 21:45, Nir Soffer wrote:
> On Sat, Apr 30, 2016 at 11:33 AM, Nicolás <nicolas at devels.es> wrote:
>> Hi Nir,
>>
>> On 29/04/16 at 22:34, Nir Soffer wrote:
>>>
>>> On Fri, Apr 29, 2016 at 9:17 PM, <nicolas at devels.es> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We're running oVirt 3.6.5.3-1 and lately we're experiencing some issues
>>>> with some VMs being paused because they're marked as non-responsive.
>>>> Mostly, after a few seconds they recover, but we want to debug this
>>>> problem precisely so we can fix it consistently.
>>>>
>>>> Our scenario is the following:
>>>>
>>>> ~495 VMs, of which ~120 are constantly up
>>>> 3 datastores, all of them iSCSI-based:
>>>> * ds1: 2T, currently has 276 disks
>>>> * ds2: 2T, currently has 179 disks
>>>> * ds3: 500G, currently has 65 disks
>>>> 7 hosts: all have mostly the same hardware; CPU and memory usage is
>>>> currently very low (< 10%).
>>>>
>>>> ds1 and ds2 are physically the same backend, which exports two 2TB
>>>> volumes. ds3 is a different storage backend to which we're currently
>>>> migrating some disks from ds1 and ds2.
>>>
>>> What is the storage backend behind ds1 and ds2?
>>
>>
>> The storage backend for ds1 and ds2 is the iSCSI-based HP LeftHand
>> P4000 G2.
>>
>>>> Usually, when VMs become unresponsive, the whole host where they run
>>>> becomes unresponsive too, which gives a hint about the problem: my bet
>>>> is that the culprit is somewhere on the host side and not on the VM
>>>> side.
>>>
>>> Probably the VM became unresponsive because the connection to the host
>>> was lost.
>>
>>
>> I forgot to mention that, less commonly, we have situations where the
>> host doesn't become unresponsive but the VMs on it do, and they never
>> become responsive again, so we have to forcibly power them off and
>> start them on a different host. In this case the connection with the
>> host is never lost (so basically the host is Up, but any VM running on
>> it is unresponsive).
>>
>>
>>>> When that happens, the host itself becomes non-responsive and is only
>>>> recoverable after a reboot, since it's unable to reconnect.
>>>
>>> Piotr, can you check the engine log and explain why the host is not
>>> reconnected?
>>>
>>>> I must say this is not specific to this oVirt version; when we were
>>>> using v3.6.4 the same happened, and it's also worth mentioning that
>>>> we've not made any configuration changes and everything had been
>>>> working quite well for a long time.
>>>>
>>>> We were monitoring our ds1 and ds2 physical backend for performance,
>>>> and we suspect we've run out of IOPS, since we're reaching the
>>>> maximum specified by the manufacturer; probably at certain times the
>>>> host cannot complete a storage operation within some time limit and
>>>> it marks VMs as unresponsive. That's why we've set up ds3 and are
>>>> migrating ds1 and ds2 to it. When we run out of space on ds3 we'll
>>>> create more smaller volumes to keep migrating.
>>>>
>>>> On the host side, when this happens, we've run repoplot on the vdsm
>>>> log and I'm attaching the result. Clearly there's a *huge* LVM
>>>> response time (~30 secs.).
>>>
>>> Indeed, the log shows very slow vgck and vgs commands - these are
>>> called every 5 minutes to check the VG health and refresh the vdsm LVM
>>> cache.
>>>
>>> 1. starting vgck
>>>
>>> Thread-96::DEBUG::2016-04-29
>>> 13:17:48,682::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgck --config ' devices
>>> { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'', '\''r|.*|'\'' ] }
>>> global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1
>>> use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } '
>>> 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)
>>>
>>> 2. vgck ends after 55 seconds
>>>
>>> Thread-96::DEBUG::2016-04-29
>>> 13:18:43,173::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>> enabling it!\n'; <rc> = 0
>>>
>>> 3. starting vgs
>>>
>>> Thread-96::DEBUG::2016-04-29
>>> 13:17:11,963::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices
>>> { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|/dev/mapper/WDC_WD5003ABYZ-011FA0_WD-WMAYP0J73DU6|'\'',
>>> '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1
>>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>>> retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|'
>>> --ignoreskippedcluster -o
>>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>> 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)
>>>
>>> 4. vgs finished after 37 seconds
>>>
>>> Thread-96::DEBUG::2016-04-29
>>> 13:17:48,680::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>> enabling it!\n'; <rc> = 0
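(Editorial note: the durations quoted above follow directly from the vdsm
timestamps; a minimal Python sketch using only the log lines above:)

```python
from datetime import datetime

# Duration of the slow lvm commands, computed from the vdsm log
# timestamps quoted above (start of command -> SUCCESS line).
FMT = "%Y-%m-%d %H:%M:%S,%f"

def duration(start: str, end: str) -> float:
    """Seconds elapsed between two vdsm-style log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

vgck_secs = duration("2016-04-29 13:17:48,682", "2016-04-29 13:18:43,173")  # ~54.5 s
vgs_secs = duration("2016-04-29 13:17:11,963", "2016-04-29 13:17:48,680")   # ~36.7 s
```

Note that the vgs call (13:17:11) actually started before the vgck call
(13:17:48) in this log excerpt.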
>>>
>>> Zdenek, how do you suggest debugging these slow lvm commands?
>>>
>>> Can you run the following commands on a host in trouble, and on some
>>> other hosts in the same timeframe?
>>>
>>> time vgck -vvvv --config ' devices { filter =
>>> ['\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'',
>>> '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1
>>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>>> retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413
>>>
>>> time vgs -vvvv --config ' global { locking_type=1
>>> prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup {
>>> retain_min = 50 retain_days = 0 } '
>>> 5de4a000-a9c4-489c-8eee-10368647c413
>>>
>>> Note that I added -vvvv to both commands; this will generate a huge
>>> amount of debugging info. Please share the output of these commands.
>>>
>>> You may need to fix the commands. You can always copy and paste
>>> directly from the vdsm log into the shell and add the -vvvv flag.
>>>
>>
>> Indeed, there seems to be a big difference on hosts 5 and 6. I'm
>> attaching the results of running both commands on all hosts. Both
>> commands produce much larger output on hosts 5 and 6, and also take
>> much longer to run. Times are also attached in a file called TIMES.
>>
>>>> Our host storage network is correctly configured and on a 1G
>>>> interface, no errors on the host itself, switches, etc.
>>>
>>> 1G serving about 20-70 VMs per host (495 VMs, 120 always up, 7
>>> hosts)?
>>>
>>> Do you have a separate network for management and storage, or do both
>>> use this 1G interface?
>>
>>
>> Yes, we have separate networks for management, storage and motion.
>> Storage and motion have 1G each (plus, for storage we use a bond of 2
>> interfaces in ALB mode (6)). Currently no host has more than 30 VMs at
>> a time.
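(Editorial note: as a back-of-the-envelope check, not from the thread
itself, the fair share of storage bandwidth per VM on such a host is
modest:)

```python
# Fair-share storage bandwidth per VM on one host.
# Assumption: the 2x1G ALB bond above is ideally balanced across both
# links, which mode 6 (balance-alb) only approximates in practice.
bond_gbit = 2                            # 2 x 1 Gbit/s interfaces
vms = 30                                 # max VMs per host, per the thread
bytes_per_sec = bond_gbit * 1e9 / 8      # bond capacity in bytes/s
per_vm_mb = bytes_per_sec / vms / 1e6    # ~8.3 MB/s per VM
```

That figure is in the same range as the 10 MB/s storage QoS limit
mentioned below, so a handful of I/O-heavy VMs can saturate the bond
regardless of QoS.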
>>
>>>> We've also limited storage in QoS to use 10MB/s and 40 IOPS,
>>>
>>> How did you configure this limit?
>>
>>
>> In the Data Center tab, I chose our DC and a QoS sub-tab appears, just
>> like described here: http://www.ovirt.org/documentation/sla/network-qos/
>>
>>>
>>>> but this issue still happens, which makes me wonder whether this is
>>>> really just an IOPS issue; each host handles about 600 LVs. Could
>>>> this be an issue too? I note that LVM response times are low under
>>>> normal conditions (~1-2 seconds).
>>>
>>> We recommend limiting the number of LVs per VG to 350. If you have 276
>>> disks on ds1, and the disks are using snapshots, you may have too many
>>> LVs, which can cause slowdowns in lvm operations.
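(Editorial note: a quick way to audit this recommendation is to parse
`vgs` output; a minimal sketch with a hypothetical helper, where the
sample data mirrors the `vgs` listing later in the thread:)

```python
# Flag volume groups whose LV count exceeds the recommended limit.
# Input is the output of: vgs --noheadings -o vg_name,lv_count
LV_LIMIT = 350

def flag_big_vgs(vgs_output: str, limit: int = LV_LIMIT) -> list:
    flagged = []
    for line in vgs_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and int(fields[1]) > limit:
            flagged.append(fields[0])
    return flagged

sample = """
  5de4a000-a9c4-489c-8eee-10368647c413  335
  b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8  215
"""
```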
>>>
>>> Can you share the output of:
>>>
>>> vdsm-tool dump-volume-chains 5de4a000-a9c4-489c-8eee-10368647c413
>>
>>
>> Ok, right now no VG has more than 350 LVs, so I guess this is not
>> currently the problem. Unfortunately, I ran the vdsm-tool command but
>> it didn't finish nor produce any output in ca. 1 hour, so I guess it
>> hung and I stopped it.
>
> This is not normal, but I never tried it on a big setup. Basically it
> uses vdsm APIs to locate all images and volumes, and if this is too
> slow, it means accessing storage is too slow.
>
> Did you try to run it on other hosts? (e.g. host1?)
>
> Does it complete in reasonable time after cleaning the stale devices?
>
>> If you confirm this is a normal execution time I can leave it
>> running whatever time it takes.
>>
>> host5:~# vgs
>>   VG                                   #PV #LV #SN Attr   VSize   VFree
>>   0927c050-6fb6-463c-bb37-8b8da641dcd3   1  63   0 wz--n- 499,62g 206,50g
>>   5de4a000-a9c4-489c-8eee-10368647c413   1 335   0 wz--n-   2,00t 518,62g
>>   b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8   1 215   0 wz--n-   2,00t 495,25g
>>   sys_vg                                 1   6   0 wz--n- 136,44g 122,94g
>>>>
>>>> I'm attaching the vdsm.log, engine.log and repoplot PDF;
>>>
>>> This is very useful, thanks. Can you send also the vdsm logs and
>>> repoplots from other hosts for the same timeframe?
>>>
>>>> if someone could give a hint on some additional problems in them and
>>>> shed some light on the above thoughts I'd be very grateful.
>>>
>>> Do you have sar configured on the host? Having sar logs can reveal
>>> more info about this timeframe.
>>>
>>> Do you have information about the amount of I/O from VMs during this
>>> timeframe? The amount of I/O on the storage backend during this
>>> timeframe?
>>
>>
>> Not currently, but I'll be alert for the next time this happens and
>> I'll check some sar commands related to I/O and provide feedback.
>> Nevertheless, by the time I ran the commands above no machine was
>> unresponsive and I'm still getting such huge execution times. I tried
>> running iotop now and I see there are 2 vgck and vgs processes, each
>> reading at a rate of ~500 KB/s.
>>
>> root 24446 1.3 0.0 57008 6824 ? D< 09:15 0:00
>> /usr/sbin/lvm vgs --config devices { preferred_names =
>> ["^/dev/mapper/"]
>> ignore_suspended_devices=1 write_cache_state=0
>> disable_after_error_count=3
>> filter = [
>> 'a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|',
>> 'r|.*|' ] } global { locking_type=1 prioritise_write_locks=1
>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>> retain_days = 0
>> } --noheadings --units b --nosuffix --separator |
>> --ignoreskippedcluster -o
>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>> b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8
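(Editorial note: for scale, a rough estimate of my own, not from the
thread: if the domain's LVM metadata area is on the order of 128 MiB, as
vdsm reportedly configures for block storage domains, a full read at the
observed ~500 KB/s would alone take minutes:)

```python
# Rough estimate of time to read a large LVM metadata area at the
# read rate observed in iotop above.
# Assumption: ~128 MiB metadata area per VG; lvm does not necessarily
# read the whole area, so this is an upper-bound illustration.
MDA_BYTES = 128 * 1024 * 1024
READ_RATE = 500 * 1024           # bytes/second, from the iotop output

seconds = MDA_BYTES / READ_RATE  # ~262 s, i.e. over 4 minutes
```

Even a partial metadata scan at that rate would be consistent with vgs
and vgck taking tens of seconds, so slow reads from the LUN dominate.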
>>
>> I'm also attaching the iostat output for host 5 in a file called
>> iostat-host5.
>>
>> Thank you very much.
>>
>> Nicolás
>>
>>> Nir
>>
>>
The loop you provided indeed seems to solve the issue (at least as a
workaround). Today host5 had been flapping all day, with VMs becoming
unresponsive and responsive again, and after running the loop the VMs
stopped being unresponsive; execution times also became normal again
(~1-2 secs, as on the other hosts).
After clearing the stale devices, I tried running the vdsm-tool command
again, this time successfully (it took about 1:30 min to finish).
Running the same command on host1 took 2:15 min, but that is because
I've not run the script on that host and it still has some stale
devices, so I guess this is a reasonable execution time. I'm attaching
the results here.
I've also filed a bug report [1] with all the relevant information; I
hope I didn't forget anything important.
Thanks for your help and for taking the time to debug this issue.
Nicolás
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1331978
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dump-volume-chains.zip
Type: application/zip
Size: 14712 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20160430/9aa3a577/attachment-0001.zip>