[ovirt-users] VMs becoming non-responsive sporadically
nicolas at devels.es
Sat Apr 30 21:29:39 UTC 2016
On 2016-04-30 21:45, Nir Soffer wrote:
> On Sat, Apr 30, 2016 at 11:33 AM, Nicolás <nicolas at devels.es> wrote:
>> Hi Nir,
>>
>> On 29/04/16 at 22:34, Nir Soffer wrote:
>>>
>>> On Fri, Apr 29, 2016 at 9:17 PM, <nicolas at devels.es> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We're running oVirt 3.6.5.3-1 and lately we're experiencing some issues
>>>> with some VMs being paused because they're marked as non-responsive.
>>>> Mostly, after a few seconds they recover, but we want to debug this
>>>> problem precisely so we can fix it consistently.
>>>>
>>>> Our scenario is the following:
>>>>
>>>> ~495 VMs, of which ~120 are constantly up
>>>> 3 datastores, all of them iSCSI-based:
>>>> * ds1: 2T, currently has 276 disks
>>>> * ds2: 2T, currently has 179 disks
>>>> * ds3: 500G, currently has 65 disks
>>>> 7 hosts: all have mostly the same hardware; CPU and memory usage is
>>>> currently very low (< 10%).
>>>>
>>>> ds1 and ds2 are physically the same backend, which exports two 2TB
>>>> volumes. ds3 is a different storage backend to which we're currently
>>>> migrating some disks from ds1 and ds2.
>>>
>>> What is the storage backend behind ds1 and ds2?
>>
>>
>> The storage backend for ds1 and ds2 is the iSCSI-based HP LeftHand
>> P4000 G2.
>>
>>>> Usually, when VMs become unresponsive, the whole host where they run
>>>> becomes unresponsive too, which gives a hint about the problem: my bet
>>>> is that the culprit is somewhere on the host side and not on the VM
>>>> side.
>>>
>>> Probably the VM became unresponsive because the connection to the host
>>> was lost.
>>
>>
>> I forgot to mention that, less commonly, we have situations where the
>> host doesn't become unresponsive but the VMs on it do, and they never
>> become responsive again, so we have to forcibly power them off and
>> start them on a different host. In this case the connection with the
>> host is never lost (so basically the host is Up, but any VM running on
>> it is unresponsive).
>>
>>
>>>> When that happens, the host itself becomes non-responsive and is only
>>>> recoverable after a reboot, since it's unable to reconnect.
>>>
>>> Piotr, can you check the engine log and explain why the host is not
>>> reconnected?
>>>
>>>> I must say this is not specific to this oVirt version; when we were
>>>> using v3.6.4 the same happened, and it's also worth mentioning that
>>>> we've not made any configuration changes and everything had been
>>>> working quite well for a long time.
>>>>
>>>> We were monitoring our ds1 and ds2 physical backend for performance,
>>>> and we suspect we've run out of IOPS, since we're reaching the
>>>> maximum specified by the manufacturer; probably at certain times the
>>>> host cannot complete a storage operation within some time limit and
>>>> it marks VMs as unresponsive. That's why we've set up ds3 and are
>>>> migrating ds1 and ds2 to it. When we run out of space on ds3 we'll
>>>> create more smaller volumes to keep migrating.
>>>>
>>>> On the host side, when this happens, we've run repoplot on the vdsm
>>>> log and I'm attaching the result. Clearly there's a *huge* LVM
>>>> response time (~30 secs.).
>>>
>>> Indeed, the log shows very slow vgck and vgs commands - these are
>>> called every 5 minutes to check the VG health and refresh the vdsm LVM
>>> cache.
>>>
>>> 1. starting vgck
>>>
>>> Thread-96::DEBUG::2016-04-29
>>> 13:17:48,682::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgck --config ' devices
>>> { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'', '\''r|.*|'\'' ] }
>>> global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1
>>> use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } '
>>> 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)
>>>
>>> 2. vgck ends after 55 seconds
>>>
>>> Thread-96::DEBUG::2016-04-29
>>> 13:18:43,173::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>> enabling it!\n'; <rc> = 0
>>>
>>> 3. starting vgs
>>>
>>> Thread-96::DEBUG::2016-04-29
>>> 13:17:11,963::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices
>>> { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|/dev/mapper/WDC_WD5003ABYZ-011FA0_WD-WMAYP0J73DU6|'\'',
>>> '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1
>>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>>> retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|'
>>> --ignoreskippedcluster -o
>>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>> 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)
>>>
>>> 4. vgs finished after 37 seconds
>>>
>>> Thread-96::DEBUG::2016-04-29
>>> 13:17:48,680::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>> enabling it!\n'; <rc> = 0
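(Editorial note: the durations quoted above follow directly from the vdsm
timestamps; a minimal Python sketch using only the log lines above:)

```python
from datetime import datetime

# Duration of the slow lvm commands, computed from the vdsm log
# timestamps quoted above (start of command -> SUCCESS line).
FMT = "%Y-%m-%d %H:%M:%S,%f"

def duration(start: str, end: str) -> float:
    """Seconds elapsed between two vdsm-style log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

vgck_secs = duration("2016-04-29 13:17:48,682", "2016-04-29 13:18:43,173")  # ~54.5 s
vgs_secs = duration("2016-04-29 13:17:11,963", "2016-04-29 13:17:48,680")   # ~36.7 s
```

Note that the vgs call (13:17:11) actually started before the vgck call
(13:17:48) in this log excerpt.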
>>>
>>> Zdenek, how do you suggest debugging these slow lvm commands?
>>>
>>> Can you run the following commands on a host in trouble, and on some
>>> other hosts in the same timeframe?
>>>
>>> time vgck -vvvv --config ' devices { filter =
>>> ['\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'',
>>> '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1
>>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>>> retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413
>>>
>>> time vgs -vvvv --config ' global { locking_type=1
>>> prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup {
>>> retain_min = 50 retain_days = 0 } '
>>> 5de4a000-a9c4-489c-8eee-10368647c413
>>>
>>> Note that I added -vvvv to both commands; this will generate a huge
>>> amount of debugging info. Please share the output of these commands.
>>>
>>> You may need to fix the commands. You can always copy and paste
>>> directly from the vdsm log into the shell and add the -vvvv flag.
>>>
>>
>> Indeed, there seems to be a big difference on hosts 5 and 6. I'm
>> attaching the results of running both commands on all hosts. Both
>> commands produce much larger output on hosts 5 and 6, and also take
>> much longer to run. Times are also attached in a file called TIMES.
>>
>>>> Our host storage network is correctly configured and on a 1G
>>>> interface, no errors on the host itself, switches, etc.
>>>
>>> 1G serving about 20-70 VMs per host (495 VMs, 120 always up, 7
>>> hosts)?
>>>
>>> Do you have a separate network for management and storage, or do both
>>> use this 1G interface?
>>
>>
>> Yes, we have separate networks for management, storage and motion.
>> Storage and motion have 1G each (plus, for storage we use a bond of 2
>> interfaces in ALB mode (6)). Currently no host has more than 30 VMs at
>> a time.
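(Editorial note: as a back-of-the-envelope check, not from the thread
itself, the fair share of storage bandwidth per VM on such a host is
modest:)

```python
# Fair-share storage bandwidth per VM on one host.
# Assumption: the 2x1G ALB bond above is ideally balanced across both
# links, which mode 6 (balance-alb) only approximates in practice.
bond_gbit = 2                            # 2 x 1 Gbit/s interfaces
vms = 30                                 # max VMs per host, per the thread
bytes_per_sec = bond_gbit * 1e9 / 8      # bond capacity in bytes/s
per_vm_mb = bytes_per_sec / vms / 1e6    # ~8.3 MB/s per VM
```

That figure is in the same range as the 10 MB/s storage QoS limit
mentioned below, so a handful of I/O-heavy VMs can saturate the bond
regardless of QoS.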
>>
>>>> We've also limited storage in QoS to use 10MB/s and 40 IOPS,
>>>
>>> How did you configure this limit?
>>
>>
>> In the Data Center tab, I chose our DC and a QoS sub-tab appears, just
>> like described here: http://www.ovirt.org/documentation/sla/network-qos/
>>
>>>
>>>> but this issue still happens, which makes me wonder whether this is
>>>> really just an IOPS issue; each host handles about 600 LVs. Could
>>>> this be an issue too? I note that LVM response times are low under
>>>> normal conditions (~1-2 seconds).
>>>
>>> We recommend limiting the number of LVs per VG to 350. If you have 276
>>> disks on ds1, and the disks are using snapshots, you may have too many
>>> LVs, which can cause slowdowns in lvm operations.
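(Editorial note: a quick way to audit this recommendation is to parse
`vgs` output; a minimal sketch with a hypothetical helper, where the
sample data mirrors the `vgs` listing later in the thread:)

```python
# Flag volume groups whose LV count exceeds the recommended limit.
# Input is the output of: vgs --noheadings -o vg_name,lv_count
LV_LIMIT = 350

def flag_big_vgs(vgs_output: str, limit: int = LV_LIMIT) -> list:
    flagged = []
    for line in vgs_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and int(fields[1]) > limit:
            flagged.append(fields[0])
    return flagged

sample = """
  5de4a000-a9c4-489c-8eee-10368647c413  335
  b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8  215
"""
```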
>>>
>>> Can you share the output of:
>>>
>>> vdsm-tool dump-volume-chains 5de4a000-a9c4-489c-8eee-10368647c413
>>
>>
>> Ok, right now no VG has more than 350 LVs, so I guess this is not
>> currently the problem. Unfortunately, I ran the vdsm-tool command but
>> it didn't finish nor produce any output in ca. 1 hour, so I guess it
>> hung and I stopped it.
>
> This is not normal, but I never tried it on a big setup. Basically it
> uses vdsm APIs to locate all images and volumes, and if this is too
> slow, it means accessing storage is too slow.
>
> Did you try to run it on other hosts? (e.g. host1?)
>
> Does it complete in reasonable time after cleaning the stale devices?
>
>> If you confirm this is a normal execution time I can leave it
>> running whatever time it takes.
>>
>> host5:~# vgs
>>   VG                                   #PV #LV #SN Attr   VSize   VFree
>>   0927c050-6fb6-463c-bb37-8b8da641dcd3   1  63   0 wz--n- 499,62g 206,50g
>>   5de4a000-a9c4-489c-8eee-10368647c413   1 335   0 wz--n-   2,00t 518,62g
>>   b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8   1 215   0 wz--n-   2,00t 495,25g
>>   sys_vg                                 1   6   0 wz--n- 136,44g 122,94g
>>>>
>>>> I'm attaching the vdsm.log, engine.log and repoplot PDF;
>>>
>>> This is very useful, thanks. Can you send also the vdsm logs and
>>> repoplots from other hosts for the same timeframe?
>>>
>>>> if someone could give a hint on some additional problems in them and
>>>> shed some light on the above thoughts I'd be very grateful.
>>>
>>> Do you have sar configured on the host? Having sar logs can reveal
>>> more info about this timeframe.
>>>
>>> Do you have information about the amount of I/O from VMs during this
>>> timeframe? The amount of I/O on the storage backend during this
>>> timeframe?
>>
>>
>> Not currently, but I'll be alert for the next time this happens and
>> I'll check some sar commands related to I/O and provide feedback.
>> Nevertheless, by the time I ran the commands above no machine was
>> unresponsive and I'm still getting such huge execution times. I tried
>> running iotop now and I see there are 2 vgck and vgs processes, each
>> reading at a rate of ~500 KB/s.
>>
>> root 24446 1.3 0.0 57008 6824 ? D< 09:15 0:00
>> /usr/sbin/lvm vgs --config devices { preferred_names =
>> ["^/dev/mapper/"]
>> ignore_suspended_devices=1 write_cache_state=0
>> disable_after_error_count=3
>> filter = [
>> 'a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|',
>> 'r|.*|' ] } global { locking_type=1 prioritise_write_locks=1
>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>> retain_days = 0
>> } --noheadings --units b --nosuffix --separator |
>> --ignoreskippedcluster -o
>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>> b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8
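(Editorial note: for scale, a rough estimate of my own, not from the
thread: if the domain's LVM metadata area is on the order of 128 MiB, as
vdsm reportedly configures for block storage domains, a full read at the
observed ~500 KB/s would alone take minutes:)

```python
# Rough estimate of time to read a large LVM metadata area at the
# read rate observed in iotop above.
# Assumption: ~128 MiB metadata area per VG; lvm does not necessarily
# read the whole area, so this is an upper-bound illustration.
MDA_BYTES = 128 * 1024 * 1024
READ_RATE = 500 * 1024           # bytes/second, from the iotop output

seconds = MDA_BYTES / READ_RATE  # ~262 s, i.e. over 4 minutes
```

Even a partial metadata scan at that rate would be consistent with vgs
and vgck taking tens of seconds, so slow reads from the LUN dominate.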
>>
>> I'm also attaching the iostat output for host 5 in a file called
>> iostat-host5.
>>
>> Thank you very much.
>>
>> Nicolás
>>
>>> Nir
>>
>>
The loop you provided indeed seems to solve the issue (at least as a
workaround). Today host5 had been flapping all day, with VMs becoming
unresponsive and responsive again, and after running the loop the VMs
stopped being unresponsive; execution times also became normal again
(~1-2 secs, as on the other hosts).
After clearing the stale devices, I tried running the vdsm-tool command
again, this time successfully (it took about 1:30 min to finish).
Running the same command on host1 took 2:15 min, but that is because
I've not run the script on that host and it still has some stale
devices, so I guess this is a reasonable execution time. I'm attaching
the results here.
I've also filed a bug report [1] with all the relevant information; I
hope I didn't forget anything important.
Thanks for your help and for taking the time to debug this issue.
Nicolás
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1331978
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dump-volume-chains.zip
Type: application/zip
Size: 14712 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20160430/9aa3a577/attachment-0001.zip>