[ovirt-users] VMs becoming non-responsive sporadically

Nir Soffer nsoffer at redhat.com
Sat Apr 30 15:55:06 UTC 2016


On Sat, Apr 30, 2016 at 11:33 AM, Nicolás <nicolas at devels.es> wrote:
> Hi Nir,
>
> El 29/04/16 a las 22:34, Nir Soffer escribió:
>>
>> On Fri, Apr 29, 2016 at 9:17 PM,  <nicolas at devels.es> wrote:
>>>
>>> Hi,
>>>
>>> We're running oVirt 3.6.5.3-1 and lately we're experiencing some issues
>>> with some VMs being paused because they're marked as non-responsive.
>>> Mostly they recover after a few seconds, but we want to debug this
>>> problem precisely so we can fix it reliably.
>>>
>>> Our scenario is the following:
>>>
>>> ~495 VMs, of which ~120 are constantly up
>>> 3 datastores, all of them iSCSI-based:
>>>    * ds1: 2T, currently has 276 disks
>>>    * ds2: 2T, currently has 179 disks
>>>    * ds3: 500G, currently has 65 disks
>>> 7 hosts: All have mostly the same hardware. CPU and memory usage is
>>> currently very low (< 10%).
>>>
>>>    ds1 and ds2 are physically the same backend, which exports two 2TB
>>> volumes. ds3 is a different storage backend to which we're currently
>>> migrating some disks from ds1 and ds2.
>>
>> What is the storage backend behind ds1 and ds2?
>
>
> The storage backend for ds1 and ds2 is the iSCSI-based HP LeftHand P4000 G2.
>
>>> Usually, when VMs become unresponsive, the whole host where they run
>>> becomes unresponsive too, which gives a hint about the problem: my bet
>>> is that the culprit is on the host side rather than on the VM side.
>>
>> Probably the VMs became unresponsive because the connection to the host
>> was lost.
>
>
> I forgot to mention that, less commonly, we also have situations where the
> host stays responsive but the VMs on it do not, and they never become
> responsive again, so we have to forcibly power them off and start them on a
> different host. In this case the connection with the host is never lost (so
> basically the host is Up, but any VM running on it is unresponsive).
>
>
>>> When that
>>> happens, the host itself becomes non-responsive and is only recoverable
>>> after a reboot, since it's unable to reconnect.
>>
>> Piotr, can you check the engine log and explain why the host is not
>> reconnected?
>>
>>> I must say this is not specific to this oVirt version; the same happened
>>> when we were using v3.6.4. It's also worth mentioning that we haven't
>>> made any configuration changes and everything had been working quite
>>> well for a long time.
>>>
>>> We were monitoring performance on the physical backend behind ds1 and
>>> ds2 and we suspect we've run out of IOPS, since we're reaching the
>>> maximum specified by the manufacturer; probably at certain times the
>>> host cannot complete a storage operation within some time limit, so it
>>> marks VMs as unresponsive. That's why we've set up ds3 and are migrating
>>> disks from ds1 and ds2 to it. When we run out of space on ds3 we'll
>>> create more, smaller volumes and keep migrating.
>>>
>>> On the host side, when this happens, we've run repoplot on the vdsm log
>>> and I'm attaching the result. Clearly there's a *huge* LVM response time
>>> (~30 seconds).
>>
>> Indeed the log shows very slow vgck and vgs commands - these are run every
>> 5 minutes to check vg health and refresh vdsm's lvm cache. (A quick way to
>> pull these durations out of vdsm.log is sketched after the excerpts below.)
>>
>> 1. starting vgck
>>
>> Thread-96::DEBUG::2016-04-29 13:17:48,682::lvm::290::Storage.Misc.excCmd::(cmd)
>> /usr/bin/taskset --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgck
>> --config ' devices { preferred_names = ["^/dev/mapper/"]
>> ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3
>> filter = [ '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'',
>> '\''r|.*|'\'' ] }  global {  locking_type=1 prioritise_write_locks=1
>> wait_for_locks=1  use_lvmetad=0 }  backup { retain_min = 50  retain_days = 0 } '
>> 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)
>>
>> 2. vgck ends after 55 seconds
>>
>> Thread-96::DEBUG::2016-04-29
>> 13:18:43,173::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>> enabling it!\n'; <rc> = 0
>>
>> 3. starting vgs
>>
>> Thread-96::DEBUG::2016-04-29 13:17:11,963::lvm::290::Storage.Misc.excCmd::(cmd)
>> /usr/bin/taskset --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgs
>> --config ' devices { preferred_names = ["^/dev/mapper/"]
>> ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3
>> filter = [
>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|/dev/mapper/WDC_WD5003ABYZ-011FA0_WD-WMAYP0J73DU6|'\'',
>> '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1
>> wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } '
>> --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o
>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>> 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)
>>
>> 4. vgs finished after 37 seconds
>>
>> Thread-96::DEBUG::2016-04-29
>> 13:17:48,680::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '
>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>> enabling it!\n'; <rc> = 0
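>>
>> A quick way to eyeball these durations straight from vdsm.log (a sketch,
>> assuming the default log location and the line format shown above; the
>> ::290:: line number may differ between vdsm versions) is to list the lvm
>> command start/SUCCESS pairs and compare their timestamps:
>>
>>   grep 'lvm::290::Storage.Misc.excCmd' /var/log/vdsm/vdsm.log
>>
>> Adjacent lines sharing the same Thread-NN prefix mark the start and end of
>> one lvm command, so the timestamp difference is its runtime.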
>>
>> Zdenek, how do you suggest debugging these slow lvm commands?
>>
>> Can you run the following commands on a host in trouble, and on some other
>> hosts in the same timeframe?
>>
>> time vgck -vvvv --config ' devices { filter =
>> ['\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'',
>> '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1
>> wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50
>> retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413
>>
>> time vgs -vvvv --config ' global { locking_type=1
>> prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {
>> retain_min = 50  retain_days = 0 } '
>> 5de4a000-a9c4-489c-8eee-10368647c413
>>
>> Note that I added -vvvv to both commands; this will generate a huge amount
>> of debugging info. Please share the output of both commands.
>>
>> You may need to adjust the commands. You can always copy and paste them
>> directly from the vdsm log into the shell and add the -vvvv flag.
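>>
>> For example (just a sketch - the hostnames and the /tmp/lvm-check.sh path
>> are placeholders), you could put the two "time ..." commands above into a
>> small script and run it on every host:
>>
>>   for h in host1 host2 host3 host4 host5 host6 host7; do
>>       scp /tmp/lvm-check.sh root@$h:/tmp/
>>       ssh root@$h 'bash /tmp/lvm-check.sh' > lvm-check-$h 2>&1
>>   done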
>>
>
> Indeed, there seems to be a big difference on hosts 5 and 6. I'm attaching
> the results of the execution of both commands on all hosts. Both commands
> produce a much bigger output on hosts 5 and 6, and also take much longer to
> run. The execution times are attached in a file called TIMES.
>
>>> Our host storage network is correctly configured and runs on a 1G
>>> interface; there are no errors on the host itself, the switches, etc.
>>
>> 1G serving about 20-70 VMs per host (495 VMs, 120 always up, 7 hosts)?
>>
>> Do you have separate network for management and storage, or both
>> use this 1G interface?
>
>
> Yes, we have separate networks for management, storage and motion. Storage
> and motion have 1G each (and for storage we use a bond of 2 interfaces in
> ALB mode (mode 6)). Currently no host has more than 30 VMs at a time.
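>
> (For reference, the bond mode and the state of both slaves can be checked
> with, assuming the bond device is bond0:
>
>   cat /proc/net/bonding/bond0
>
> which reports the bonding mode, MII status and per-slave link state.)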
>
>>> We've also limited storage in QoS to use 10MB/s and 40 IOPS,
>>
>> How did you configure this limit?
>
>
> In the Data Center tab, I chose our DC and a QoS sub-tab appears, as
> described here: http://www.ovirt.org/documentation/sla/network-qos/
>
>>
>>> but this issue still happens, which makes me wonder whether this really
>>> is just an IOPS issue; each host handles circa 600 LVs. Could this be an
>>> issue too? I should note that LVM response times are low under normal
>>> conditions (~1-2 seconds).
>>
>> We recommend limiting the number of lvs per vg to 350. If you have 276
>> disks on ds1 and the disks are using snapshots, you may have too many lvs,
>> which can cause slowdowns in lvm operations.
>>
>> Can you share the output of:
>>
>> vdsm-tool dump-volume-chains 5de4a000-a9c4-489c-8eee-10368647c413
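>>
>> In the meantime, a quick way to check the LV count per VG (a plain lvm
>> query, nothing vdsm-specific) is:
>>
>>   vgs -o vg_name,lv_count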
>
>
> Ok, right now no VG has more than 350 LVs, so I guess this is not currently
> the problem. Unfortunately, I ran the vdsm-tool command but it didn't finish
> or produce any output within about an hour, so I assumed it had hung and
> stopped it. If you confirm this is a normal execution time I can leave it
> running for however long it takes.
>
> host5:~# vgs
>   VG                                   #PV #LV #SN Attr   VSize VFree
>   0927c050-6fb6-463c-bb37-8b8da641dcd3   1  63   0 wz--n- 499,62g 206,50g
>   5de4a000-a9c4-489c-8eee-10368647c413   1 335   0 wz--n-   2,00t 518,62g
>   b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8   1 215   0 wz--n-   2,00t 495,25g
>   sys_vg                                 1   6   0 wz--n- 136,44g 122,94g
>>>
>>> I'm attaching the vdsm.log, engine.log and repoplot PDF;
>>
>> This is very useful, thanks. Can you also send the vdsm logs and repoplots
>> from the other hosts for the same timeframe?
>>
>>> if someone could
>>> give a hint on some additional problems in them and shed some light on
>>> the
>>> above thoughts I'd be very grateful.
>>
>> Do you have sar configured on the host? Having sar logs can reveal more
>> info about this timeframe.
>>
>> Do you have information about the amount of I/O from the VMs during this
>> timeframe? And about the amount of I/O on the storage backend during this
>> timeframe?
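>>
>> If sysstat is installed, something along these lines (a sketch; the sa
>> file name depends on the day of the month) shows per-device I/O for the
>> problem window:
>>
>>   sar -d -p -f /var/log/sa/sa29   # historical, per block device
>>   sar -d -p 5 12                  # live, 5-second samples for a minute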
>
>
> Not currently, but I'll watch for the next time this happens, check some
> I/O-related sar commands and provide feedback. Nevertheless, at the time I
> ran the commands above no machine was unresponsive and I was still getting
> such huge execution times. I tried running iotop now and I see there are
> two vgck/vgs processes, each reading at a rate of ~500 KB/s.
>
> root     24446  1.3  0.0  57008  6824 ?        D<   09:15   0:00
> /usr/sbin/lvm vgs --config  devices { preferred_names = ["^/dev/mapper/"]
> ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3
> filter = [
> 'a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|',
> 'r|.*|' ] }  global {  locking_type=1  prioritise_write_locks=1
> wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50 retain_days = 0
> }  --noheadings --units b --nosuffix --separator | --ignoreskippedcluster -o
> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
> b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d
>
> I'm also attaching the iostat output for host 5 in a file called
> iostat-host5.

Interesting, all hosts are checking the same storage, but on host5 and host6
the output of vgs and vgck is 10 times bigger - this explains why they take
about 10 times longer to run.

$ ls -lh vgs-*
-rw-r--r--. 1 nsoffer nsoffer 221K Apr 30 08:55 vgs-host1
-rw-r--r--. 1 nsoffer nsoffer 239K Apr 30 08:55 vgs-host2
-rw-r--r--. 1 nsoffer nsoffer 228K Apr 30 08:55 vgs-host3
-rw-r--r--. 1 nsoffer nsoffer 288K Apr 30 08:55 vgs-host4
-rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host5
-rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host6
-rw-r--r--. 1 nsoffer nsoffer 232K Apr 30 08:55 vgs-host7
$ ls -lh vgck-*
-rw-r--r--. 1 nsoffer nsoffer 238K Apr 30 08:55 vgck-host1
-rw-r--r--. 1 nsoffer nsoffer 254K Apr 30 08:55 vgck-host2
-rw-r--r--. 1 nsoffer nsoffer 244K Apr 30 08:55 vgck-host3
-rw-r--r--. 1 nsoffer nsoffer 298K Apr 30 08:55 vgck-host4
-rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host5
-rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host6
-rw-r--r--. 1 nsoffer nsoffer 248K Apr 30 08:55 vgck-host7

Can you collect the output of "dmsetup table -v" on hosts 1, 5, and 6?
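
For example (hostnames are placeholders for your hosts 1, 5 and 6):

    for h in host1 host5 host6; do
        ssh root@$h 'dmsetup table -v' > dmsetup-$h.txt 2>&1
    done

A quick "dmsetup ls | wc -l" on each host would also show whether host5 and
host6 simply have many more device-mapper devices than the others.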

Nir


