[ovirt-users] VMs becoming non-responsive sporadically

nicolas at devels.es nicolas at devels.es
Sat Apr 30 21:48:04 UTC 2016


On 2016-04-30 22:37, Nir Soffer wrote:
> On Sat, Apr 30, 2016 at 10:28 PM, Nir Soffer <nsoffer at redhat.com> 
> wrote:
>> On Sat, Apr 30, 2016 at 7:16 PM,  <nicolas at devels.es> wrote:
>>> On 2016-04-30 16:55, Nir Soffer wrote:
>>>> 
>>>> On Sat, Apr 30, 2016 at 11:33 AM, Nicolás <nicolas at devels.es> wrote:
>>>>> 
>>>>> Hi Nir,
>>>>> 
>>>>> On 29/04/16 at 22:34, Nir Soffer wrote:
>>>>>> 
>>>>>> 
>>>>>> On Fri, Apr 29, 2016 at 9:17 PM,  <nicolas at devels.es> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> We're running oVirt 3.6.5.3-1 and lately we're experiencing some issues
>>>>>>> with some VMs being paused because they're marked as non-responsive.
>>>>>>> Mostly they recover after a few seconds, but we want to debug this
>>>>>>> problem precisely so we can fix it for good.
>>>>>>> 
>>>>>>> Our scenario is the following:
>>>>>>> 
>>>>>>> ~495 VMs, of which ~120 are constantly up
>>>>>>> 3 datastores, all of them iSCSI-based:
>>>>>>>    * ds1: 2T, currently has 276 disks
>>>>>>>    * ds2: 2T, currently has 179 disks
>>>>>>>    * ds3: 500G, currently has 65 disks
>>>>>>> 7 hosts: all have mostly the same hardware. CPU and memory usage is
>>>>>>> currently very low (< 10%).
>>>>>>> 
>>>>>>>    ds1 and ds2 are physically the same backend, which exports two 2TB
>>>>>>> volumes. ds3 is a different storage backend to which we're currently
>>>>>>> migrating some disks from ds1 and ds2.
>>>>>> 
>>>>>> 
>>>>>> What is the storage backend behind ds1 and ds2?
>>>>> 
>>>>> 
>>>>> 
>>>>> The storage backend for ds1 and ds2 is the iSCSI-based HP LeftHand
>>>>> P4000 G2.
>>>>> 
>>>>>>> Usually, when VMs become unresponsive, the whole host where they run
>>>>>>> gets unresponsive too, which gives a hint about the problem; my bet is
>>>>>>> that the culprit is somewhere on the host side and not on the VM side.
>>>>>> 
>>>>>> 
>>>>>> Probably the vm became unresponsive because the connection to the host
>>>>>> was lost.
>>>>> 
>>>>> 
>>>>> 
>>>>> I forgot to mention that, less commonly, we have situations where the
>>>>> host doesn't become unresponsive but the VMs on it do, and they never
>>>>> become responsive again, so we have to forcibly power them off and start
>>>>> them on a different host. In this case, however, the connection with the
>>>>> host is never lost (so basically the host is Up, but any VM running on it
>>>>> is unresponsive).
>>>>> 
>>>>> 
>>>>>>> When that happens, the host itself becomes non-responsive and is only
>>>>>>> recoverable after a reboot, since it's unable to reconnect.
>>>>>> 
>>>>>> 
>>>>>> Piotr, can you check the engine log and explain why the host is not
>>>>>> reconnected?
>>>>>> 
>>>>>>> I must say this is not specific to this oVirt version; the same happened
>>>>>>> when we were using v3.6.4. It's also worth mentioning that we've not made
>>>>>>> any configuration changes and everything had been working quite well for
>>>>>>> a long time.
>>>>>>> 
>>>>>>> We were monitoring the performance of our ds1/ds2 physical backend and
>>>>>>> we suspect we've run out of IOPS, since we're reaching the maximum
>>>>>>> specified by the manufacturer; probably at certain times the host cannot
>>>>>>> perform a storage operation within some time limit and it marks the VMs
>>>>>>> as unresponsive. That's why we've set up ds3 and we're migrating ds1 and
>>>>>>> ds2 to ds3. When we run out of space on ds3 we'll create more small
>>>>>>> volumes to keep migrating.
>>>>>>> 
>>>>>>> On the host side, when this happens, we've run repoplot on the vdsm log
>>>>>>> and I'm attaching the result. Clearly there's a *huge* LVM response time
>>>>>>> (~30 secs.).
>>>>>> 
>>>>>> 
>>>>>> Indeed the log shows very slow vgck and vgs commands - these are called
>>>>>> every 5 minutes to check the vg health and refresh the vdsm lvm cache.
>>>>>> 
>>>>>> 1. starting vgck
>>>>>> 
>>>>>> Thread-96::DEBUG::2016-04-29
>>>>>> 13:17:48,682::lvm::290::Storage.Misc.excCmd::(cmd) 
>>>>>> /usr/bin/taskset
>>>>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgck --config ' 
>>>>>> devices
>>>>>> { pre
>>>>>> ferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>>>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>>>>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\
>>>>>> '', '\''r|.*|'\'' ] }  global {  locking_type=1
>>>>>> prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  
>>>>>> backup {
>>>>>> retain_min = 50  retain_days = 0 } ' 5de4a000-a9c4-48
>>>>>> 9c-8eee-10368647c413 (cwd None)
>>>>>> 
>>>>>> 2. vgck ends after 55 seconds
>>>>>> 
>>>>>> Thread-96::DEBUG::2016-04-29
>>>>>> 13:18:43,173::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> 
>>>>>> = '
>>>>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>>>>> enabling it!\n'; <rc> = 0
>>>>>> 
>>>>>> 3. starting vgs
>>>>>> 
>>>>>> Thread-96::DEBUG::2016-04-29
>>>>>> 13:17:11,963::lvm::290::Storage.Misc.excCmd::(cmd) 
>>>>>> /usr/bin/taskset
>>>>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' 
>>>>>> devices
>>>>>> { pref
>>>>>> erred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>>>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>>>>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/de
>>>>>> 
>>>>>> 
>>>>>> v/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|/dev/mapper/WDC_WD5003ABYZ-011FA0_WD-WMAYP0J73DU6|'\'',
>>>>>> '\''r|.*|'\'' ] }  global {
>>>>>>   locking_type=1  prioritise_write_locks=1  wait_for_locks=1
>>>>>> use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } '
>>>>>> --noheadings --units b --nosuffix --separator '|
>>>>>> ' --ignoreskippedcluster -o
>>>>>> 
>>>>>> 
>>>>>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>>>>> 5de4a000-a9c4-489c-8eee-10368
>>>>>> 647c413 (cwd None)
>>>>>> 
>>>>>> 4. vgs finished after 37 seconds
>>>>>> 
>>>>>> Thread-96::DEBUG::2016-04-29
>>>>>> 13:17:48,680::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err> 
>>>>>> = '
>>>>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>>>>> enabling it!\n'; <rc> = 0
>>>>>> 
>>>>>> Zdenek, how do you suggest debugging these slow lvm commands?
>>>>>> 
>>>>>> Can you run the following commands on a host in trouble, and on 
>>>>>> some
>>>>>> other
>>>>>> hosts in the same timeframe?
>>>>>> 
>>>>>> time vgck -vvvv --config ' devices { filter =
>>>>>> ['\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'',
>>>>>> '\''r|.*|'\'' ] }  global {  locking_type=1  
>>>>>> prioritise_write_locks=1
>>>>>> wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50
>>>>>> retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413
>>>>>> 
>>>>>> time vgs -vvvv --config ' global { locking_type=1
>>>>>> prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  
>>>>>> backup {
>>>>>> retain_min = 50  retain_days = 0 } '
>>>>>> 5de4a000-a9c4-489c-8eee-10368647c413
>>>>>> 
>>>>>> Note that I added -vvvv to both commands; this will generate a huge
>>>>>> amount of debugging info. Please share the output of these commands.
>>>>>> 
>>>>>> You may need to fix the commands. You can always copy and paste them
>>>>>> directly from the vdsm log into the shell and add the -vvvv flag.
>>>>>> 
>>>>> 
>>>>> Indeed, there seems to be a big difference on hosts 5 and 6. I'm attaching
>>>>> the results of the execution of both commands on all hosts. Both commands
>>>>> show a considerably bigger output on hosts 5 and 6, and also a much longer
>>>>> execution time. The times are also attached in a file called TIMES.
>>>>> 
>>>>>>> Our host storage network is correctly configured and on a 1G
>>>>>>> interface; there are no errors on the host itself, the switches, etc.
>>>>>> 
>>>>>> 
>>>>>> 1G serving about 20-70 vms per host? (495 vms, 120 always up, 7 
>>>>>> hosts)?
>>>>>> 
>>>>>> Do you have separate network for management and storage, or both
>>>>>> use this 1G interface?
>>>>> 
>>>>> 
>>>>> 
>>>>> Yes, we have separate networks for management, storage and motion. Storage
>>>>> and motion have 1G each (plus, for storage we use a bond of 2 interfaces in
>>>>> ALB-mode (6)). Currently no host has more than 30 VMs at a time.
>>>>> 
>>>>>>> We've also limited storage in QoS to use 10MB/s and 40 IOPS,
>>>>>> 
>>>>>> 
>>>>>> How did you configure this limit?
>>>>> 
>>>>> 
>>>>> 
>>>>> In the Data Center tab, I chose our DC and a QoS sub-tab appears, just as
>>>>> described here: http://www.ovirt.org/documentation/sla/network-qos/
>>>>> 
>>>>>> 
>>>>>>> but this issue still happens, which makes me wonder whether this is
>>>>>>> really just an IOPS issue; each host handles about 600 LVs. Could this be
>>>>>>> an issue too? I should note that the LVM response times are low under
>>>>>>> normal conditions (~1-2 seconds).
>>>>>> 
>>>>>> 
>>>>>> We recommend limiting the number of lvs per vg to 350. If you have 276
>>>>>> disks on ds1, and the disks are using snapshots, you may have too many
>>>>>> lvs, which can cause slowdowns in lvm operations.
>>>>>> 
>>>>>> Can you share the output of:
>>>>>> 
>>>>>> vdsm-tool dump-volume-chains 5de4a000-a9c4-489c-8eee-10368647c413
>>>>> 
>>>>> 
>>>>> 
>>>>> Ok, right now no VG has more than 350 LVs, so I guess this is not currently
>>>>> the problem. Unfortunately, I ran the vdsm-tool command but it hadn't
>>>>> finished nor produced any output after about 1 hour, so I guess it was hung
>>>>> and I stopped it. If you confirm this is a normal execution time I can
>>>>> leave it running for however long it takes.
>>>>> 
>>>>> host5:~# vgs
>>>>>   VG                                   #PV #LV #SN Attr   VSize   VFree
>>>>>   0927c050-6fb6-463c-bb37-8b8da641dcd3   1  63   0 wz--n- 499,62g 206,50g
>>>>>   5de4a000-a9c4-489c-8eee-10368647c413   1 335   0 wz--n-   2,00t 518,62g
>>>>>   b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8   1 215   0 wz--n-   2,00t 495,25g
>>>>>   sys_vg                                 1   6   0 wz--n- 136,44g 122,94g
>>>>>>> 
>>>>>>> 
>>>>>>> I'm attaching the vdsm.log, engine.log and repoplot PDF;
>>>>>> 
>>>>>> 
>>>>>> This is very useful, thanks. Can you also send the vdsm logs and repoplots
>>>>>> from the other hosts for the same timeframe?
>>>>>> 
>>>>>>> if someone could give a hint about any additional problems in them and
>>>>>>> shed some light on the above thoughts, I'd be very grateful.
>>>>>> 
>>>>>> 
>>>>>> Do you have sar configured on the host? Having sar logs can reveal more
>>>>>> info about this timeframe.
>>>>>>
>>>>>> Do you have information about the amount of io from the vms during this
>>>>>> timeframe? And about the amount of io on the storage backend during this
>>>>>> timeframe?
>>>>> 
>>>>> 
>>>>> 
>>>>> Not currently, but I'll be on the alert for when this happens again, and
>>>>> I'll check some sar commands related to I/O and provide feedback.
>>>>> Nevertheless, at the time I ran the commands above no machine was
>>>>> unresponsive and I was still getting such huge execution times. I tried
>>>>> running iotop now and I see there are 2 vgck and vgs processes, each
>>>>> reading at a rate of ~500 KB/s.
>>>>> 
>>>>> root     24446  1.3  0.0  57008  6824 ?        D<   09:15   0:00
>>>>> /usr/sbin/lvm vgs --config  devices { preferred_names = 
>>>>> ["^/dev/mapper/"]
>>>>> ignore_suspended_devices=1 write_cache_state=0
>>>>> disable_after_error_count=3
>>>>> filter = [
>>>>> 
>>>>> 'a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|',
>>>>> 'r|.*|' ] }  global {  locking_type=1  prioritise_write_locks=1
>>>>> wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50 
>>>>> retain_days
>>>>> = 0
>>>>> }  --noheadings --units b --nosuffix --separator | 
>>>>> --ignoreskippedcluster
>>>>> -o
>>>>> 
>>>>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>>>> b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d
>>>>> 
>>>>> I'm also attaching the iostat output for host 5 in a file called
>>>>> iostat-host5.
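>>>>>
>>>>> When this reproduces I'll probably start with something along these lines
>>>>> (just a rough idea, and assuming the sysstat package is installed on the
>>>>> hosts):
>>>>>
>>>>> # live per-device I/O statistics, 30 samples at 2-second intervals
>>>>> sar -d -p 2 30
>>>>> # or, once the cron collection is enabled, from that day's data file
>>>>> # (e.g. sa29 for the 29th)
>>>>> sar -d -p -f /var/log/sa/sa29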
>>>> 
>>>> 
>>>> Interesting, all hosts are checking the same storage, but on host5 and
>>>> host6 the output of vgs and vgck is 10 times bigger - this explains why
>>>> they take about 10 times longer to run.
>>>> 
>>>> $ ls -lh vgs-*
>>>> -rw-r--r--. 1 nsoffer nsoffer 221K Apr 30 08:55 vgs-host1
>>>> -rw-r--r--. 1 nsoffer nsoffer 239K Apr 30 08:55 vgs-host2
>>>> -rw-r--r--. 1 nsoffer nsoffer 228K Apr 30 08:55 vgs-host3
>>>> -rw-r--r--. 1 nsoffer nsoffer 288K Apr 30 08:55 vgs-host4
>>>> -rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host5
>>>> -rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host6
>>>> -rw-r--r--. 1 nsoffer nsoffer 232K Apr 30 08:55 vgs-host7
>>>> $ ls -lh vgck-*
>>>> -rw-r--r--. 1 nsoffer nsoffer 238K Apr 30 08:55 vgck-host1
>>>> -rw-r--r--. 1 nsoffer nsoffer 254K Apr 30 08:55 vgck-host2
>>>> -rw-r--r--. 1 nsoffer nsoffer 244K Apr 30 08:55 vgck-host3
>>>> -rw-r--r--. 1 nsoffer nsoffer 298K Apr 30 08:55 vgck-host4
>>>> -rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host5
>>>> -rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host6
>>>> -rw-r--r--. 1 nsoffer nsoffer 248K Apr 30 08:55 vgck-host7
>>>> 
>>>> Can you collect the output of "dmsetup table -v" on hosts 1, 5, and 6?
>>>> 
>>>> Nir
>>> 
>>> 
>>> Sure, please find attached the results of the command. Once again, for
>>> hosts 5 and 6 the size is 10 times bigger than for host 1.
>> 
>> I think the issue causing the slow vgs and vgck commands is stale lvs on
>> hosts 5 and 6. This may be the root cause of the paused vms, but I cannot
>> explain how it is related yet.
>> 
>> Comparing vgck on host1 and host5 - on host1 we can see that vgck opens
>> 213 /dev/mapper/* devices, actually 53 unique devices, most of them
>> opened 4 times. On host5, vgck opens 2489 devices (622 unique).
>> This explains why the operation takes about 10 times longer.
>> 
>> Checking the dmsetup table output, we can see that host1 has 53 devices,
>> and host5 has 622 devices.
>> 
>> Checking the device open count, host1 has 15 stale devices
>> (Open count: 0), but host5 has 597 stale devices.
>> 
>> Leaving stale devices behind is a known issue, but we never had evidence
>> that it caused trouble other than warnings in lvm commands.
>> 
>> Please open an ovirt bug about this issue, and include all the files you
>> have sent so far in this thread, as well as all the vdsm logs from host5.
>> 
>> To remove the stale devices, you can do:
>> 
>> for name in `dmsetup info -c -o open,name | awk '/ 0 / {print $2}'`; do
>>     echo "removing stale device: $name"
>>     dmsetup remove $name
>> done
> 
> Note that this cleanup is safe only in maintenance mode, since it will also
> remove the mappings for the vdsm special lvs (ids, leases, inbox, outbox,
> master, metadata), which are accessed by vdsm regularly and are not stale.
> 
> This is also not safe if you use hosted engine; the hosted engine agent
> should maintain its vg.
> 
> The safest way to remove stale devices is to restart vdsm - it deactivates
> all unused lvs (except special lvs) during startup.
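> (i.e. something like "systemctl restart vdsmd" on an EL7 host, or
> "service vdsmd restart" on EL6)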
> 

Hm, I ran it without even putting the host into maintenance, but I will 
next time. Fortunately it seems it didn't do any harm and everything is 
working. I'll probably write a script that puts the host into maintenance, 
cleans up or restarts vdsm, and activates it again (this is not a hosted 
engine) whenever the number of stale devices is bigger than a threshold.
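
Something along these lines is what I have in mind for the host-local part 
(an untested sketch; the threshold is an arbitrary value to be tuned, and 
putting the host into maintenance / activating it again would still be 
driven from the engine before and after running it):

#!/bin/bash
# needs to run as root
# count stale device-mapper devices, i.e. those with an open count of 0
stale=$(dmsetup info -c -o name,open --noheadings | awk '$2 == 0' | wc -l)
echo "stale devices: $stale"

THRESHOLD=50   # arbitrary, to be tuned
if [ "$stale" -gt "$THRESHOLD" ]; then
    # restarting vdsm deactivates unused (non-special) lvs on startup,
    # which removes the stale mappings
    systemctl restart vdsmd
fi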

Thanks again!

Nicolás

>> This will eliminate the slow lvm commands issue. The vm pausing issue may
>> be related or not; let's see if it makes any difference.
>> 
>> Nir


