[ovirt-users] VMs becoming non-responsive sporadically
nicolas at devels.es
Sat Apr 30 21:48:04 UTC 2016
On 2016-04-30 22:37, Nir Soffer wrote:
> On Sat, Apr 30, 2016 at 10:28 PM, Nir Soffer <nsoffer at redhat.com>
> wrote:
>> On Sat, Apr 30, 2016 at 7:16 PM, <nicolas at devels.es> wrote:
>>> On 2016-04-30 16:55, Nir Soffer wrote:
>>>>
>>>> On Sat, Apr 30, 2016 at 11:33 AM, Nicolás <nicolas at devels.es> wrote:
>>>>>
>>>>> Hi Nir,
>>>>>
>>>>> On 29/04/16 at 22:34, Nir Soffer wrote:
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 29, 2016 at 9:17 PM, <nicolas at devels.es> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We're running oVirt 3.6.5.3-1 and lately we're experiencing some
>>>>>>> issues
>>>>>>> with
>>>>>>> some VMs being paused because they're marked as non-responsive.
>>>>>>> Mostly,
>>>>>>> after a few seconds they recover, but we want to debug precisely
>>>>>>> this
>>>>>>> problem so we can fix it consistently.
>>>>>>>
>>>>>>> Our scenario is the following:
>>>>>>>
>>>>>>> ~495 VMs, of which ~120 are constantly up
>>>>>>> 3 datastores, all of them iSCSI-based:
>>>>>>> * ds1: 2T, currently has 276 disks
>>>>>>> * ds2: 2T, currently has 179 disks
>>>>>>> * ds3: 500G, currently has 65 disks
>>>>>>> 7 hosts: all have mostly the same hardware. CPU and memory
>>>>>>> utilization is currently very low (< 10%).
>>>>>>>
>>>>>>> ds1 and ds2 are physically the same backend which exports two
>>>>>>> 2TB
>>>>>>> volumes.
>>>>>>> ds3 is a different storage backend where we're currently
>>>>>>> migrating some
>>>>>>> disks from ds1 and ds2.
>>>>>>
>>>>>>
>>>>>> What is the storage backend behind ds1 and ds2?
>>>>>
>>>>>
>>>>>
>>>>> The storage backend for ds1 and ds2 is the iSCSI-based HP LeftHand
>>>>> P4000
>>>>> G2.
>>>>>
>>>>>>> Usually, when VMs become unresponsive, the whole host where they
>>>>>>> run becomes unresponsive too, which gives a hint about the
>>>>>>> problem: my bet is that the culprit is somewhere on the host side
>>>>>>> and not on the VM side.
>>>>>>
>>>>>>
>>>>>> Probably the vm became unresponsive because the connection to the
>>>>>> host was lost.
>>>>>
>>>>>
>>>>>
>>>>> I forgot to mention that, less commonly, we have situations where
>>>>> the host doesn't become unresponsive but the VMs on it do, and they
>>>>> never become responsive again, so we have to forcibly power them
>>>>> off and start them on a different host. In this case, though, the
>>>>> connection with the host is never lost (basically the host is Up,
>>>>> but any VM running on it is unresponsive).
>>>>>
>>>>>
>>>>>>> When that happens, the host itself becomes non-responsive and is
>>>>>>> only recoverable after a reboot, since it's unable to reconnect.
>>>>>>
>>>>>>
>>>>>> Piotr, can you check engine log and explain why host is not
>>>>>> reconnected?
>>>>>>
>>>>>>> I must say this is not specific to this oVirt version; the same
>>>>>>> happened when we were using v3.6.4. It's also worth mentioning
>>>>>>> that we've made no configuration changes and everything had been
>>>>>>> working quite well for a long time.
>>>>>>>
>>>>>>> We were monitoring our ds1 and ds2 physical backend to see
>>>>>>> performance, and we suspect we've run out of IOPS, since we're
>>>>>>> reaching the maximum specified by the manufacturer. Probably at
>>>>>>> certain times the host cannot perform a storage operation within
>>>>>>> some time limit, so it marks VMs as unresponsive. That's why
>>>>>>> we've set up ds3 and are migrating disks from ds1 and ds2 to ds3.
>>>>>>> When we run out of space on ds3 we'll create more, smaller
>>>>>>> volumes to keep migrating.
>>>>>>>
>>>>>>> On the host side, when this happens, we've run repoplot on the
>>>>>>> vdsm log
>>>>>>> and
>>>>>>> I'm attaching the result. Clearly there's a *huge* LVM response
>>>>>>> time
>>>>>>> (~30
>>>>>>> secs.).
>>>>>>
>>>>>>
>>>>>> Indeed, the log shows very slow vgck and vgs commands - these are
>>>>>> called every 5 minutes to check the vg health and refresh vdsm's
>>>>>> lvm cache.
>>>>>>
>>>>>> 1. starting vgck
>>>>>>
>>>>>> Thread-96::DEBUG::2016-04-29
>>>>>> 13:17:48,682::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
>>>>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgck --config ' devices
>>>>>> { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>>>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>>>>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'', '\''r|.*|'\'' ]
>>>>>> } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1
>>>>>> use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } '
>>>>>> 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)
>>>>>>
>>>>>> 2. vgck ends after 55 seconds
>>>>>>
>>>>>> Thread-96::DEBUG::2016-04-29
>>>>>> 13:18:43,173::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err>
>>>>>> = '
>>>>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>>>>> enabling it!\n'; <rc> = 0
>>>>>>
>>>>>> 3. starting vgs
>>>>>>
>>>>>> Thread-96::DEBUG::2016-04-29
>>>>>> 13:17:11,963::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset
>>>>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices
>>>>>> { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>>>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>>>>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|/dev/mapper/WDC_WD5003ABYZ-011FA0_WD-WMAYP0J73DU6|'\'',
>>>>>> '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1
>>>>>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>>>>>> retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|'
>>>>>> --ignoreskippedcluster -o
>>>>>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>>>>> 5de4a000-a9c4-489c-8eee-10368647c413 (cwd None)
>>>>>>
>>>>>> 4. vgs finished after 37 seconds
>>>>>>
>>>>>> Thread-96::DEBUG::2016-04-29
>>>>>> 13:17:48,680::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS: <err>
>>>>>> = '
>>>>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>>>>> enabling it!\n'; <rc> = 0
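As an aside, the elapsed time of each command can be computed directly from the two log timestamps. A small helper (the `log_elapsed` name is mine), assuming the HH:MM:SS,ms timestamp format shown in the excerpts above:

```shell
# log_elapsed START END - seconds between two vdsm log timestamps (HH:MM:SS,ms)
log_elapsed() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, s, /[:,]/); split(b, e, /[:,]/)
        printf "%.3f\n", (e[1]*3600 + e[2]*60 + e[3] + e[4]/1000) \
                       - (s[1]*3600 + s[2]*60 + s[3] + s[4]/1000)
    }'
}

log_elapsed "13:17:11,963" "13:17:48,680"   # the vgs run quoted above
```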
>>>>>>
>>>>>> Zdenek, how do you suggest to debug this slow lvm commands?
>>>>>>
>>>>>> Can you run the following commands on a host in trouble, and on
>>>>>> some
>>>>>> other
>>>>>> hosts in the same timeframe?
>>>>>>
>>>>>> time vgck -vvvv --config ' devices { filter =
>>>>>> ['\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'',
>>>>>> '\''r|.*|'\'' ] } global { locking_type=1
>>>>>> prioritise_write_locks=1
>>>>>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>>>>>> retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413
>>>>>>
>>>>>> time vgs -vvvv --config ' global { locking_type=1
>>>>>> prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 }
>>>>>> backup {
>>>>>> retain_min = 50 retain_days = 0 } '
>>>>>> 5de4a000-a9c4-489c-8eee-10368647c413
>>>>>>
>>>>>> Note that I added -vvvv to both commands; this will generate a
>>>>>> huge amount of debugging info. Please share the output of these
>>>>>> commands.
>>>>>>
>>>>>> You may need to fix the commands. You can always copy and paste
>>>>>> directly
>>>>>> from vdsm log into the shell and add the -vvvv flag.
>>>>>>
>>>>>
>>>>> Indeed, there seems to be a big difference on hosts 5 and 6. I'm
>>>>> attaching the results of executing both commands on all hosts. Both
>>>>> commands produce a much larger output on hosts 5 and 6, and also
>>>>> take much longer to run. Times are attached in a file called TIMES.
>>>>>
>>>>>>> Our host storage network is correctly configured and on a 1G
>>>>>>> interface, no errors on the host itself, switches, etc.
>>>>>>
>>>>>>
>>>>>> 1G serving about 20-70 vms per host? (495 vms, 120 always up, 7
>>>>>> hosts)?
>>>>>>
>>>>>> Do you have separate network for management and storage, or both
>>>>>> use this 1G interface?
>>>>>
>>>>>
>>>>>
>>>>> Yes, we have separate networks for management, storage and motion.
>>>>> Storage
>>>>> and motion have 1G each (plus, for storage we use a bond of 2
>>>>> interfaces
>>>>> in
>>>>> ALB-mode (6)). Currently no host has more than 30 VMs at a time.
>>>>>
>>>>>>> We've also limited storage in QoS to use 10MB/s and 40 IOPS,
>>>>>>
>>>>>>
>>>>>> How did you configure this limit?
>>>>>
>>>>>
>>>>>
>>>>> In the Data Center tab, I chose our DC and a QoS sub-tab appears,
>>>>> just
>>>>> like
>>>>> described here: http://www.ovirt.org/documentation/sla/network-qos/
>>>>>
>>>>>>
>>>>>>> but this issue still happens, which makes me wonder whether this
>>>>>>> is really just an IOPS issue; each host handles approx. 600 LVs.
>>>>>>> Could this be an issue too? I note that the LVM response times
>>>>>>> are low (~1-2 seconds) under normal conditions.
>>>>>>
>>>>>>
>>>>>> We recommend limiting the number of LVs per VG to 350. If you have
>>>>>> 276 disks on ds1, and the disks are using snapshots, you may have
>>>>>> too many LVs, which can cause slowdowns in lvm operations.
>>>>>>
>>>>>> Can you share the output of:
>>>>>>
>>>>>> vdsm-tool dump-volume-chains 5de4a000-a9c4-489c-8eee-10368647c413
>>>>>
>>>>>
>>>>>
>>>>> Ok, right now no VG has more than 350 LVs, so I guess this is not
>>>>> currently the problem. Unfortunately, I ran the vdsm-tool command
>>>>> but it didn't finish or produce any output within approx. 1 hour,
>>>>> so I guessed it had hung and stopped it. If you confirm this is a
>>>>> normal execution time, I can leave it running for however long it
>>>>> takes.
>>>>>
>>>>> host5:~# vgs
>>>>>   VG                                   #PV #LV #SN Attr   VSize   VFree
>>>>>   0927c050-6fb6-463c-bb37-8b8da641dcd3   1  63   0 wz--n- 499,62g 206,50g
>>>>>   5de4a000-a9c4-489c-8eee-10368647c413   1 335   0 wz--n-   2,00t 518,62g
>>>>>   b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8   1 215   0 wz--n-   2,00t 495,25g
>>>>>   sys_vg                                 1   6   0 wz--n- 136,44g 122,94g
>>>>>>>
>>>>>>>
>>>>>>> I'm attaching the vdsm.log, engine.log and repoplot PDF;
>>>>>>
>>>>>>
>>>>>> This is very useful, thanks. Can you send also the vdsm logs and
>>>>>> repoplots
>>>>>> from other hosts for the same timeframe?
>>>>>>
>>>>>>> if someone could point out any additional problems in them and
>>>>>>> shed some light on the thoughts above, I'd be very grateful.
>>>>>>
>>>>>>
>>>>>> Do you have sar configured on the host? Having sar logs could
>>>>>> reveal more info about this timeframe.
>>>>>>
>>>>>> Do you have information about amount of io from vms during this
>>>>>> timeframe?
>>>>>> amount of io on the storage backend during this timeframe?
>>>>>
>>>>>
>>>>>
>>>>> Not currently, but I'll watch for the next time this happens, check
>>>>> some I/O-related sar commands, and provide feedback. Nevertheless,
>>>>> at the time I ran the commands above no machine was unresponsive,
>>>>> and I still got such huge execution times. I tried running iotop
>>>>> just now and I see there are 2 vgck and vgs processes, each reading
>>>>> at a rate of ~500 KB/s.
>>>>>
>>>>> root 24446 1.3 0.0 57008 6824 ? D< 09:15 0:00 /usr/sbin/lvm vgs
>>>>> --config devices { preferred_names = ["^/dev/mapper/"]
>>>>> ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3
>>>>> filter = [
>>>>> 'a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|',
>>>>> 'r|.*|' ] } global { locking_type=1 prioritise_write_locks=1
>>>>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 }
>>>>> --noheadings --units b --nosuffix --separator |
>>>>> --ignoreskippedcluster -o
>>>>> uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>>>> b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d
>>>>>
>>>>> I'm also attaching the iostat output for host 5 in a file called
>>>>> iostat-host5.
>>>>
>>>>
>>>> Interesting: all hosts are checking the same storage, but on host5
>>>> and host6 the output of vgs and vgck is 10 times bigger - this
>>>> explains why they take about 10 times longer to run.
>>>>
>>>> $ ls -lh vgs-*
>>>> -rw-r--r--. 1 nsoffer nsoffer 221K Apr 30 08:55 vgs-host1
>>>> -rw-r--r--. 1 nsoffer nsoffer 239K Apr 30 08:55 vgs-host2
>>>> -rw-r--r--. 1 nsoffer nsoffer 228K Apr 30 08:55 vgs-host3
>>>> -rw-r--r--. 1 nsoffer nsoffer 288K Apr 30 08:55 vgs-host4
>>>> -rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host5
>>>> -rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host6
>>>> -rw-r--r--. 1 nsoffer nsoffer 232K Apr 30 08:55 vgs-host7
>>>> $ ls -lh vgck-*
>>>> -rw-r--r--. 1 nsoffer nsoffer 238K Apr 30 08:55 vgck-host1
>>>> -rw-r--r--. 1 nsoffer nsoffer 254K Apr 30 08:55 vgck-host2
>>>> -rw-r--r--. 1 nsoffer nsoffer 244K Apr 30 08:55 vgck-host3
>>>> -rw-r--r--. 1 nsoffer nsoffer 298K Apr 30 08:55 vgck-host4
>>>> -rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host5
>>>> -rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host6
>>>> -rw-r--r--. 1 nsoffer nsoffer 248K Apr 30 08:55 vgck-host7
>>>>
>>>> Can you collect the output of "dmsetup table -v" on hosts 1, 5,
>>>> and 6?
>>>>
>>>> Nir
>>>
>>>
>>> Sure, please find attached results of the command. Once again, for
>>> hosts 5
>>> and 6 the size is 10 times bigger than for host 1.
>>
>> I think the issue causing the slow vgs and vgck commands is stale
>> lvs on hosts 5 and 6. This may also be the root cause of the paused
>> vms, but I cannot explain how it is related yet.
>>
>> Comparing vgck on host1 and host5: on host1, vgck opens
>> 213 /dev/mapper/* devices - actually 53 unique devices, most of them
>> opened 4 times. On host5, vgck opens 2489 devices (622 unique).
>> This explains why the operation takes about 10 times longer.
>>
>> Checking the dmsetup table output, we can see that host1 has 53
>> devices, while host5 has 622 devices.
>>
>> Checking the device open count, host1 has 15 stale devices
>> (Open count: 0), but host5 has 597 stale devices.
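For reference, this per-host comparison can be reproduced with a short awk helper over the dmsetup output (the `dm_summary` name is mine; it assumes `dmsetup info -c --noheadings -o open,name` prints the open count in the first column):

```shell
# dm_summary - read "open_count name" lines and report total vs stale devices
dm_summary() {
    awk '{ total++ } $1 == 0 { stale++ } END { printf "total=%d stale=%d\n", total, stale + 0 }'
}

# On each host:
#   dmsetup info -c --noheadings -o open,name | dm_summary
```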
>>
>> Leaving stale devices behind is a known issue, but we never had
>> evidence that it causes trouble beyond warnings in lvm commands.
>>
>> Please open an ovirt bug about this issue, and include all the files
>> you sent
>> so far in this thread, and all the vdsm logs on host5.
>>
>> To remove the stale devices, you can do:
>>
>> for name in $(dmsetup info -c --noheadings -o open,name | awk '$1 == 0 {print $2}'); do
>>     echo "removing stale device: $name"
>>     dmsetup remove "$name"
>> done
>
> Note that this cleanup is safe only in maintenance mode, since it will
> also remove the mappings for vdsm's special lvs (ids, leases, inbox,
> outbox, master, metadata), which are accessed by vdsm regularly and
> are not stale.
>
> This is also not safe if you use hosted engine; the hosted engine
> agent should maintain its own vg.
>
> The safest way to remove stale devices is to restart vdsm - it
> deactivates all unused
> lvs (except special lvs) during startup.
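A sketch of a more defensive variant of the cleanup loop above that skips vdsm's special lvs (the `filter_removable` name is mine; it assumes the device-mapper names for those lvs end in `-ids`, `-leases`, etc., which is worth verifying on your hosts before relying on it):

```shell
# filter_removable - read "open_count name" lines, print names that are
# stale (open count 0) and are not one of vdsm's special lvs
filter_removable() {
    awk '$1 == 0 { print $2 }' \
        | grep -Ev -e '-(ids|leases|inbox|outbox|master|metadata)$'
}

# On a host in maintenance:
#   dmsetup info -c --noheadings -o open,name | filter_removable \
#       | while read -r name; do dmsetup remove "$name"; done
```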
>
Hm, I ran it without even putting the host into maintenance, but I will
next time. Fortunately it seems it didn't do any harm and everything is
working. I'll probably write a script that puts the host into
maintenance, cleans up or restarts vdsm, and activates it again (this is
not a hosted engine) whenever the number of stale devices exceeds a
threshold.
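For what it's worth, a minimal sketch of the decision part of such a script (the threshold of 100 and the `stale_count` helper are my own assumptions; putting the host into maintenance via the engine is left out):

```shell
# stale_count - count input lines whose first field (dm open count) is 0
stale_count() { awk '$1 == 0' | wc -l | tr -d ' '; }

# On the host, after moving it to maintenance in the engine:
#   n=$(dmsetup info -c --noheadings -o open,name | stale_count)
#   if [ "$n" -gt 100 ]; then
#       systemctl restart vdsmd   # vdsm deactivates unused lvs on startup
#   fi
```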
Thanks again!
Nicolás
>> This will eliminate the slow lvm commands issue. The vm pausing issue
>> may or may not be related; let's see if this makes any difference.
>>
>> Nir