On Sat, Apr 30, 2016 at 10:28 PM, Nir Soffer
<nsoffer(a)redhat.com>
wrote:
> On Sat, Apr 30, 2016 at 7:16 PM, <nicolas(a)devels.es> wrote:
>> El 2016-04-30 16:55, Nir Soffer escribió:
>>>
>>> On Sat, Apr 30, 2016 at 11:33 AM, Nicolás <nicolas(a)devels.es> wrote:
>>>>
>>>> Hi Nir,
>>>>
>>>> El 29/04/16 a las 22:34, Nir Soffer escribió:
>>>>>
>>>>>
>>>>> On Fri, Apr 29, 2016 at 9:17 PM, <nicolas(a)devels.es> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We're running oVirt 3.6.5.3-1 and lately we're
experiencing some
>>>>>> issues
>>>>>> with
>>>>>> some VMs being paused because they're marked as
non-responsive.
>>>>>> Mostly,
>>>>>> after a few seconds they recover, but we want to debug precisely
>>>>>> this
>>>>>> problem so we can fix it consistently.
>>>>>>
>>>>>> Our scenario is the following:
>>>>>>
>>>>>> ~495 VMs, of which ~120 are constantly up
>>>>>> 3 datastores, all of them iSCSI-based:
>>>>>> * ds1: 2T, currently has 276 disks
>>>>>> * ds2: 2T, currently has 179 disks
>>>>>> * ds3: 500G, currently has 65 disks
>>>>>> 7 hosts: All have mostly the same hardware. CPU and memory are
>>>>>> currently
>>>>>> very lowly used (< 10%).
>>>>>>
>>>>>> ds1 and ds2 are physically the same backend which exports two
>>>>>> 2TB
>>>>>> volumes.
>>>>>> ds3 is a different storage backend where we're currently
>>>>>> migrating some
>>>>>> disks from ds1 and ds2.
>>>>>
>>>>>
>>>>> What the the storage backend behind ds1 and 2?
>>>>
>>>>
>>>>
>>>> The storage backend for ds1 and ds2 is the iSCSI-based HP LeftHand
>>>> P4000
>>>> G2.
>>>>
>>>>>> Usually, when VMs become unresponsive, the whole host where they
>>>>>> run
>>>>>> gets
>>>>>> unresponsive too, so that gives a hint about the problem, my bet
>>>>>> is the
>>>>>> culprit is somewhere on the host side and not on the VMs side.
>>>>>
>>>>>
>>>>> Probably the vm became unresponsive because connection to the host
>>>>> was
>>>>> lost.
>>>>
>>>>
>>>>
>>>> I forgot to mention that less commonly we have situations where the
>>>> host
>>>> doesn't get unresponsive but the VMs on it do and they don't
become
>>>> responsive ever again, so we have to forcibly power them off and
>>>> start
>>>> them
>>>> on a different host. But in this case the connection with the host
>>>> doesn't
>>>> ever get lost (so basically the host is Up, but any VM run on them
>>>> is
>>>> unresponsive).
>>>>
>>>>
>>>>>> When that
>>>>>> happens, the host itself gets non-responsive and only recoverable
>>>>>> after
>>>>>> reboot, since it's unable to reconnect.
>>>>>
>>>>>
>>>>> Piotr, can you check engine log and explain why host is not
>>>>> reconnected?
>>>>>
>>>>>> I must say this is not specific to
>>>>>> this oVirt version, when we were using v.3.6.4 the same happened,
>>>>>> and
>>>>>> it's
>>>>>> also worthy mentioning we've not done any configuration
changes
>>>>>> and
>>>>>> everything had been working quite well for a long time.
>>>>>>
>>>>>> We were monitoring our ds1 and ds2 physical backend to see
>>>>>> performance
>>>>>> and
>>>>>> we suspect we've run out of IOPS since we're reaching the
maximum
>>>>>> specified
>>>>>> by the manufacturer, probably at certain times the host cannot
>>>>>> perform
>>>>>> a
>>>>>> storage operation within some time limit and it marks VMs as
>>>>>> unresponsive.
>>>>>> That's why we've set up ds3 and we're migrating ds1
and ds2 to
>>>>>> ds3.
>>>>>> When
>>>>>> we
>>>>>> run out of space on ds3 we'll create more smaller volumes to
keep
>>>>>> migrating.
>>>>>>
>>>>>> On the host side, when this happens, we've run repoplot on
the
>>>>>> vdsm log
>>>>>> and
>>>>>> I'm attaching the result. Clearly there's a *huge* LVM
response
>>>>>> time
>>>>>> (~30
>>>>>> secs.).
>>>>>
>>>>>
>>>>> Indeed the log show very slow vgck and vgs commands - these are
>>>>> called
>>>>> every
>>>>> 5 minutes for checking the vg health and refreshing vdsm lvm
>>>>> cache.
>>>>>
>>>>> 1. starting vgck
>>>>>
>>>>> Thread-96::DEBUG::2016-04-29
>>>>> 13:17:48,682::lvm::290::Storage.Misc.excCmd::(cmd)
>>>>> /usr/bin/taskset
>>>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgck --config '
>>>>> devices
>>>>> { pre
>>>>> ferred_names = ["^/dev/mapper/"]
ignore_suspended_devices=1
>>>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>>>>
'\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\
>>>>> '', '\''r|.*|'\'' ] } global {
locking_type=1
>>>>> prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 }
>>>>> backup {
>>>>> retain_min = 50 retain_days = 0 } ' 5de4a000-a9c4-48
>>>>> 9c-8eee-10368647c413 (cwd None)
>>>>>
>>>>> 2. vgck ends after 55 seconds
>>>>>
>>>>> Thread-96::DEBUG::2016-04-29
>>>>> 13:18:43,173::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS:
<err>
>>>>> = '
>>>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>>>> enabling it!\n'; <rc> = 0
>>>>>
>>>>> 3. starting vgs
>>>>>
>>>>> Thread-96::DEBUG::2016-04-29
>>>>> 13:17:11,963::lvm::290::Storage.Misc.excCmd::(cmd)
>>>>> /usr/bin/taskset
>>>>> --cpu-list 0-23 /usr/bin/sudo -n /usr/sbin/lvm vgs --config '
>>>>> devices
>>>>> { pref
>>>>> erred_names = ["^/dev/mapper/"] ignore_suspended_devices=1
>>>>> write_cache_state=0 disable_after_error_count=3 filter = [
>>>>> '\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/de
>>>>>
>>>>>
>>>>>
v/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|/dev/mapper/WDC_WD5003ABYZ-011FA0_WD-WMAYP0J73DU6|'\'',
>>>>> '\''r|.*|'\'' ] } global {
>>>>> locking_type=1 prioritise_write_locks=1 wait_for_locks=1
>>>>> use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } '
>>>>> --noheadings --units b --nosuffix --separator '|
>>>>> ' --ignoreskippedcluster -o
>>>>>
>>>>>
>>>>>
uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>>>> 5de4a000-a9c4-489c-8eee-10368
>>>>> 647c413 (cwd None)
>>>>>
>>>>> 4. vgs finished after 37 seconds
>>>>>
>>>>> Thread-96::DEBUG::2016-04-29
>>>>> 13:17:48,680::lvm::290::Storage.Misc.excCmd::(cmd) SUCCESS:
<err>
>>>>> = '
>>>>> WARNING: lvmetad is running but disabled. Restart lvmetad before
>>>>> enabling it!\n'; <rc> = 0
>>>>>
>>>>> Zdenek, how do you suggest to debug this slow lvm commands?
>>>>>
>>>>> Can you run the following commands on a host in trouble, and on
>>>>> some
>>>>> other
>>>>> hosts in the same timeframe?
>>>>>
>>>>> time vgck -vvvv --config ' devices { filter =
>>>>>
['\''a|/dev/mapper/36000eb3a4f1acbc20000000000000043|'\'',
>>>>> '\''r|.*|'\'' ] } global { locking_type=1
>>>>> prioritise_write_locks=1
>>>>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>>>>> retain_days = 0 } ' 5de4a000-a9c4-489c-8eee-10368647c413
>>>>>
>>>>> time vgs -vvvv --config ' global { locking_type=1
>>>>> prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 }
>>>>> backup {
>>>>> retain_min = 50 retain_days = 0 } '
>>>>> 5de4a000-a9c4-489c-8eee-10368647c413
>>>>>
>>>>> Note that I added -vvvv to both commands, this will generate huge
>>>>> amount
>>>>> of debugging info. Please share the output of these commands.
>>>>>
>>>>> You may need to fix the commands. You can always copy and paste
>>>>> directly
>>>>> from vdsm log into the shell and add the -vvvv flag.
>>>>>
>>>>
>>>> Indeed, there seems to be a big difference on hosts 5 and 6. I'm
>>>> attaching
>>>> the results of the execution of both commands on all hosts. Both
>>>> commands
>>>> show a pretty bigger output on hosts 5 and 6, and also a much
>>>> bigger
>>>> execution time. Times are also attached in a file called TIMES.
>>>>
>>>>>> Our host storage network is correctly configured and on a 1G
>>>>>> interface, no errors on the host itself, switches, etc.
>>>>>
>>>>>
>>>>> 1G serving about 20-70 vms per host? (495 vms, 120 always up, 7
>>>>> hosts)?
>>>>>
>>>>> Do you have separate network for management and storage, or both
>>>>> use this 1G interface?
>>>>
>>>>
>>>>
>>>> Yes, we have separate networks for management, storage and motion.
>>>> Storage
>>>> and motion have 1G each (plus, for storage we use a bond of 2
>>>> interfaces
>>>> in
>>>> ALB-mode (6)). Currently no host has more than 30 VMs at a time.
>>>>
>>>>>> We've also limited storage in QoS to use 10MB/s and 40 IOPS,
>>>>>
>>>>>
>>>>> How did you configure this limit?
>>>>
>>>>
>>>>
>>>> In the Data Center tab, I chose our DC and a QoS sub-tab appears,
>>>> just
>>>> like
>>>> described here:
http://www.ovirt.org/documentation/sla/network-qos/
>>>>
>>>>>
>>>>>> but this issue
>>>>>> still happens, which leads me to be concerned whether this is not
>>>>>> just
>>>>>> an
>>>>>> IOPS issue; each host handles about cca. 600 LVs. Could this be
>>>>>> an
>>>>>> issue
>>>>>> too? I remark the LVM response times are low in normal conditions
>>>>>> (~1-2
>>>>>> seconds).
>>>>>
>>>>>
>>>>> We recommend to limit number lvs per vg to 350. If you have 276
>>>>> disks on
>>>>> ds1, and the disks are using snapshots, you may have too many lvs,
>>>>> which
>>>>> can cause slowdowns in lvm operations.
>>>>>
>>>>> Can you share the output of:
>>>>>
>>>>> vdsm-tool dump-volume-chains 5de4a000-a9c4-489c-8eee-10368647c413
>>>>
>>>>
>>>>
>>>> Ok, right now no VG has more than 350 VGs so I guess this is not
>>>> currently
>>>> the problem. Unfortunately, I run the vdsm-tool command but it
>>>> didn't end
>>>> nor provide any output in cca. 1 hour, so I guess it was hanged and
>>>> I
>>>> stopped it. If you confirm this is a normal execution time I can
>>>> leave it
>>>> running whatever time it takes.
>>>>
>>>> host5:~# vgs
>>>> VG #PV #LV #SN Attr VSize
>>>> VFree
>>>> 0927c050-6fb6-463c-bb37-8b8da641dcd3 1 63 0 wz--n- 499,62g
>>>> 206,50g
>>>> 5de4a000-a9c4-489c-8eee-10368647c413 1 335 0 wz--n- 2,00t
>>>> 518,62g
>>>> b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d8 1 215 0 wz--n- 2,00t
>>>> 495,25g
>>>> sys_vg 1 6 0 wz--n- 136,44g
>>>> 122,94g
>>>>>>
>>>>>>
>>>>>> I'm attaching the vdsm.log, engine.log and repoplot PDF;
>>>>>
>>>>>
>>>>> This is very useful, thanks. Can you send also the vdsm logs and
>>>>> repoplots
>>>>> from other hosts for the same timeframe?
>>>>>
>>>>>> if someone could
>>>>>> give a hint on some additional problems in them and shed some
>>>>>> light on
>>>>>> the
>>>>>> above thoughts I'd be very grateful.
>>>>>
>>>>>
>>>>> Do you have sar configured on the host? having sar logs can reveal
>>>>> more
>>>>> info about this timeframe.
>>>>>
>>>>> Do you have information about amount of io from vms during this
>>>>> timeframe?
>>>>> amount of io on the storage backend during this timeframe?
>>>>
>>>>
>>>>
>>>> Not currently, but I'll be alert for the time this happens again
>>>> and I'll
>>>> check some sar commands related to I/O and I'll provide feedback.
>>>> Nevertheless, by the time I run the commands above no machine was
>>>> unresponsive and I'm still getting such huge execution times. I
>>>> tried
>>>> running iotop now and I see there are 2 vgck and vgs processes with
>>>> a
>>>> rate
>>>> of ~500Kb/s each for reading.
>>>>
>>>> root 24446 1.3 0.0 57008 6824 ? D< 09:15 0:00
>>>> /usr/sbin/lvm vgs --config devices { preferred_names =
>>>> ["^/dev/mapper/"]
>>>> ignore_suspended_devices=1 write_cache_state=0
>>>> disable_after_error_count=3
>>>> filter = [
>>>>
>>>>
'a|/dev/mapper/36000eb3a4f1acbc20000000000000043|/dev/mapper/36000eb3a4f1acbc200000000000000b9|/dev/mapper/360014056f0dc8930d744f83af8ddc709|',
>>>> 'r|.*|' ] } global { locking_type=1 prioritise_write_locks=1
>>>> wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50
>>>> retain_days
>>>> = 0
>>>> } --noheadings --units b --nosuffix --separator |
>>>> --ignoreskippedcluster
>>>> -o
>>>>
>>>>
uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
>>>> b13b9eac-1f2e-4a7e-bcd9-49f5f855c3d
>>>>
>>>> I'm also attaching the iostat output for host 5 in a file called
>>>> iostat-host5.
>>>
>>>
>>> Interesting, all hosts are checking the same storage, but on host5
>>> and
>>> host6
>>> the output of vgs and vgck is 10 times bigger - this explain why
>>> they take
>>> about
>>> 10 times longer to run.
>>>
>>> $ ls -lh vgs-*
>>> -rw-r--r--. 1 nsoffer nsoffer 221K Apr 30 08:55 vgs-host1
>>> -rw-r--r--. 1 nsoffer nsoffer 239K Apr 30 08:55 vgs-host2
>>> -rw-r--r--. 1 nsoffer nsoffer 228K Apr 30 08:55 vgs-host3
>>> -rw-r--r--. 1 nsoffer nsoffer 288K Apr 30 08:55 vgs-host4
>>> -rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host5
>>> -rw-r--r--. 1 nsoffer nsoffer 2.2M Apr 30 08:55 vgs-host6
>>> -rw-r--r--. 1 nsoffer nsoffer 232K Apr 30 08:55 vgs-host7
>>> $ ls -lh vgck-*
>>> -rw-r--r--. 1 nsoffer nsoffer 238K Apr 30 08:55 vgck-host1
>>> -rw-r--r--. 1 nsoffer nsoffer 254K Apr 30 08:55 vgck-host2
>>> -rw-r--r--. 1 nsoffer nsoffer 244K Apr 30 08:55 vgck-host3
>>> -rw-r--r--. 1 nsoffer nsoffer 298K Apr 30 08:55 vgck-host4
>>> -rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host5
>>> -rw-r--r--. 1 nsoffer nsoffer 2.0M Apr 30 08:55 vgck-host6
>>> -rw-r--r--. 1 nsoffer nsoffer 248K Apr 30 08:55 vgck-host7
>>>
>>> Can you do collect the output of "dmsetup table -v" on host 1, 5,
>>> and 6?
>>>
>>> Nir
>>
>>
>> Sure, please find attached results of the command. Once again, for
>> hosts 5
>> and 6 the size is 10 times bigger than for host 1.
>
> I think the issue causing slow vgs and vgck commands is stale
> lvs on host 5 and 6. This may be the root caused for paused vms, but I
> cannot
> explain how it is related yet.
>
> Comparing vgck on host1 and host5 - we can see that vgck opens
> 213 /dev/mapper/* devices, actually 53 uniq devices, most of them
> are opened 4 times. On host5, vgck opens 2489 devices (622 uniq).
> This explains why the operation takes about 10 times longer.
>
> Checking dmsteup table output, we can see that host1 has 53 devices,
> and host5 622 devices.
>
> Checking the device open count, host1 has 15 stale devices
> (Open count: 0), but host5 has 597 stale devices.
>
> Leaving stale devices is a known issue, but we never had evidence that
> it cause
> trouble except warnings in lvm commands.
>
> Please open an ovirt bug about this issue, and include all the files
> you sent
> so far in this thread, and all the vdsm logs on host5.
>
> To remove the stale devices, you can do:
>
> for name in `dmsetup info -c -o open,name | awk '/ 0 / {print $2}'`;
> do
> echo "removing stale device: $name"
> dmsetup remove $name
> done
Note that this cleanup is safe only in maintenance mode, since it will
also remove
the mapping for vdsm special lvs (ids, leases, inbox, outbox, master,
metatada)
that are accessed by vdsm regularly, and are not stale.
This is also not safe if you use hosted engine, hosted engine agent
should maintain
its vg.
The safest way to remove stale devices is to restart vdsm - it
deactivates all unused
lvs (except special lvs) during startup.
Hm, I run it without even putting the host on maintenance, but I will
next time. Fortunately it seems it didn't do any harm and everything is
working. I'll probably do a script that will put host on maintenance,
cleanup or restart vdsm and start again (this is not a hosted engine) if
the number of stale devices is bigger than a threshold.
Thanks again!
Nicolás
> This will eliminate the slow lvm commands issue. The vm pausing
issue
> may
> be related or not, lets see if make any difference.
>
> Nir