[ovirt-users] VMs becoming non-responsive sporadically
nicolas at devels.es
Fri Apr 29 18:17:20 UTC 2016
Hi,
We're running oVirt 3.6.5.3-1 and lately we've been experiencing issues
with some VMs being paused because they're marked as non-responsive.
Mostly they recover after a few seconds, but we want to debug this
problem precisely so we can fix it for good.
Our scenario is the following:
~495 VMs, of which ~120 are constantly up
3 datastores, all of them iSCSI-based:
* ds1: 2T, currently has 276 disks
* ds2: 2T, currently has 179 disks
* ds3: 500G, currently has 65 disks
7 hosts: all with mostly the same hardware. CPU and memory usage is
currently very low (< 10%).
ds1 and ds2 are physically the same backend, which exports two 2TB
volumes. ds3 is a different storage backend to which we're currently
migrating some disks from ds1 and ds2.
Usually, when VMs become unresponsive, the whole host they run on
becomes unresponsive too, which hints at the problem: my bet is that
the culprit is on the host side and not on the VM side. When that
happens, the host itself goes non-responsive and is only recoverable
after a reboot, since it's unable to reconnect. I must say this is not
specific to this oVirt version; the same happened when we were using
v3.6.4. It's also worth mentioning that we haven't made any
configuration changes and everything had been working quite well for a
long time.
We were monitoring the physical backend behind ds1 and ds2 and we
suspect we've run out of IOPS, since we're reaching the maximum
specified by the manufacturer; at certain times the host probably
cannot complete a storage operation within some time limit and marks
the VMs as unresponsive. That's why we've set up ds3 and are migrating
disks from ds1 and ds2 to it. When we run out of space on ds3 we'll
create more, smaller volumes and keep migrating.
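In case it helps anyone reproduce the measurements, below is a minimal
sketch of what I have in mind for sampling IOPS directly on a host from
the kernel block statistics. The device name "dm-3" is only a
placeholder; the real multipath device backing ds1/ds2 would come from
multipath -ll on each host.

#!/usr/bin/env python3
# Rough sketch: estimate IOPS on a block device by sampling completed I/Os.
# DEVICE is a placeholder; use the multipath device that backs the iSCSI LUN.
import time

DEVICE = "dm-3"     # hypothetical device name, not taken from our setup
INTERVAL = 5.0      # seconds between samples

def completed_ios(dev):
    # /sys/block/<dev>/stat: fields[0] = reads completed,
    # fields[4] = writes completed (0-indexed)
    with open("/sys/block/%s/stat" % dev) as f:
        fields = f.read().split()
    return int(fields[0]) + int(fields[4])

prev = completed_ios(DEVICE)
while True:
    time.sleep(INTERVAL)
    cur = completed_ios(DEVICE)
    print("%s: %.1f IOPS" % (DEVICE, (cur - prev) / INTERVAL))
    prev = cur

Comparing these numbers against the manufacturer's rated IOPS while a
pause event happens should tell us whether the ceiling is really being
hit.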
On the host side, when this happens, we've run repoplot on the vdsm log
and I'm attaching the result. Clearly there's a *huge* LVM response time
(~30 secs.). Our host storage network is correctly configured on a
1G interface, with no errors on the host itself, the switches, etc.
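To take vdsm out of the latency measurement, something as simple as
timing an LVM metadata scan in a loop while the problem occurs should
show the same spikes that repoplot does. Just a sketch, with an
arbitrary threshold and nothing taken from our actual setup:

#!/usr/bin/env python3
# Rough sketch: time `vgs` periodically so slow LVM responses can be
# correlated with the timestamps of VM pauses in engine.log / vdsm.log.
import subprocess
import time

while True:
    start = time.time()
    # `vgs` reads metadata from the iSCSI PVs, roughly what vdsm polls.
    subprocess.run(["vgs", "--noheadings"],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.time() - start
    print("%s  vgs took %.1f s" % (time.strftime("%H:%M:%S"), elapsed))
    if elapsed > 10:    # arbitrary threshold for "suspiciously slow"
        print("  -> slow LVM response, check storage load at this time")
    time.sleep(30)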
We've also limited storage QoS to 10 MB/s and 40 IOPS, but the issue
still happens, which makes me wonder whether this is really just an
IOPS problem. Each host handles about 600 LVs; could that be an issue
too? Note that LVM response times are low under normal conditions
(~1-2 seconds).
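For completeness, this is roughly how the ~600 LVs per host figure
could be double-checked, counting LVs per VG (one VG per iSCSI storage
domain). Again just a sketch, nothing here is specific to our
environment:

#!/usr/bin/env python3
# Rough sketch: count logical volumes per volume group on a host.
import collections
import subprocess

out = subprocess.run(["lvs", "--noheadings", "-o", "vg_name"],
                     capture_output=True, text=True).stdout
counts = collections.Counter(l.strip() for l in out.splitlines() if l.strip())
for vg, n in counts.most_common():
    print("%-40s %4d LVs" % (vg, n))
print("total: %d LVs" % sum(counts.values()))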
I'm attaching the vdsm.log, engine.log and the repoplot PDF; if someone
could point out any additional problems in them and shed some light on
the above thoughts, I'd be very grateful.
Regards,
Nicolás
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vdsm.zip
Type: application/zip
Size: 889978 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20160429/304f7215/attachment-0002.zip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: engine.zip
Type: application/zip
Size: 101272 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20160429/304f7215/attachment-0003.zip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vdsm.log.pdf
Type: application/pdf
Size: 90476 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20160429/304f7215/attachment-0001.pdf>