[ovirt-users] VMs becoming non-responsive sporadically
nicolas at devels.es
Fri Apr 29 18:17:20 UTC 2016
Hi,
We're running oVirt 3.6.5.3-1 and lately we've been experiencing issues
with some VMs being paused because they're marked as non-responsive.
Mostly they recover after a few seconds, but we want to debug this
problem precisely so we can fix it for good.
Our scenario is the following:
~495 VMs, of which ~120 are constantly up
3 datastores, all of them iSCSI-based:
* ds1: 2T, currently has 276 disks
* ds2: 2T, currently has 179 disks
* ds3: 500G, currently has 65 disks
7 hosts: all with mostly the same hardware. CPU and memory usage is
currently very low (< 10%).
ds1 and ds2 are physically the same backend, which exports two 2TB
volumes. ds3 is a different storage backend to which we're currently
migrating some disks from ds1 and ds2.
Usually, when VMs become unresponsive, the whole host they run on
becomes unresponsive too, which hints at the problem: my bet is that
the culprit is on the host side and not on the VM side. When that
happens, the host itself goes non-responsive and is only recoverable
after a reboot, since it's unable to reconnect. I must say this is not
specific to this oVirt version; the same happened when we were using
v3.6.4. It's also worth mentioning that we haven't made any
configuration changes and everything had been working quite well for a
long time.
We were monitoring the physical backend behind ds1 and ds2 and we
suspect we've run out of IOPS, since we're reaching the maximum
specified by the manufacturer; at certain times the host probably
cannot complete a storage operation within some time limit and marks
the VMs as unresponsive. That's why we've set up ds3 and are migrating
disks from ds1 and ds2 to it. When we run out of space on ds3 we'll
create more, smaller volumes and keep migrating.
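In case it helps anyone reproduce the measurements, below is a minimal
sketch of what I have in mind for sampling IOPS directly on a host from
the kernel block statistics. The device name "dm-3" is only a
placeholder; the real multipath device backing ds1/ds2 would come from
multipath -ll on each host.

#!/usr/bin/env python3
# Rough sketch: estimate IOPS on a block device by sampling completed I/Os.
# DEVICE is a placeholder; use the multipath device that backs the iSCSI LUN.
import time

DEVICE = "dm-3"     # hypothetical device name, not taken from our setup
INTERVAL = 5.0      # seconds between samples

def completed_ios(dev):
    # /sys/block/<dev>/stat: fields[0] = reads completed,
    # fields[4] = writes completed (0-indexed)
    with open("/sys/block/%s/stat" % dev) as f:
        fields = f.read().split()
    return int(fields[0]) + int(fields[4])

prev = completed_ios(DEVICE)
while True:
    time.sleep(INTERVAL)
    cur = completed_ios(DEVICE)
    print("%s: %.1f IOPS" % (DEVICE, (cur - prev) / INTERVAL))
    prev = cur

Comparing these numbers against the manufacturer's rated IOPS while a
pause event happens should tell us whether the ceiling is really being
hit.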
On the host side, when this happens, we've run repoplot on the vdsm log
and I'm attaching the result. Clearly there's a *huge* LVM response time
(~30 secs.). Our host storage network is correctly configured on a
1G interface, with no errors on the host itself, the switches, etc.
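To take vdsm out of the latency measurement, something as simple as
timing an LVM metadata scan in a loop while the problem occurs should
show the same spikes that repoplot does. Just a sketch, with an
arbitrary threshold and nothing taken from our actual setup:

#!/usr/bin/env python3
# Rough sketch: time `vgs` periodically so slow LVM responses can be
# correlated with the timestamps of VM pauses in engine.log / vdsm.log.
import subprocess
import time

while True:
    start = time.time()
    # `vgs` reads metadata from the iSCSI PVs, roughly what vdsm polls.
    subprocess.run(["vgs", "--noheadings"],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.time() - start
    print("%s  vgs took %.1f s" % (time.strftime("%H:%M:%S"), elapsed))
    if elapsed > 10:    # arbitrary threshold for "suspiciously slow"
        print("  -> slow LVM response, check storage load at this time")
    time.sleep(30)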
We've also limited storage QoS to 10 MB/s and 40 IOPS, but the issue
still happens, which makes me wonder whether this is really just an
IOPS problem. Each host handles about 600 LVs; could that be an issue
too? Note that LVM response times are low under normal conditions
(~1-2 seconds).
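For completeness, this is roughly how the ~600 LVs per host figure
could be double-checked, counting LVs per VG (one VG per iSCSI storage
domain). Again just a sketch, nothing here is specific to our
environment:

#!/usr/bin/env python3
# Rough sketch: count logical volumes per volume group on a host.
import collections
import subprocess

out = subprocess.run(["lvs", "--noheadings", "-o", "vg_name"],
                     capture_output=True, text=True).stdout
counts = collections.Counter(l.strip() for l in out.splitlines() if l.strip())
for vg, n in counts.most_common():
    print("%-40s %4d LVs" % (vg, n))
print("total: %d LVs" % sum(counts.values()))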
I'm attaching the vdsm.log, engine.log and the repoplot PDF; if someone
could point out any additional problems in them and shed some light on
the above thoughts, I'd be very grateful.
Regards,
Nicolás
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vdsm.zip
Type: application/zip
Size: 889978 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20160429/304f7215/attachment-0002.zip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: engine.zip
Type: application/zip
Size: 101272 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20160429/304f7215/attachment-0003.zip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vdsm.log.pdf
Type: application/pdf
Size: 90476 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20160429/304f7215/attachment-0001.pdf>