[ovirt-users] VDSM and iSCSI issue due to ulimits

12 Sep 2014

      Today I got an alert from monitoring systems that my EL7 MariaDB VM
was down.  I looked in oVirt and it showed the VM as "paused due to
storage I/O problem".  That VM is the only one currently attached to
disks that are on an iSCSI storage domain.  That iSCSI storage domain
uses iSER (over DDR Infiniband).

My instance is made up of two ovirt nodes, both with CentOS 6.5
running oVirt 3.4.3.  The host that was running this VM has 16 running
VMs total.

I looked at the vdsm logs for the hypervisor and found log entries
with "No free file handlers in pool".  Below are logs.

I quick "google" showed old posts about it being a ulimit problem.

Right now after resuming the paused VM I see this on the VDSM server

# lsof -u vdsm | wc -l
2377
# sudo -H -u vdsm -s ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1032295
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 12288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

If the issue was due to number of open files, is it the case that if
the only VM using the iSCSI storage domain was paused, the file
descriptors were flushed?

How do I go about finding out the cause of this, so I can prevent
future issues?  Is the fix to just increase the "nofile" ulimit for
vdsm in /etc/security/limits.d/99-vdsm.conf?

Thanks,
- Trey

LOGS:

Sep 12 12:10:49 ovirtnode01 vdsm TaskManager.Task ERROR
Task=`f467714c-8948-4cc3-94af-efba6b67701c`::Unexpected
error#012Traceback (most recent call last):#012  File
"/usr/share/vdsm/storage/task.py", line 873, in _run#012    return
fn(*args, **kargs)#012  File "/usr/share/vdsm/logUtils.py", line 45,
in wrapper#012    res = f(*args, **kwargs)#012  File
"/usr/share/vdsm/storage/hsm.py", line 3055, in getVolumeSize#012
apparentsize = str(dom.getVSize(imgUUID, volUUID))#012  File
"/usr/share/vdsm/storage/fileSD.py", line 314, in getVSize#012
return self.oop.os.stat(volPath).st_size#012  File
"/usr/share/vdsm/storage/remoteFileHandler.py", line 312, in
callCrabRPCFunction#012    raise Exception("No free file handlers in
pool")#012Exception: No free file handlers in pool
Sep 12 12:10:49 ovirtnode01 vdsm vm.Vm ERROR
vmId=`967dce86-63c2-412a-97c5-d7c6f1af8dfb`::Unable to update the
volume 92c6cfd0-c236-44ab-894d-cb80421dd865 (domain:
6da59b00-0de7-4219-960b-d581b27052b5 image:
06060f34-12c5-4396-bfe3-ad0f1d4b03fc) for the drive vda
Sep 12 12:10:54 ovirtnode01 sanlock[3379]: 2014-09-12 12:10:54-0500
1773913 [5185]: s1 delta_renew read rv -202 offset 0
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data/6da59b00-0de7-4219-960b-d581b27052b5/dom_md/ids
Sep 12 12:10:54 ovirtnode01 sanlock[3379]: 2014-09-12 12:10:54-0500
1773913 [5185]: s1 renewal error -202 delta_length 10 last_success
1773883
<repeats a few times>
Sep 12 12:10:58 ovirtnode01 vdsm vm.Vm ERROR
vmId=`151972dc-0025-470e-bdfe-38d3b085a63c`::Unable to update the
volume 1b589b1b-aeb8-4a7a-a005-d5042de47f36 (domain:
6da59b00-0de7-4219-960b-d581b27052b5 image:
5d9456a3-12e1-4e50-b899-b12430a4fdb9) for the drive vda
Sep 12 12:10:58 ovirtnode01 vdsm vm.Vm ERROR
vmId=`b15f90ff-f359-447a-b309-2cf64d01d0ce`::Unable to update the
volume 70a0fee4-2dd3-4460-94da-9ddbfb66845b (domain:
6da59b00-0de7-4219-960b-d581b27052b5 image:
4e133730-f549-4b49-a8e9-baedaca4c1f1) for the drive vda
<snip>

Then lines like this:

Sep 12 12:11:49 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:49-0500
1773968 [5185]: s1 delta_renew read rv -2 offset 0
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data/6da59b00-0de7-4219-960b-d581b27052b5/dom_md/ids
Sep 12 12:11:49 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:49-0500
1773968 [5185]: s1 renewal error -2 delta_length 10 last_success
1773883
Sep 12 12:11:49 ovirtnode01 vdsm TaskManager.Task ERROR
Task=`b39a6636-451e-409c-9a81-7392000206d3`::Unexpected
error#012Traceback (most recent call last):#012  File
"/usr/share/vdsm/storage/task.py", line 873, in _run#012    return
fn(*args, **kargs)#012  File "/usr/share/vdsm/logUtils.py", line 45,
in wrapper#012    res = f(*args, **kwargs)#012  File
"/usr/share/vdsm/storage/hsm.py", line 3055, in getVolumeSize#012
apparentsize = str(dom.getVSize(imgUUID, volUUID))#012  File
"/usr/share/vdsm/storage/fileSD.py", line 314, in getVSize#012
return self.oop.os.stat(volPath).st_size#012  File
"/usr/share/vdsm/storage/remoteFileHandler.py", line 297, in
callCrabRPCFunction#012    *args, **kwargs)#012  File
"/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in
callCrabRPCFunction#012    rawLength =
self._recvAll(LENGTH_STRUCT_LENGTH, timeout)#012  File
"/usr/share/vdsm/storage/remoteFileHandler.py", line 150, in
_recvAll#012    raise Timeout()#012Timeout
Sep 12 12:11:49 ovirtnode01 vdsm vm.Vm ERROR
vmId=`ca242da9-b7ad-41e2-a26a-db2fea2e49d7`::Unable to update the
volume 03baf076-ef59-41f6-9cab-8cd64cd89307 (domain:
6da59b00-0de7-4219-960b-d581b27052b5 image:
d1960266-0a06-4b2a-a15d-141612aff740) for the drive vda
Sep 12 12:11:59 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:59-0500
1773978 [5185]: 6da59b00 close_task_aio 0 0x7f3dfc0008c0 busy
Sep 12 12:11:59 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:59-0500
1773978 [5185]: 6da59b00 close_task_aio 1 0x7f3dfc000910 busy
Sep 12 12:11:59 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:59-0500
1773978 [5185]: 6da59b00 close_task_aio 2 0x7f3dfc000960 busy

[ovirt-users] VDSM and iSCSI issue due to ulimits

Trey Dockendorf