ovirt-ha-agent and too many open files error

Hello,

I have a 4.0 test environment (single host with self-hosted engine) where I have 6 VMs defined (5 running) and not much activity. I don't monitor this system very much. Now I have connected to it to evaluate an upgrade to 4.0.1, and I see that about 15 days ago the ovirt-ha-agent died because of too many open files....

[root@ractor ovirt-hosted-engine-ha]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Fri 2016-07-22 16:39:49 CEST; 2 weeks 4 days ago
 Main PID: 72795 (code=exited, status=0/SUCCESS)

Jul 22 16:39:47 ractor.mydomain ovirt-ha-agent[72795]: self.set_file(fd)
Jul 22 16:39:47 ractor.mydomain ovirt-ha-agent[72795]: File "/usr/lib64/python2.7/asyncore.py", line 657, in set_file
Jul 22 16:39:47 ractor.mydomain ovirt-ha-agent[72795]: self.socket = file_wrapper(fd)
Jul 22 16:39:47 ractor.mydomain ovirt-ha-agent[72795]: File "/usr/lib64/python2.7/asyncore.py", line 616, in __init__
Jul 22 16:39:47 ractor.mydomain ovirt-ha-agent[72795]: self.fd = os.dup(fd)
Jul 22 16:39:47 ractor.mydomain ovirt-ha-agent[72795]: OSError: [Errno 24] Too many open files
Jul 22 16:39:47 ractor.mydomain ovirt-ha-agent[72795]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Shutting down the agent because of 3 failures in a row!
Jul 22 16:39:47 ractor.mydomain ovirt-ha-agent[72795]: ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Shutting down the agent because of 3 failures in a row!
Jul 22 16:39:49 ractor.mydomain ovirt-ha-agent[72795]: WARNING:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:The VM is running locally or we have no data, keeping the domain monitor.
Jul 22 16:39:49 ractor.mydomain ovirt-ha-agent[72795]: INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down

Is this a known problem, or is there any reason to investigate? It seems very strange to have reached this limit.

I presume the agent runs as the vdsm user and that the oVirt installation creates the file /etc/security/limits.d/99-vdsm.conf with:

# This limits are intended for medium VDSM hosts, for large hosts scale these
# numbers appropriately.

# nproc should be the maximum amount of storage operations usage.
# VMs run by "qemu" user, vm processes are not relavent to "vdsm" user limits.
vdsm - nproc 4096

# nofile should be at least 3(stdin,stdour,stderr) * each external process.
# 3 * 4096 = 12288
vdsm - nofile 12288

As a rough estimate (an overestimate, actually, due to many duplicates) I now have:

# lsof -u vdsm | wc -l
488

Anything else to check?

Gianluca
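A note on the check above: the nofile limit applies to each process separately, while lsof -u vdsm sums over every process of that user and also lists entries that are not descriptors (working directory, mapped files). A per-process count is therefore closer to what the limit is compared against. The following is a minimal sketch, assuming a Linux host with /proc mounted and Python 3 available; the helper names are made up for illustration and are not part of oVirt or vdsm, and reading another user's process needs root.

```python
import os


def count_open_fds(pid):
    """Return the number of descriptors listed in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def read_nofile_limits(pid):
    """Return the (soft, hard) 'Max open files' values from /proc/<pid>/limits.

    The values are returned as strings because the kernel may print
    'unlimited' instead of a number.
    """
    with open("/proc/%d/limits" % pid) as limits_file:
        for line in limits_file:
            if line.startswith("Max open files"):
                fields = line.split()
                # Columns: "Max" "open" "files" <soft> <hard> "files"
                return fields[3], fields[4]
    raise RuntimeError("no 'Max open files' row found for pid %d" % pid)


def describe(pid):
    """Build a one-line summary of descriptor usage versus limits."""
    soft, hard = read_nofile_limits(pid)
    return "pid %d: %d open fds (soft limit %s, hard limit %s)" % (
        pid, count_open_fds(pid), soft, hard)


# Demonstration on the current process; on the host one would pass the
# PID of the running agent instead (an assumption: it must be running).
print(describe(os.getpid()))
```

Calling describe() for the agent's PID at intervals would show whether the count keeps growing toward the soft limit, which is what a descriptor leak looks like; a stable count would point elsewhere.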

On Tue, Aug 9, 2016 at 4:59 PM, Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
> [...]
> Is this sort of known problem or any reason to investigate?
> [...]
> Anything else to check?
Ciao Gianluca,
can you please report which vdsm version you are using there? We had a similar issue in the past, but it should already be solved: https://bugzilla.redhat.com/1343005
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
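For readers unfamiliar with the failure in the traceback (os.dup() raising Errno 24), it can be reproduced in isolation. This is only a sketch of the general mechanism, not the agent's code: it assumes, purely for illustration, a loop that duplicates a descriptor and never closes the copies, under a temporarily lowered soft limit. The function name and the limit of 64 are arbitrary choices, and the hard limit is assumed to be at least that large.

```python
import errno
import os
import resource


def dup_until_failure(limit=64):
    """Leak dup()'d descriptors under a lowered soft limit.

    Returns (number_of_descriptors_leaked, errno_of_the_failure).
    The original soft limit is restored before returning.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    leaked = []
    failure = None
    # Lowering the soft limit needs no privileges; it can be raised
    # back up to the hard limit afterwards.
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    try:
        read_end, write_end = os.pipe()
        leaked.extend([read_end, write_end])
        while True:
            # Each copy stays open: this is the illustrative "leak".
            leaked.append(os.dup(read_end))
    except OSError as exc:
        failure = exc.errno
    finally:
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return len(leaked), failure


count, err = dup_until_failure()
print("leaked %d descriptors before failing with %s"
      % (count, errno.errorcode.get(err, err)))
```

With the real limit of 12288 the same thing simply takes longer, which would fit a lightly loaded host running for weeks before the agent gives up.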

On Tue, Aug 9, 2016 at 6:54 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:
> Ciao Gianluca,
> can you please report which vdsm version you are using there? We had a similar issue in the past, but it should already be solved: https://bugzilla.redhat.com/1343005
Ciao,
the version installed at the moment is vdsm-4.18.4.1-0.el7.centos.x86_64 (the one provided by oVirt 4.0). I see in the referenced Bugzilla entry that it should be fixed in 4.0.1, the version I want to upgrade to...

Gianluca
participants (2)
- Gianluca Cecchi
- Simone Tiraboschi