[ovirt-devel] oVirt node 3.6 and CPU load indefinitely stuck on 100% while vdsmd indefinitely tries to restart

Douglas Schilling Landgraf dougsland at redhat.com
Mon Jun 1 12:58:43 UTC 2015


On 06/01/2015 03:56 AM, Simone Tiraboschi wrote:
>
>
> ----- Original Message -----
>> From: "Douglas Schilling Landgraf" <dougsland at redhat.com>
>> To: "Simone Tiraboschi" <stirabos at redhat.com>, devel at ovirt.org
>> Cc: "Fabian Deutsch" <fdeutsch at redhat.com>
>> Sent: Saturday, May 30, 2015 11:28:38 PM
>> Subject: Re: oVirt node 3.6 and CPU load indefinitely stuck on 100% while vdsmd indefinitely tries to restart
>>
>> On 05/29/2015 06:44 AM, Simone Tiraboschi wrote:
>>> Hi,
>>> I tried to have hosted-engine deploy the engine appliance on oVirt
>>> node. I think it will be quite a common scenario.
>>> I tried with an oVirt node build from yesterday.
>>>
>>> Unfortunately I'm not able to complete the setup because oVirt node got
>>> the CPU load indefinitely stuck on 100%, so it's almost unresponsive.
>>>
>>> The issue seems to be related to the vdsmd daemon, which can't actually
>>> start and so retries indefinitely, using all the available CPU power (it
>>> also runs with niceness -20...).
>>>
>>> [root@node36 admin]# grep "Unit vdsmd.service entered failed state."
>>> /var/log/messages  | wc -l
>>> 368
>>> It tried 368 times in a row in a few minutes.
>>>
>>> With journalctl I can read:
>>> May 29 10:06:45 node36 systemd[1]: Unit vdsmd.service entered failed state.
>>> May 29 10:06:45 node36 systemd[1]: vdsmd.service holdoff time over,
>>> scheduling restart.
>>> May 29 10:06:45 node36 systemd[1]: Stopping Virtual Desktop Server
>>> Manager...
>>> May 29 10:06:45 node36 systemd[1]: Starting Virtual Desktop Server
>>> Manager...
>>> May 29 10:06:45 node36 vdsmd_init_common.sh[13697]: vdsm: Running mkdirs
>>> May 29 10:06:45 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> configure_coredump
>>> May 29 10:06:45 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> configure_vdsm_logs
>>> May 29 10:06:45 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> wait_for_network
>>> May 29 10:06:45 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> run_init_hooks
>>> May 29 10:06:46 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> upgraded_version_check
>>> May 29 10:06:46 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> check_is_configured
>>> May 29 10:06:46 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> validate_configuration
>>> May 29 10:06:47 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> prepare_transient_repository
>>> May 29 10:06:49 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> syslog_available
>>> May 29 10:06:49 node36 vdsmd_init_common.sh[13697]: vdsm: Running nwfilter
>>> May 29 10:06:50 node36 vdsmd_init_common.sh[13697]: vdsm: Running dummybr
>>> May 29 10:06:51 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> load_needed_modules
>>> May 29 10:06:51 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> tune_system
>>> May 29 10:06:51 node36 vdsmd_init_common.sh[13697]: vdsm: Running
>>> test_space
>>> May 29 10:06:51 node36 vdsmd_init_common.sh[13697]: vdsm: Running test_lo
>>> May 29 10:06:51 node36 systemd[1]: Started Virtual Desktop Server Manager.
>>> May 29 10:06:51 node36 systemd[1]: vdsmd.service: main process exited,
>>> code=exited, status=1/FAILURE
>>> May 29 10:06:51 node36 vdsmd_init_common.sh[13821]: vdsm: Running
>>> run_final_hooks
>>> May 29 10:06:52 node36 systemd[1]: Unit vdsmd.service entered failed state.
>>> May 29 10:06:52 node36 systemd[1]: vdsmd.service holdoff time over,
>>> scheduling restart.
>>> May 29 10:06:52 node36 systemd[1]: Stopping Virtual Desktop Server
>>> Manager...
>>> May 29 10:06:52 node36 systemd[1]: Starting Virtual Desktop Server
>>> Manager...
>>> (the same sequence repeats many times)
>>>
>>> /var/log/vdsm/vdsm.log is empty.
>>>
>>> while
>>> [root@node36 admin]# /usr/share/vdsm/daemonAdapter -0 /dev/null -1
>>> /dev/null -2 /dev/null /usr/share/vdsm/vdsm; echo $?
>>> 1
>>>
>>
>> Thanks for the report Simone. From my tests you are facing:
>>
>> non-root user cannot `from ovirtnode import ovirtfunctions`: permission
>> denied: '/var/log/ovirt-node.log' and '/var/log/ovirt.log'
>> https://bugzilla.redhat.com/show_bug.cgi?id=1224400
>>
>> We should handle this bug very soon. The workaround is to run chmod o+rw
>> on /var/log/ovirt.log and /var/log/ovirt-node.log
>
> OK. I tried
> [root@node36 admin]# chmod o+rw /var/log/ovirt.log /var/log/ovirt-node.log
>
> but now I'm getting:
> [root@node36 admin]# systemctl status -l vdsmd
> vdsmd.service - Virtual Desktop Server Manager
>     Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
>     Active: active (running) since Mon 2015-06-01 07:53:09 UTC; 17s ago
>    Process: 4040 ExecStopPost=/usr/libexec/vdsm/vdsmd_init_common.sh --post-stop (code=exited, status=0/SUCCESS)
>    Process: 4049 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
>   Main PID: 4164 (vdsm)
>     CGroup: /system.slice/vdsmd.service
>             └─4164 /usr/bin/python /usr/share/vdsm/vdsm
>
> Jun 01 07:53:07 node36 vdsmd_init_common.sh[4049]: vdsm: Running nwfilter
> Jun 01 07:53:08 node36 vdsmd_init_common.sh[4049]: vdsm: Running dummybr
> Jun 01 07:53:09 node36 vdsmd_init_common.sh[4049]: vdsm: Running load_needed_modules
> Jun 01 07:53:09 node36 vdsmd_init_common.sh[4049]: vdsm: Running tune_system
> Jun 01 07:53:09 node36 vdsmd_init_common.sh[4049]: vdsm: Running test_space
> Jun 01 07:53:09 node36 vdsmd_init_common.sh[4049]: vdsm: Running test_lo
> Jun 01 07:53:09 node36 systemd[1]: Started Virtual Desktop Server Manager.
> Jun 01 07:53:10 node36 vdsm[4164]: vdsm vds ERROR failed to init clientIF, shutting down storage dispatcher
> Jun 01 07:53:10 node36 vdsm[4164]: vdsm vds ERROR Exception raised
>                                     Traceback (most recent call last):
>                                       File "/usr/share/vdsm/vdsm", line 154, in run
>                                         serve_clients(log)
>                                       File "/usr/share/vdsm/vdsm", line 93, in serve_clients
>                                         cif = clientIF.getInstance(irs, log)
>                                       File "/usr/share/vdsm/clientIF.py", line 166, in getInstance
>                                       File "/usr/share/vdsm/clientIF.py", line 112, in __init__
>                                       File "/usr/share/vdsm/clientIF.py", line 170, in _createAcceptor
>                                       File "/usr/share/vdsm/clientIF.py", line 183, in _createSSLContext
>                                       File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 149, in __init__
>                                       File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 174, in _initContext
>                                       File "/usr/lib/python2.7/site-packages/vdsm/sslutils.py", line 153, in _loadCertChain
>                                       File "/usr/lib64/python2.7/site-packages/M2Crypto/SSL/Context.py", line 100, in load_cert_chain
>                                     SSLError: No such file or directory
> Jun 01 07:53:20 node36 vdsm[4164]: vdsm vds ERROR Vm's recovery failed
>                                     Traceback (most recent call last):
>                                       File "/usr/share/vdsm/clientIF.py", line 416, in _recoverExistingVms
>                                       File "/usr/share/vdsm/caps.py", line 177, in __init__
>                                       File "/usr/share/vdsm/caps.py", line 209, in _getCpuTopology
>                                       File "/usr/share/vdsm/caps.py", line 199, in _getFreshCapsXMLStr
>                                       File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 162, in get
>                                       File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 99, in open_connection
>                                       File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1008, in retry
>                                       File "/usr/lib64/python2.7/site-packages/libvirt.py", line 105, in openAuth
>                                     libvirtError: authentication failed: polkit: polkit\56retains_authorization_after_challenge=1
>                                     Authorization requires authentication but no agent is available.
>
> Was it just a partial workaround or am I facing a different issue?

This looks like a different issue; I will try to reproduce it locally.
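
To help tell the two failures apart (permission denied on the log files
versus the "No such file or directory" SSLError), here is a minimal sketch
that reports whether a set of files exists and is readable and writable by
the user running it. It is not part of vdsm; `check_paths` is a hypothetical
helper, and the only paths used are the two log files named in the bug above.

```python
import os


def check_paths(paths, mode=os.R_OK | os.W_OK):
    """Return {path: status}; status is 'missing', 'denied' or 'ok'.

    Hypothetical helper, not vdsm code. 'denied' means the file exists
    but the current user lacks the access requested in `mode`.
    """
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = "missing"
        elif not os.access(path, mode):
            report[path] = "denied"
        else:
            report[path] = "ok"
    return report


if __name__ == "__main__":
    # Log files named in bug 1224400.
    logs = ["/var/log/ovirt.log", "/var/log/ovirt-node.log"]
    for path, status in sorted(check_paths(logs).items()):
        print("%-30s %s" % (path, status))
```

It only gives a meaningful answer when run as the non-root user the daemon
runs as, since root passes the access check regardless of file mode. The
certificate files from the vdsm configuration could be passed in the same
way (with mode=os.R_OK) to see which one the SSLError refers to.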

-- 
Cheers
Douglas


