[Users] Host stuck in unresponsive state

Hi,
my all-in-one host is stuck in unresponsive state, no matter what I try. Found this error in my engine.log:
2013-08-31 23:53:59,325 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-54) Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
This message starts to appear when I try to activate the host (from maintenance state). Any idea how to activate the host?
Running oVirt 3.3 RC2 on Fedora 19, all-in-one setup.
Thanks - Frank

On 01.09.2013 00:00, Frank Wall wrote:
2013-08-31 23:53:59,325 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-54) Command GetCapabilitiesVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
OK, apparently vdsmd is dead:
[root@aio ~]# systemctl status vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: failed (Result: timeout) since So 2013-09-01 00:40:48 CEST; 53s ago
  Process: 1587 ExecStart=/lib/systemd/systemd-vdsmd start (code=killed, signal=TERM)
Sep 01 00:39:18 aio.example.com systemd[1]: Starting Virtual Desktop Server Manager...
Sep 01 00:39:19 aio.example.com systemd-vdsmd[1587]: Starting configure libvirt to VDSM ...
Sep 01 00:39:19 aio.example.com systemd-vdsmd[1587]: libvirt is already configured for vdsm
Sep 01 00:39:19 aio.example.com systemd-vdsmd[1587]: =Done configuring libvirt=
Sep 01 00:39:19 aio.example.com systemd-vdsmd[1587]: Starting iscsid...
Sep 01 00:40:48 aio.example.com systemd[1]: vdsmd.service operation timed out. Terminating.
Sep 01 00:40:48 aio.example.com systemd[1]: Failed to start Virtual Desktop Server Manager.
Sep 01 00:40:48 aio.example.com systemd[1]: Unit vdsmd.service entered failed state.
Thanks - Frank
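In the status output above, the last step logged before the timeout is the one worth looking at first. A minimal sketch of how to pull that line out of a longer journal follows; `last_step_before_timeout` is a hypothetical helper name, not part of vdsm or systemd, and it assumes the "Starting ..." and "operation timed out" wording shown in the log above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with vdsm or systemd).
# Reads journal-style lines on stdin and prints the last "Starting ..."
# message that was logged before the first "operation timed out" line.
last_step_before_timeout() {
    awk '
        # On the first timeout line, report the most recent start message.
        /operation timed out/ && !reported {
            if (last != "") print last
            reported = 1
        }
        # Remember every start message, stripped of timestamp and host prefix.
        /Starting / {
            sub(/^.*Starting /, "Starting ")
            last = $0
        }
    '
}
```

On a live host this could be fed from the journal, for example `journalctl -u vdsmd.service --no-pager | last_step_before_timeout`. For the log above it would point at the iscsid step.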

On 01.09.2013 00:56, Frank Wall wrote:
Sep 01 00:39:19 aio.example.com systemd-vdsmd[1587]: Starting iscsid...
Sep 01 00:40:48 aio.example.com systemd[1]: vdsmd.service operation timed out. Terminating.
OK, for some reason it got stuck trying to start "iscsid" and "multipathd". I was able to solve the issues with these services and now the real error message is visible:
[root@aio ~]# journalctl -xn
Sep 01 01:17:53 aio.example.com python[2780]: DIGEST-MD5 client step 2
Sep 01 01:17:53 aio.example.com python[2780]: DIGEST-MD5 ask_user_info()
Sep 01 01:17:53 aio.example.com python[2780]: DIGEST-MD5 make_client_response()
Sep 01 01:17:53 aio.example.com python[2780]: DIGEST-MD5 client step 3
Sep 01 01:17:53 aio.example.com kernel: vdsm-tool[2781]: segfault at 7fda699aca40 ip 00007fda699aca40 sp 00007fda565b4f38 error
Sep 01 01:17:53 aio.example.com systemd-vdsmd[2678]: /lib/systemd/systemd-vdsmd: line 185: 2780 Segmentation fault "$VDSM
Sep 01 01:17:53 aio.example.com systemd-vdsmd[2678]: vdsm: Failed to define network filters on libvirt[FAILED]
Sep 01 01:17:53 aio.example.com systemd[1]: vdsmd.service: control process exited, code=exited status=139
Sep 01 01:17:53 aio.example.com systemd[1]: Failed to start Virtual Desktop Server Manager.
And from the system log:
[ 1075.655610] vdsm-tool[2781]: segfault at 7fda699aca40 ip 00007fda699aca40 sp 00007fda565b4f38 error 15 in libpython2.7.so.1.0[7fda69997000+3e000]
So... what to do next?
Thanks - Frank

On 01.09.2013 01:28, Frank Wall wrote:
OK, for some reason it got stuck trying to start "iscsid" and "multipathd". I was able to solve the issues with these services and now the real error message is visible:
Did some more fiddling... I removed my /etc/multipath.conf and started with the new file. Apparently there is a syntax error in this auto-generated config:
[root@aio ~]# multipath -ll
Sep 01 00:32:27 | multipath.conf +5, invalid keyword: getuid_callout
Sep 01 00:32:27 | multipath.conf +18, invalid keyword: getuid_callout
OK, I removed lines 5 and 18 and now multipathd is working again. This time it was possible to successfully start vdsmd afterwards:
[root@aio ~]# systemctl status vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
   Active: active (running) since So 2013-09-01 16:25:45 CEST; 1min 30s ago
  Process: 3138 ExecStart=/lib/systemd/systemd-vdsmd start (code=exited, status=0/SUCCESS)
 Main PID: 3285 (respawn)
   CGroup: name=systemd:/system/vdsmd.service
           ├─3285 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /us...
           └─3288 /usr/bin/python /usr/share/vdsm/vdsm
Sep 01 16:25:45 aio.exmaple.com python[3288]: DIGEST-MD5 client step 2
Sep 01 16:25:45 aio.exmaple.com python[3288]: DIGEST-MD5 parse_server_challenge()
Sep 01 16:25:45 aio.exmaple.com python[3288]: DIGEST-MD5 ask_user_info()
Sep 01 16:25:45 aio.exmaple.com vdsm[3288]: vdsm vds WARNING Unable to load the json rpc server module. Please make su...alled.
Sep 01 16:25:45 aio.exmaple.com python[3288]: DIGEST-MD5 client step 2
Sep 01 16:25:45 aio.exmaple.com python[3288]: DIGEST-MD5 ask_user_info()
Sep 01 16:25:45 aio.exmaple.com python[3288]: DIGEST-MD5 make_client_response()
Sep 01 16:25:45 aio.exmaple.com python[3288]: DIGEST-MD5 client step 3
Sep 01 16:25:54 aio.exmaple.com vdsm[3288]: vdsm TaskManager.Task ERROR Task=`7fc3840c-1518-4260-9f27-ee20434b5a7a`::U... error
Sep 01 16:25:54 aio.exmaple.com vdsm[3288]: vdsm TaskManager.Task ERROR Task=`82f757b5-a669-40fa-b09d-9cad90c971e1`::U... error
Still, this doesn't feel right. I think vdsmd is just too unstable and vulnerable. Why did vdsmd core dump with another multipathd config in place? Why does it even have this strict dependency on multipathd?
There have been several similar reports in the last months and I wonder if there is a way to make vdsmd more stable. It would be better to have vdsmd start and report an error to ovirt-engine, instead of having the vdsmd service fail to start all the time. The current behaviour makes it hard to debug.
Thanks - Frank
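Instead of deleting lines by number, the offending keyword can be located and filtered by name. The following is only a sketch: `keyword_lines` and `strip_keyword` are hypothetical helper names, and whether removing `getuid_callout` is actually required is not settled in this thread, so the filtered output should be reviewed before it replaces anything.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a multipath.conf-style file.

# Print the line numbers on which the given keyword is set.
# $1 = keyword, $2 = config file
keyword_lines() {
    grep -n -E "^[[:space:]]*$1([[:space:]]|\$)" "$2" | cut -d: -f1
}

# Print the file contents without the lines that set the given keyword.
# The original file is left untouched.
# $1 = keyword, $2 = config file
strip_keyword() {
    grep -v -E "^[[:space:]]*$1([[:space:]]|\$)" "$2"
}
```

A possible use would be `keyword_lines getuid_callout /etc/multipath.conf` to confirm the reported line numbers, then `strip_keyword getuid_callout /etc/multipath.conf > /tmp/multipath.conf.new` to get a candidate file for comparison.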

Hi Frank,
I sometimes have (had) the same issues with all-in-one-setups, so I don't use local storage in all-in-one-setup anymore. Instead I share a directory on my node via NFS, create a new NFS datacenter and mount it locally. This might not be the best way to do it, but I have had better experience with this setup than with local storage.
Btw, when changing multipath.conf make sure you set "RHEV PRIVATE" below "RHEV REVISION X.Y" to avoid losing your changes during the next reboot.
With iSCSI and FC backends vdsm works fine in combination with multipath. In such setups multipath absolutely makes sense, but I also don't understand why multipathing is used for local storage: the disks are controlled by a (hardware) RAID controller and there's no alternate path oVirt could use in case of storage loss or for better throughput...
Regards, René
-----Original message-----
From:Frank Wall <fw@moov.de> Sent: Sunday 1st September 2013 16:40 To: users@ovirt.org Subject: Re: [Users] Host stuck in unresponsive state
Thanks - Frank _______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
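René's tip about the "RHEV PRIVATE" marker can be checked before rebooting. This is a minimal sketch, assuming the two markers appear as plain text on consecutive lines of the file (the exact comment syntax is not given in the thread); `has_private_marker` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if a line containing
# "RHEV PRIVATE" directly follows a line containing "RHEV REVISION".
# $1 = path to the multipath configuration file
has_private_marker() {
    awk '
        # The previous line carried the revision marker and this one is private.
        prev_is_revision && /RHEV PRIVATE/ { found = 1 }
        # Remember whether the current line carries the revision marker.
        { prev_is_revision = /RHEV REVISION/ }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

For example, `has_private_marker /etc/multipath.conf || echo "local changes may be overwritten"` would flag a file that lacks the marker.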

Hi René,
On 01.09.2013 19:55, René Koch wrote:
I sometimes have (had) the same issues with all-in-one-setups, so I don't use local storage in all-in-one-setup anymore.
thanks for the hint, I'll definitely give this a try if the problem persists. For now I have found that completely disabling all blocking services helps stabilize vdsmd startup:
[root@aio ~]# systemctl disable iscsid.service
[root@aio ~]# systemctl disable multipathd.service
So far vdsmd starts and runs without any issues.
Thanks - Frank
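One detail about the commands above: `systemctl disable` only removes the unit from automatic start at boot, and an instance that is already running keeps running until it is stopped. A small dry-run sketch that prints both steps for each unit follows; `plan_disable` is a hypothetical helper name and it only prints commands, it does not run them. Note also that an earlier log in this thread shows systemd-vdsmd itself starting iscsid, so whether disabling the units is sufficient remains an open question.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the systemctl commands that would
# stop each given unit now and keep it from starting at boot.
# Nothing is executed; the output is meant to be reviewed first.
plan_disable() {
    for unit in "$@"; do
        printf 'systemctl stop %s\n' "$unit"
        printf 'systemctl disable %s\n' "$unit"
    done
}
```

Running `plan_disable iscsid.service multipathd.service` lists four commands; once they look right they could be run by hand as root.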

Hi Frank,
Can you attach the vdsm, engine and messages logs so we can better understand the issue? The multipath and iscsi daemons shouldn't be causing you any trouble. Also, the messages you saw from multipathd about lines 5 and 18 are just warnings; multipath knows to ignore these lines (they are there for backward compatibility).
Thanks, Yeela
----- Original Message -----
From: "Frank Wall" <fw@moov.de> To: users@ovirt.org Sent: Monday, September 2, 2013 12:41:27 PM Subject: Re: [Users] Host stuck in unresponsive state
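Gathering the logs Yeela asks for can be done in one step. The sketch below bundles whichever of the given files are readable into a single archive; `collect_logs` is a hypothetical helper name, and the paths in the usage note are assumed default locations that the thread itself does not state.

```shell
#!/bin/sh
# Hypothetical helper: pack the readable files among the arguments into a
# gzip-compressed tar archive and skip the rest with a note on stderr.
# $1 = archive to create, remaining arguments = candidate log files
collect_logs() {
    archive=$1
    shift
    count=$#
    # Append each readable file to the argument list, then drop the originals.
    for f in "$@"; do
        if [ -r "$f" ]; then
            set -- "$@" "$f"
        else
            echo "skipping unreadable file: $f" >&2
        fi
    done
    shift "$count"
    if [ "$#" -eq 0 ]; then
        echo "no readable log files found" >&2
        return 1
    fi
    tar -czf "$archive" "$@"
}
```

On an all-in-one host this might be called as `collect_logs /tmp/ovirt-logs.tar.gz /var/log/vdsm/vdsm.log /var/log/ovirt-engine/engine.log /var/log/messages`, with the paths adjusted if the installation differs.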