----- Original Message -----
From: "Shu Ming" <shuming(a)linux.vnet.ibm.com>
To: "Haim Ateya" <hateya(a)redhat.com>
Cc: "users(a)oVirt.org" <users(a)ovirt.org>
Sent: Tuesday, May 15, 2012 9:03:42 AM
Subject: Re: [Users] The SPM host node is in unresponsive mode
On 2012-5-15 12:19, Haim Ateya wrote:
>
> ----- Original Message -----
>> From: "Shu Ming"<shuming(a)linux.vnet.ibm.com>
>> To: "users@oVirt.org"<users(a)ovirt.org>
>> Sent: Tuesday, May 15, 2012 4:56:36 AM
>> Subject: [Users] The SPM host node is in unresponsive mode
>>
>> Hi,
>> I attached one host node to my engine. Because it is the only node,
>> it automatically became the SPM node, and it used to run well in my
>> engine. Yesterday, some errors happened in the network of the host
>> node, which made the node become "unresponsive" in the engine. I am
>> sure the network errors are fixed now and I want to bring the node
>> back to life. However, I found that the only node could neither be
>> "confirmed as host has been rebooted" nor be set into maintenance
>> mode. The reason given is that there is no active host in the data
>> center, so the SPM cannot enter maintenance mode. It seems that it
>> fell into a logic loop here. Losing the network can be quite common
>> in a development environment, and even in a production environment,
>> so I think we should have a way to repair a host node whose network
>> has been down for a while.
> Hi Shu,
>
> first, for manual fencing to work ("confirm host has been rebooted")
> you will need another host in the cluster, which will be used as a
> proxy to send the actual manual fence command.
> second, you are absolutely right, loss of network is a common
> scenario, and we should be able to recover, but let's try to
> understand why your host remained unresponsive after the network
> returned.
> please ssh to the host and try the following:
>
> - vdsClient -s 0 getVdsCaps (validity check, making sure the vdsm
> service is up and running and communicating over its network socket
> from localhost)
[root@ovirt-node1 ~]# vdsClient -s 0 getVdsCaps
Connection to 9.181.129.110:54321 refused
[root@ovirt-node1 ~]# ps -ef | grep vdsm
root      1365     1  0 09:37 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
root      5534  4652  0 13:53 pts/0    00:00:00 grep --color=auto vdsm
[root@ovirt-node1 ~]# service vdsmd start
Redirecting to /bin/systemctl start vdsmd.service
[root@ovirt-node1 ~]# ps -ef | grep vdsm
root      1365     1  0 09:37 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
root      5534  4652  0 13:53 pts/0    00:00:00 grep --color=auto vdsm
It seems that the VDSM process was gone while the libvirtd spawned by
VDSM was still there. Then I tried to start the VDSM daemon, but it
did nothing. After checking the vdsm.log file, the latest message was
from five hours ago and was not useful. Also, there was no useful
message in libvirtd.log.
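Since the connection to port 54321 was refused and the service start
printed nothing, one minimal extra check (a generic sketch, assuming a
systemd-based host as the "Redirecting to /bin/systemctl" output
suggests; these commands were not run in this thread) is to ask systemd
for the unit state and look at the system log:

  # systemctl status vdsmd.service
  # tail -n 50 /var/log/messages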
[HA] The problem is that systemctl does not show the real reason why
the service failed to start. Let's try the following:
- # cd /lib/systemd/
- # ./systemd-vdsmd restart
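If the restart succeeds, re-running the validity check from earlier in
the thread should confirm that vdsm is listening again (a sketch only,
assuming the same host; the expected result is the host capabilities
rather than "Connection ... refused"):

  # ./systemd-vdsmd restart
  # vdsClient -s 0 getVdsCaps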
> - please ping between host and engine
Ping works in both directions.
> - please make sure there is no firewall blocking tcp 54321 (on both
> host and engine)
No firewall.
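For reference, a quick way to double-check this on both machines
(generic iptables/netstat usage, not commands taken from this thread)
is to list the filter rules and, once vdsm is up, confirm that
something is listening on the port:

  # iptables -L -n | grep 54321
  # netstat -tlnp | grep 54321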
>
> also, please provide vdsm.log (from the time the network issues began)
> and spm-lock.log (both located in /var/log/vdsm/).
>
> as for a mitigation, we can always manipulate the db and set it
> correctly, but first, let's try the above.
Also, there is no useful message in spm-lock.log. The latest message
was 24 hours ago.
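On the db mitigation mentioned above: a hypothetical sketch only; the
database, table, and column names (engine, vds_dynamic, status) are
assumptions about the engine schema and are not confirmed in this
thread, so inspect the current values before changing anything and back
up the database first:

  # su - postgres -c "psql engine -c 'SELECT vds_id, status FROM vds_dynamic;'"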
>> --
>> Shu Ming<shuming(a)linux.vnet.ibm.com>
>> IBM China Systems and Technology Laboratory
--
Shu Ming<shuming(a)linux.vnet.ibm.com>
IBM China Systems and Technology Laboratory