[Users] The SPM host node is in unresponsive mode
Shu Ming
shuming at linux.vnet.ibm.com
Tue May 15 06:14:40 UTC 2012
There are some errors in the service status output. Is engine-notifierd
critical to VDSM? Why does it say "pgrep: invalid user name: engine"?
[root@ovirt-node1 ~]# service --status-all
/etc/init.d/ceph: ceph conf /etc/ceph/ceph.conf not found; system is not
configured.
# Generated by ebtables-save v1.0 on Tue May 15 14:08:06 CST 2012
*nat
:PREROUTING ACCEPT
:OUTPUT ACCEPT
:POSTROUTING ACCEPT
pgrep: invalid user name: engine
/etc/init.d/engine-notifierd is stopped
JAVA_EXECUTABLE or HSQLDB_JAR_PATH in '/etc/sysconfig/hsqldb' is set to
a non-file.
No active sessions
On 2012-5-15 12:19, Haim Ateya wrote:
>
> ----- Original Message -----
>> From: "Shu Ming"<shuming at linux.vnet.ibm.com>
>> To: "users at oVirt.org"<users at ovirt.org>
>> Sent: Tuesday, May 15, 2012 4:56:36 AM
>> Subject: [Users] The SPM host node is in unresponsive mode
>>
>> Hi,
>> I attached one host node to my engine. Because it is the only node,
>> it is automatically the SPM node, and it used to run well in my
>> engine. Yesterday, some errors occurred in the host node's network,
>> which made the node become "unresponsive" in the engine. I am sure
>> the network errors are fixed, and I want to bring the node back to
>> life now. However, I found that the "confirm as host been rebooted"
>> action could not be applied to this single node, and the node could
>> not be put into maintenance mode. The reason given is that there is
>> no active host in the datacenter and the SPM cannot enter maintenance
>> mode. It seems to have fallen into a logical loop here. Losing the
>> network can be quite common in a development environment, and even
>> in a production environment, so I think we should have a way to
>> repair a host node whose network has been down for a while.
> Hi Shu,
>
> First, for the manual fence to work ("confirm host have been rebooted") you will need
> another host in the cluster, which will be used as a proxy to send the actual manual fence command.
> Second, you are absolutely right: loss of network is a common scenario, and we should be able
> to recover from it. But let's try to understand why your host remains unresponsive after the network returned.
> Please ssh to the host and try the following:
>
> - vdsClient -s 0 getVdsCaps (a sanity check that the vdsm service is up and running and answering on its network socket from localhost)
> - please ping between the host and the engine
> - please make sure no firewall is blocking TCP 54321 (on both the host and the engine)
>
> Also, please provide vdsm.log (from the time the network issues began) and spm-lock.log (both located in /var/log/vdsm/).
>
> As for a mitigation, we can always manipulate the DB and set it correctly, but first, let's try the above.
>
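For the firewall item in the checklist above, a plain TCP connect to port
54321 can be scripted. The following is only a minimal sketch using Python's
standard library. The function name tcp_port_reachable and the example
address are hypothetical, and a successful connect shows only that the port
accepts connections, not that vdsm itself is answering correctly.

```python
import socket

VDSM_PORT = 54321  # TCP port named in the checklist above


def tcp_port_reachable(host, port=VDSM_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration only. It performs a bare TCP
    connect and closes the socket again without sending any data.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

Calling tcp_port_reachable("192.0.2.10") from the engine machine, with the
host's real address substituted for the placeholder, returns True when the
port accepts connections and False when the connection is refused, filtered,
or times out. The same call can be made from the host toward the engine.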
>> --
>> Shu Ming<shuming at linux.vnet.ibm.com>
>> IBM China Systems and Technology Laboratory
--
Shu Ming<shuming at linux.vnet.ibm.com>
IBM China Systems and Technology Laboratory