[Users] The SPM host node is in unresponsive mode

Itamar Heim iheim at redhat.com
Tue May 15 08:36:30 UTC 2012

On 05/15/2012 09:14 AM, Shu Ming wrote:
> Some errors in service status. Is engine-notifierd critical to VDSM? Why
> did it say "pgrep: invalid user name: engine"?

No. engine-notifierd just sends emails to users.

> [root at ovirt-node1 ~]# service --status-all
> /etc/init.d/ceph: ceph conf /etc/ceph/ceph.conf not found; system is not
> configured.
> # Generated by ebtables-save v1.0 on Tue May 15 14:08:06 CST 2012
> *nat
> pgrep: invalid user name: engine
> /etc/init.d/engine-notifierd is stopped
> JAVA_EXECUTABLE or HSQLDB_JAR_PATH in '/etc/sysconfig/hsqldb' is set to
> a non-file.
> No active sessions
>
> On 2012-5-15 12:19, Haim Ateya wrote:
>> ----- Original Message -----
>>> From: "Shu Ming"<shuming at linux.vnet.ibm.com>
>>> To: "users at oVirt.org"<users at ovirt.org>
>>> Sent: Tuesday, May 15, 2012 4:56:36 AM
>>> Subject: [Users] The SPM host node is in unresponsive mode
>>> Hi,
>>> I attached one host node to my engine. Because it is the only
>>> node, it is automatically the SPM node, and it used to run well in
>>> my engine. Yesterday, some network errors occurred on the host
>>> node, which made the node become "unresponsive" in the engine. I am
>>> sure the network errors are fixed, and I want to bring the node back
>>> to life now. However, I found that this single node could neither be
>>> confirmed with "confirm host has been rebooted" nor be put into
>>> maintenance mode. The reason given is that there is no active host
>>> in the datacenter and the SPM cannot enter maintenance mode. It
>>> seems that it fell into a logic loop here. Losing the network can be
>>> quite common in a development environment, and even in a production
>>> environment, so I think we should have a way to repair a host node
>>> whose network has been down for a while.
>> Hi Shu,
>> First, for the manual fence to work ("confirm host has been
>> rebooted"), you will need another host in the cluster, which will be
>> used as a proxy to send the actual manual fence command.
>> Second, you are absolutely right: loss of network is a common
>> scenario, and we should be able to recover. But let's try to
>> understand why your host remains unresponsive after the network
>> returned.
>> Please ssh to the host and try the following:
>> - vdsClient -s 0 getVdsCaps (a validity check that the vdsm service
>> is up and running and communicates over its network socket from localhost)
>> - ping between the host and the engine
>> - make sure no firewall is blocking TCP 54321 (on both
>> host and engine)
>> Also, please provide vdsm.log (from the time the network issues began) and
>> spm-lock.log (both located in /var/log/vdsm/).
>> As for a mitigation, we can always manipulate the DB and set it correctly,
>> but first, let's try the above.
>>> --
>>> Shu Ming<shuming at linux.vnet.ibm.com>
>>> IBM China Systems and Technology Laboratory
>>> _______________________________________________
>>> Users mailing list
>>> Users at ovirt.org
>>> http://lists.ovirt.org/mailman/listinfo/users
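
The firewall check on TCP 54321 suggested in the quoted reply can be approximated from either machine with a short script. This is only a minimal sketch and not a vdsm tool: the function name tcp_port_open and the host name in the usage note are hypothetical, and a successful connect only shows that something accepts connections on that port, not that vdsm itself is healthy.

```python
import socket

# Port number taken from the thread; everything else here is illustrative.
VDSM_PORT = 54321


def tcp_port_open(host: str, port: int = VDSM_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result means the connection was refused, timed out, or the
    host could not be resolved; it does not say which of these happened.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, running tcp_port_open("ovirt-node1.example.com") on the engine (the host name is a placeholder) returns False if a firewall drops the traffic or nothing is listening on the port. Running the same check from the host against the engine covers the other direction mentioned in the reply.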
