[Users] The SPM host node is in unresponsive mode

Haim Ateya hateya at redhat.com
Tue May 15 06:21:07 UTC 2012



----- Original Message -----
> From: "Shu Ming" <shuming at linux.vnet.ibm.com>
> To: "Haim Ateya" <hateya at redhat.com>
> Cc: "users at oVirt.org" <users at ovirt.org>
> Sent: Tuesday, May 15, 2012 9:03:42 AM
> Subject: Re: [Users] The SPM host  node is in unresponsive mode
> 
> On 2012-5-15 12:19, Haim Ateya wrote:
> >
> > ----- Original Message -----
> >> From: "Shu Ming"<shuming at linux.vnet.ibm.com>
> >> To: "users at oVirt.org"<users at ovirt.org>
> >> Sent: Tuesday, May 15, 2012 4:56:36 AM
> >> Subject: [Users] The SPM host  node is in unresponsive mode
> >>
> >> Hi,
> >>     I attached one host node to my engine.  Because it is the only
> >> node, it is automatically the SPM node, and it used to run well in my
> >> engine.  Yesterday, some network errors occurred on the host node,
> >> which made the node become "unresponsive" in the engine.  I am sure
> >> the network errors are fixed and want to bring the node back to life
> >> now.  However, I found that this single node could neither be
> >> confirmed as rebooted ("confirm host has been rebooted") nor be set
> >> into maintenance mode.  The reason given is that there is no active
> >> host in the datacenter and the SPM cannot enter maintenance mode.  It
> >> seems that it fell into a logic loop here.  Losing the network can be
> >> quite common in a development environment and even in a production
> >> environment, so I think we should have a way to repair a host node
> >> whose network has been down for a while.
> > Hi Shu,
> >
> > first, for the manual fence to work ("confirm host has been rebooted")
> > you will need another host in the cluster, which will be used as a
> > proxy to send the actual manual fence command.
> > second, you are absolutely right: loss of network is a common scenario
> > and we should be able to recover, but let's try to understand why your
> > host remains unresponsive after the network returned.
> > please ssh to the host and try the following:
> >
> > - vdsClient -s 0 getVdsCaps (sanity check that the vdsm service is up
> > and running and answers on its network socket from localhost)
> [root at ovirt-node1 ~]# vdsClient -s 0 getVdsCaps
> Connection to 9.181.129.110:54321 refused
> [root at ovirt-node1 ~]#
> 
> [root at ovirt-node1 ~]# ps -ef |grep vdsm
> root      1365     1  0 09:37 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
> root      5534  4652  0 13:53 pts/0    00:00:00 grep --color=auto vdsm
> [root at ovirt-node1 ~]# service vdsmd start
> Redirecting to /bin/systemctl  start vdsmd.service
> 
> [root at ovirt-node1 ~]# ps -ef |grep vdsm
> root      1365     1  0 09:37 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
> root      5534  4652  0 13:53 pts/0    00:00:00 grep --color=auto vdsm
> 
> It seems that the VDSM process was gone while the libvirtd spawned by
> VDSM was still there.  I then tried to start the VDSM daemon, but
> nothing happened.  The latest message in vdsm.log was from five hours
> ago and not useful.  There was no useful message in libvirtd.log
> either.

[HA] The problem is that systemctl doesn't show the real reason why the service didn't start; let's try the following:
- # cd /lib/systemd/
- # ./systemd-vdsmd restart
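
After the restart, one way to confirm that something is actually listening on the vdsm port is a plain TCP connect, independent of vdsClient. The following is only a minimal sketch: the helper name port_is_open is hypothetical, the port 54321 is taken from the refused connection shown above, and a successful connect says only that the port accepts connections, not that vdsm is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration; it does not speak the vdsm
    protocol, it only checks that the port accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 54321 is the port from the "Connection ... refused" message above.
    print(port_is_open("127.0.0.1", 54321))
```

Running it on the host against 127.0.0.1, and then from the engine against the host address, would help separate "vdsm is not listening at all" from "something between engine and host is blocking the port".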




> 
> 
> > - please ping between host and engine
>   It works both ways.
> 
> 
> > - please make sure there is no firewall blocking tcp 54321 (on
> > both host and engine)
> 
> No firewall.
> 
> >
> > also, please provide vdsm.log (from the time the network issues began)
> > and spm-lock.log (both located in /var/log/vdsm/).
> >
> > as for a mitigation, we can always manipulate the db and set it
> > correctly, but first, let's try the above.
> Also, there is no useful message in spm-lock.log.  The latest message
> was 24 hours ago.
> 
> >> --
> >> Shu Ming<shuming at linux.vnet.ibm.com>
> >> IBM China Systems and Technology Laboratory
> >>
> >>
> >> _______________________________________________
> >> Users mailing list
> >> Users at ovirt.org
> >> http://lists.ovirt.org/mailman/listinfo/users
> >>
> 
> 
> --
> Shu Ming<shuming at linux.vnet.ibm.com>
> IBM China Systems and Technology Laboratory
> 
> 
> 


