[Users] The SPM host node is in unresponsive mode
Haim Ateya
hateya at redhat.com
Tue May 15 06:21:07 UTC 2012
----- Original Message -----
> From: "Shu Ming" <shuming at linux.vnet.ibm.com>
> To: "Haim Ateya" <hateya at redhat.com>
> Cc: "users at oVirt.org" <users at ovirt.org>
> Sent: Tuesday, May 15, 2012 9:03:42 AM
> Subject: Re: [Users] The SPM host node is in unresponsive mode
>
> On 2012-5-15 12:19, Haim Ateya wrote:
> >
> > ----- Original Message -----
> >> From: "Shu Ming"<shuming at linux.vnet.ibm.com>
> >> To: "users at oVirt.org"<users at ovirt.org>
> >> Sent: Tuesday, May 15, 2012 4:56:36 AM
> >> Subject: [Users] The SPM host node is in unresponsive mode
> >>
> >> Hi,
> >> I attached one host node to my engine. Because it is the only node,
> >> it is automatically the SPM node, and it used to run well in my
> >> engine. Yesterday, some network errors occurred on the host node,
> >> which made the node become "unresponsive" in the engine. I am sure
> >> the network errors are fixed, and I want to bring the node back to
> >> life now. However, I found that I could not use "confirm as host
> >> been rebooted" on this single node, nor set it into maintenance
> >> mode. The reason given is that there is no active host in the
> >> datacenter and the SPM cannot enter maintenance mode. It seems to
> >> be stuck in a logic loop here. Losing the network can be quite
> >> common in a development environment, and even in a production
> >> environment. I think we should have a way to repair a host node
> >> whose network has been down for a while.
> > Hi Shu,
> >
> > first, for the manual fence to work ("confirm host have been
> > rebooted") you will need another host in the cluster, which will be
> > used as a proxy to send the actual manual fence command.
> > second, you are absolutely right: loss of network is a common
> > scenario, and we should be able to recover, but let's try to
> > understand why your host remains unresponsive after the network
> > returned.
> > please ssh to the host and try the following:
> >
> > - vdsClient -s 0 getVdsCaps (validity check making sure the vdsm
> > service is up and running and communicates with its network socket
> > from localhost)
> [root@ovirt-node1 ~]# vdsClient -s 0 getVdsCaps
> Connection to 9.181.129.110:54321 refused
> [root@ovirt-node1 ~]#
>
> [root@ovirt-node1 ~]# ps -ef |grep vdsm
> root 1365 1 0 09:37 ? 00:00:00 /usr/sbin/libvirtd --listen # by vdsm
> root 5534 4652 0 13:53 pts/0 00:00:00 grep --color=auto vdsm
> [root@ovirt-node1 ~]# service vdsmd start
> Redirecting to /bin/systemctl start vdsmd.service
>
> [root@ovirt-node1 ~]# ps -ef |grep vdsm
> root 1365 1 0 09:37 ? 00:00:00 /usr/sbin/libvirtd --listen # by vdsm
> root 5534 4652 0 13:53 pts/0 00:00:00 grep --color=auto vdsm
>
> It seems that the VDSM process was gone while the libvirtd spawned by
> VDSM was still there. I then tried to start the VDSM daemon, but it
> did nothing. In vdsm.log, the latest message was from five hours
> earlier and was not useful. There was no useful message in
> libvirtd.log either.
[HA] The problem is that systemctl doesn't show the real reason why the service didn't start; let's try the following:
- # cd /lib/systemd/
- # ./systemd-vdsmd restart
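
If the restart script also fails without a clear message, a small helper along these lines could collect what systemd itself recorded for the unit. This is only a sketch: show_service_failure is a hypothetical name, not part of vdsm or oVirt, and the journal query assumes a systemd version whose journalctl supports the -u option, which may not be the case on this host.

```shell
#!/bin/sh
# Sketch: print what systemd recorded about a unit that refuses to start.
# show_service_failure is a hypothetical helper name, not part of vdsm or oVirt.
show_service_failure() {
    unit="${1:-vdsmd.service}"

    # Bail out early on hosts that are not managed by systemd.
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found; is this host running systemd?" >&2
        return 127
    fi

    # 'systemctl status' exits non-zero for a stopped or failed unit,
    # so its exit code is deliberately ignored here.
    systemctl --no-pager status "$unit" || true

    # Show the most recent journal entries for the unit, if journalctl exists.
    # Assumption: this journalctl supports -u (unit filter) and -n (line count).
    if command -v journalctl >/dev/null 2>&1; then
        journalctl --no-pager -u "$unit" -n 50
    fi
}
```

Calling it as "show_service_failure vdsmd.service" on the host prints the unit state first and then the last log lines, which is usually where a start-up error would appear.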
>
>
> > - please ping between host and engine
> It works in both ways.
>
>
> > - please make sure there is no firewall blocking TCP port 54321 (on
> > both host and engine)
>
> No firewall.
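
To double-check the port from the engine side rather than relying on the firewall configuration alone, something like the following could be used. It is a rough sketch: check_port is a hypothetical helper name, and it assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are installed. Note that a refused connection can also simply mean that nothing is listening, as is the case while vdsmd is down.

```shell
#!/bin/bash
# Sketch: test whether a TCP port accepts connections, using bash's /dev/tcp.
# check_port is a hypothetical helper name; it needs bash and coreutils' timeout.
check_port() {
    local host="$1" port="$2"

    if [ -z "$host" ] || [ -z "$port" ]; then
        echo "usage: check_port HOST PORT" >&2
        return 2
    fi

    # Open the connection in a child bash so the descriptor closes when it exits;
    # give up after 3 seconds in case a firewall silently drops the packets.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        echo "$host:$port is reachable"
    else
        echo "$host:$port is NOT reachable (refused, filtered, or timed out)"
        return 1
    fi
}
```

For example, "check_port 9.181.129.110 54321" run on the engine machine tests the vdsm port on the host address that appears earlier in this thread.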
>
> >
> > also, please provide vdsm.log (from the time the network issues began)
> > and spm-lock.log (both located in /var/log/vdsm/).
> >
> > as for a mitigation, we can always manipulate the db and set it
> > correctly, but first, let's try the above.
> Also, there is no useful message in spm-lock.log. The latest message
> was 24 hours ago.
>
> >> --
> >> Shu Ming<shuming at linux.vnet.ibm.com>
> >> IBM China Systems and Technology Laboratory
>
>
> --
> Shu Ming<shuming at linux.vnet.ibm.com>
> IBM China Systems and Technology Laboratory
>
>
>