[Users] The SPM host node is in unresponsive mode
Hi, I attached one host node to my engine. Because it is the only node, it automatically became the SPM node, and it used to run well. Yesterday, some errors occurred in the host node's network, and the node became "unresponsive" in the engine. I am sure the network errors are fixed now and I want to bring the node back to life. However, I found that this single node can neither be "confirmed as rebooted" nor put into maintenance mode: the reasons given are that there is no active host in the datacenter, and that the SPM cannot enter maintenance mode. It seems the engine is stuck in a logic loop here. Losing the network is quite common in a development environment, and even in production, so I think we should have a way to repair a host node whose network has been down for a while.
--
Shu Ming <shuming@linux.vnet.ibm.com>
IBM China Systems and Technology Laboratory
----- Original Message -----
From: "Shu Ming" <shuming@linux.vnet.ibm.com> To: "users@oVirt.org" <users@ovirt.org> Sent: Tuesday, May 15, 2012 4:56:36 AM Subject: [Users] The SPM host node is in unresponsive mode
Hi Shu, first, for the manual fence ("Confirm host has been rebooted") to work, you need another host in the cluster to act as a proxy and send the actual manual fence command. Second, you are absolutely right: loss of network is a common scenario and we should be able to recover from it, but let's try to understand why your host remains unresponsive after the network returned. Please ssh to the host and try the following:
- vdsClient -s 0 getVdsCaps (a validity check, making sure the vdsm service is up and running and reachable on its network socket from localhost)
- please ping between host and engine, in both directions
- please make sure no firewall is blocking TCP 54321 (on both host and engine)
Also, please provide vdsm.log (from the time the network issues began) and spm-lock.log (both located in /var/log/vdsm/). As for a mitigation, we can always manipulate the db and set the status correctly, but first, let's try the above.
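Run from a shell, those checks look roughly like this (a sketch only; the engine and host addresses are placeholders, and the port checks assume stock netstat/iptables tooling):

vdsClient -s 0 getVdsCaps                 # on the host: does vdsm answer on its local socket?
ping -c 3 <engine-address>                # from the host
ping -c 3 <host-address>                  # from the engine
netstat -tlnp | grep 54321                # on the host: is vdsm listening on TCP 54321?
iptables -L -n | grep 54321               # on both sides: any REJECT/DROP rule for the port?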
On 2012-5-15 12:19, Haim Ateya wrote:
- vdsClient -s 0 getVdsCaps (a validity check, making sure the vdsm service is up and running and reachable on its network socket from localhost)
[root@ovirt-node1 ~]# vdsClient -s 0 getVdsCaps
Connection to 9.181.129.110:54321 refused

[root@ovirt-node1 ~]# ps -ef | grep vdsm
root      1365     1  0 09:37 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
root      5534  4652  0 13:53 pts/0    00:00:00 grep --color=auto vdsm

[root@ovirt-node1 ~]# service vdsmd start
Redirecting to /bin/systemctl start vdsmd.service

[root@ovirt-node1 ~]# ps -ef | grep vdsm
root      1365     1  0 09:37 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
root      5534  4652  0 13:53 pts/0    00:00:00 grep --color=auto vdsm

It seems the vdsm process was gone while the libvirtd it spawned was still there. I then tried to start the vdsm daemon, but it did nothing. Checking vdsm.log, the latest message was five hours old and not useful. There was also no useful message in libvirtd.log.
- please ping between host and engine, in both directions
It works in both directions.
- please make sure no firewall is blocking TCP 54321 (on both host and engine)
No firewall.
also, please provide vdsm.log (from the time the network issues began) and spm-lock.log (both located in /var/log/vdsm/).
as for a mitigation, we can always manipulate the db and set the status correctly, but first, let's try the above.
Also, there is no useful message in spm-lock.log. The latest message was 24 hours ago.
--
Shu Ming <shuming@linux.vnet.ibm.com>
IBM China Systems and Technology Laboratory
----- Original Message -----
From: "Shu Ming" <shuming@linux.vnet.ibm.com> To: "Haim Ateya" <hateya@redhat.com> Cc: "users@oVirt.org" <users@ovirt.org> Sent: Tuesday, May 15, 2012 9:03:42 AM Subject: Re: [Users] The SPM host node is in unresponsive mode
It seems the vdsm process was gone while the libvirtd it spawned was still there. I then tried to start the vdsm daemon, but it did nothing. The latest message in vdsm.log was five hours old and not useful, and there was nothing useful in libvirtd.log.
[HA] The problem is that systemctl doesn't show the real reason why the service didn't start. Let's try the following:
- # cd /lib/systemd/
- # ./systemd-vdsmd restart
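If the wrapper still hides the error, the unit status should show it (a sketch; the unit name vdsmd.service is taken from the "Redirecting to /bin/systemctl start vdsmd.service" line above):

systemctl status vdsmd.service            # last exit status and a hint of why it stopped
systemctl --failed                        # any units that failed along the way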
On 2012-5-15 14:21, Haim Ateya wrote:
> [HA] The problem is that systemctl doesn't show the real reason why the service didn't start. Let's try the following:
> - # cd /lib/systemd/
> - # ./systemd-vdsmd restart
[root@ovirt-node1 systemd]# ./systemd-vdsmd start
WARNING: no socket to connect to
vdsm: libvirt already configured for vdsm [ OK ]
Starting iscsid:
Starting libvirtd (via systemctl): [ OK ]
Stopping network (via systemctl): [ OK ]
Starting network (via systemctl): Job failed. See system logs and
'systemctl status' for details.
[FAILED]
Starting up vdsm daemon:
vdsm start [ OK ]
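As the output itself suggests, the failing piece here is the network job, not vdsm; something like this should show why (a sketch, assuming the legacy unit is named network.service and that service output lands in syslog on this release):

systemctl status network.service
tail -n 50 /var/log/messages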
I did further testing on this system. After I killed the leftover libvirtd process, the vdsm process could be started without libvirtd; however, vdsm did not work properly that way either. After several rounds of "killall libvirtd", "service vdsmd start", and "service vdsmd stop" (put together below), both vdsm and libvirtd now start. In summary:
1) the libvirtd started by vdsm may keep running even after its parent vdsm process is gone.
2) a leftover libvirtd may block the start of the vdsm service.
3) the vdsm service can sometimes work with a leftover libvirtd without creating a new one.
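Put together, the sequence that eventually got both daemons running was roughly (a sketch of the steps above, not a guaranteed recovery path):

killall libvirtd                          # clear the leftover libvirtd
service vdsmd stop
service vdsmd start
vdsClient -s 0 getVdsCaps                 # verify vdsm answers on its socket again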
The problem still exists in the engine: I have no way to reactivate the host node (a possible db-side mitigation is sketched after the output below). Here are the service processes on the host node; notice that the libvirtd process started earlier than the vdsm processes, which means this libvirtd is a leftover, not one created by vdsm in this round.
[root@ovirt-node1 systemd]# ps -ef | grep vdsm
root      8738     1  0 14:33 ?        00:00:00 /usr/sbin/libvirtd --listen # by vdsm
vdsm      9900     1  0 14:35 ?        00:00:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
vdsm      9903  9900  0 14:35 ?        00:00:01 /usr/bin/python /usr/share/vdsm/vdsm
root      9926  9903  0 14:35 ?        00:00:00 /usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.py b0fcae59-a3cc-4591-93e1-4b9a0bdb93c5 9903
root      9927  9926  0 14:35 ?        00:00:00 /usr/bin/python /usr/share/vdsm/supervdsmServer.py b0fcae59-a3cc-4591-93e1-4b9a0bdb93c5 9903
root     10451  4652  0 14:38 pts/0    00:00:00 grep --color=auto vdsm
[root@ovirt-node1 systemd]# vdsClient -s 0 getVdsCaps
HBAInventory = {'iSCSI': [{'InitiatorName':
'iqn.1994-05.com.redhat:f1b658ea7af8'}], 'FC': []}
ISCSIInitiatorName = iqn.1994-05.com.redhat:f1b658ea7af8
bondings = {'bond4': {'addr': '', 'cfg': {}, 'mtu': '1500',
'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond0':
{'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [],
'hwaddr': '00:00:00:00:00:00'}, 'bond1': {'addr': '', 'cfg': {}, 'mtu':
'1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'},
'bond2': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves':
[], 'hwaddr': '00:00:00:00:00:00'}, 'bond3': {'addr': '', 'cfg': {},
'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}}
clusterLevels = ['3.0', '3.1']
cpuCores = 12
cpuFlags =
fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,pdpe1gb,rdtscp,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,nopl,xtopology,nonstop_tsc,aperfmperf,pni,pclmulqdq,dtes64,monitor,ds_cpl,vmx,smx,est,tm2,ssse3,cx16,xtpr,pdcm,pcid,dca,sse4_1,sse4_2,popcnt,aes,lahf_lm,arat,epb,dts,tpr_shadow,vnmi,flexpriority,ept,vpid,model_coreduo,model_Conroe
cpuModel = Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
cpuSockets = 2
cpuSpeed = 1596.000
emulatedMachines = ['pc-0.14', 'pc', 'fedora-13', 'pc-0.13',
'pc-0.12', 'pc-0.11', 'pc-0.10', 'isapc', 'pc-0.14', 'pc', 'fedora-13',
'pc-0.13', 'pc-0.12', 'pc-0.11', 'pc-0.10', 'isapc']
guestOverhead = 65
hooks = {'before_vm_migrate_destination': {'50_vhostmd':
{'md5': '2aa9ac48ef07de3c94e3428975e9df1a'}}, 'after_vm_destroy':
{'50_vhostmd': {'md5': '47f8d385859e4c3c96113d8ff446b261'}},
'before_vm_dehibernate': {'50_vhostmd': {'md5':
'2aa9ac48ef07de3c94e3428975e9df1a'}}, 'before_vm_start': {'50_vhostmd':
{'md5': '2aa9ac48ef07de3c94e3428975e9df1a'}, '10_faqemu': {'md5':
'c899c5a7004c29ae2234bd409ddfa39b'}}}
kvmEnabled = true
lastClient = 9.181.129.153
lastClientIface = ovirtmgmt
management_ip =
memSize = 72486
networks = {'ovirtmgmt': {'addr': '9.181.129.110', 'cfg':
{'IPADDR': '9.181.129.110', 'ONBOOT': 'yes', 'DELAY': '0', 'NETMASK':
'255.255.255.0', 'BOOTPROTO': 'static', 'DEVICE': 'ovirtmgmt', 'TYPE':
'Bridge', 'GATEWAY': '9.181.129.1'}, 'mtu': '1500', 'netmask':
'255.255.255.0', 'stp': 'off', 'bridged': True, 'gateway':
'9.181.129.1', 'ports': ['eth0']}}
nics = {'p4p1': {'hwaddr': '00:00:C9:E5:A1:36', 'netmask': '',
'speed': 0, 'addr': '', 'mtu': '1500'}, 'p4p2': {'hwaddr':
'00:00:C9:E5:A1:3A', 'netmask': '', 'speed': 0, 'addr': '', 'mtu':
'1500'}, 'eth1': {'hwaddr': '5C:F3:FC:E4:32:A2', 'netmask': '', 'speed':
0, 'addr': '', 'mtu': '1500'}, 'eth0': {'hwaddr': '5C:F3:FC:E4:32:A0',
'netmask': '', 'speed': 1000, 'addr': '', 'mtu': '1500'}}
operatingSystem = {'release': '1', 'version': '16', 'name':
'oVirt Node'}
packages2 = {'kernel': {'release': '4.fc16.x86_64',
'buildtime': 1332237940.0, 'version': '3.3.0'}, 'spice-server':
{'release': '1.fc16', 'buildtime': '1327339129', 'version': '0.10.1'},
'vdsm': {'release': '0.183.git107644d.fc16.shuming1336622293',
'buildtime': '1336622307', 'version': '4.9.6'}, 'qemu-kvm': {'release':
'4.fc16', 'buildtime': '1327954752', 'version': '0.15.1'}, 'libvirt':
{'release': '1.fc17', 'buildtime': '1333539009', 'version': '0.9.11'},
'qemu-img': {'release': '4.fc16', 'buildtime': '1327954752', 'version':
'0.15.1'}}
reservedMem = 321
software_revision = 0
software_version = 4.9
supportedProtocols = ['2.2', '2.3']
supportedRHEVMs = ['3.0']
uuid = 47D88E9A-FC0F-11E0-B09A-5CF3FCE432A0_00:00:C9:E5:A1:36
version_name = Snow Man
vlans = {}
vmTypes = ['kvm']
[root@ovirt-node1 systemd]#
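Since the UI offers no way out of this state, the db mitigation Haim mentioned may be the only remaining option. On the engine machine it might look roughly like this; note this is a sketch only, and the db name, table, column, and status value are assumptions about the engine schema that should be verified before running anything:

psql -U postgres engine -c "SELECT vds_id, vds_name, status FROM vds;"
# assumption: status 0 means Unassigned, letting the engine re-initialize the host
psql -U postgres engine -c "UPDATE vds_dynamic SET status = 0 WHERE vds_id = '<id from the query above>';"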
--
Shu Ming <shuming@linux.vnet.ibm.com>
IBM China Systems and Technology Laboratory
There are some errors in the service status output. Is engine-notifierd critical to VDSM? Why did it say "pgrep: invalid user name: engine"?

[root@ovirt-node1 ~]# service --status-all
/etc/init.d/ceph: ceph conf /etc/ceph/ceph.conf not found; system is not configured.
# Generated by ebtables-save v1.0 on Tue May 15 14:08:06 CST 2012
*nat
:PREROUTING ACCEPT
:OUTPUT ACCEPT
:POSTROUTING ACCEPT
pgrep: invalid user name: engine
/etc/init.d/engine-notifierd is stopped
JAVA_EXECUTABLE or HSQLDB_JAR_PATH in '/etc/sysconfig/hsqldb' is set to a non-file.
No active sessions
--
Shu Ming <shuming@linux.vnet.ibm.com>
IBM China Systems and Technology Laboratory
On 05/15/2012 09:14 AM, Shu Ming wrote:
There are some errors in the service status output. Is engine-notifierd critical to VDSM? Why did it say "pgrep: invalid user name: engine"?
No. engine-notifierd just sends emails to users.
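As for the pgrep message: presumably the engine-notifierd init script checks for processes owned by a user named "engine", which doesn't exist on a plain host, so pgrep complains. For example:

pgrep -u engine java
pgrep: invalid user name: engine

It is noise from that init script, not a vdsm problem.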
participants (3)
- Haim Ateya
- Itamar Heim
- Shu Ming