Ovirt-engine-ha cannot see the live status of the Hosted Engine

Good day to all. I have some issues with oVirt 4.2.6, but the main one is this: I have two CentOS 7 nodes with the same configuration, the latest oVirt 4.2.6, and the Hosted Engine with its disk on NFS storage. Some other virtual machines are also working fine. While the Hosted Engine is running on one node (srv02.local) everything is fine, but after migrating it to the other node (srv00.local) I see that the agent cannot check the liveliness of the Hosted Engine. After a few minutes the Hosted Engine reboots, and after some time I see the same situation again; immediately after the migration to the other node (srv00.local) everything looks OK.

Output of the hosted-engine --vm-status command while the Hosted Engine is on the srv00 node:

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : srv02.local
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_unexpected", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : ecc7ad2d
local_conf_timestamp               : 78328
Host timestamp                     : 78328
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=78328 (Tue Sep 18 12:44:18 2018)
    host-id=1
    score=0
    vm_conf_refresh_time=78328 (Tue Sep 18 12:44:18 2018)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False
    timeout=Fri Jan 2 03:49:58 1970

--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : srv00.local
Host ID                            : 2
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 1d62b106
local_conf_timestamp               : 326288
Host timestamp                     : 326288
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=326288 (Tue Sep 18 12:44:21 2018)
    host-id=2
    score=3400
    vm_conf_refresh_time=326288 (Tue Sep 18 12:44:21 2018)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStarting
    stopped=False

agent.log from srv00.local:

MainThread::INFO::2018-09-18 12:40:51,749::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:40:52,052::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:01,066::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:01,374::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host srv02.local.pioner.kz (id 1): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=78128 (Tue Sep 18 12:40:58 2018)\nhost-id=1\nscore=0\nvm_conf_refresh_time=78128 (Tue Sep 18 12:40:58 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineUnexpectedlyDown\nstopped=False\ntimeout=Fri Jan 2 03:49:58 1970\n', 'hostname': 'srv02.local.pioner.kz', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down_unexpected', 'detail': 'unknown'}, 'score': 0, 'stopped': False, 'maintenance': False, 'crc32': 'e18e3f22', 'local_conf_timestamp': 78128, 'host-ts': 78128}
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'failed liveliness check', 'health': 'bad', 'vm': 'up', 'detail': 'Up'}, 'bridge': True, 'mem-free': 12763.0, 'maintenance': False, 'cpu-load': 0.0364, 'gateway': 1.0, 'storage-domain': True}
MainThread::INFO::2018-09-18 12:41:11,393::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:11,703::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:21,716::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:22,020::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:31,033::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:31,344::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)

As you can see, the agent thinks the Hosted Engine is just powering up, and I cannot do anything about it. I have already reinstalled the srv00 node many times without success; one time I even had to uninstall all ovirt* and vdsm* packages. One more interesting point: after installing only "yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release42.rpm" on this node, I tried to add the node from the engine web interface with the "Deploy" action, but the installation was unsuccessful until I had installed ovirt-hosted-engine-ha on the node. I do not see in the documentation that this is needed before installing new hosts, but I mention it for information and checking. After installing ovirt-hosted-engine-ha the node was installed with Hosted Engine support, but the main issue did not change. Thanks in advance for your help. BR, Alexandr

On Tue, Sep 18, 2018 at 9:21 AM <asm@pioner.kz> wrote:
Good day to all. I have some issues with oVirt 4.2.6, but the main one is this: I have two CentOS 7 nodes with the same configuration, the latest oVirt 4.2.6, and the Hosted Engine with its disk on NFS storage. Some other virtual machines are also working fine. While the Hosted Engine is running on one node (srv02.local) everything is fine, but after migrating it to the other node (srv00.local) I see that the agent cannot check the liveliness of the Hosted Engine. After a few minutes the Hosted Engine reboots, and after some time I see the same situation again; immediately after the migration to the other node (srv00.local) everything looks OK.
Output of the hosted-engine --vm-status command while the Hosted Engine is on the srv00 node:

--== Host 1 status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : srv02.local
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_unexpected", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : ecc7ad2d
local_conf_timestamp               : 78328
Host timestamp                     : 78328
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=78328 (Tue Sep 18 12:44:18 2018)
    host-id=1
    score=0
    vm_conf_refresh_time=78328 (Tue Sep 18 12:44:18 2018)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUnexpectedlyDown
    stopped=False
    timeout=Fri Jan 2 03:49:58 1970
--== Host 2 status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : srv00.local
Host ID                            : 2
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
"vm": "up" refers to the VM status at the virtualization level, polled from the local vdsm; "health": "bad" instead refers to a live check against the engine portal over HTTP. Bad name resolution or network routing issues can cause this. I'd suggest checking that everything is fine on the network side.
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 1d62b106
local_conf_timestamp               : 326288
Host timestamp                     : 326288
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=326288 (Tue Sep 18 12:44:21 2018)
    host-id=2
    score=3400
    vm_conf_refresh_time=326288 (Tue Sep 18 12:44:21 2018)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStarting
    stopped=False
Log agent.log from srv00.local:
MainThread::INFO::2018-09-18 12:40:51,749::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:40:52,052::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:01,066::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:01,374::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host srv02.local.pioner.kz (id 1): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=78128 (Tue Sep 18 12:40:58 2018)\nhost-id=1\nscore=0\nvm_conf_refresh_time=78128 (Tue Sep 18 12:40:58 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineUnexpectedlyDown\nstopped=False\ntimeout=Fri Jan 2 03:49:58 1970\n', 'hostname': 'srv02.local.pioner.kz', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down_unexpected', 'detail': 'unknown'}, 'score': 0, 'stopped': False, 'maintenance': False, 'crc32': 'e18e3f22', 'local_conf_timestamp': 78128, 'host-ts': 78128}
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'failed liveliness check', 'health': 'bad', 'vm': 'up', 'detail': 'Up'}, 'bridge': True, 'mem-free': 12763.0, 'maintenance': False, 'cpu-load': 0.0364, 'gateway': 1.0, 'storage-domain': True}
MainThread::INFO::2018-09-18 12:41:11,393::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:11,703::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:21,716::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:22,020::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:31,033::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:31,344::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)

As you can see, the agent thinks the Hosted Engine is just powering up, and I cannot do anything about it. I have already reinstalled the srv00 node many times without success; one time I even had to uninstall all ovirt* and vdsm* packages. One more interesting point: after installing only "yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release42.rpm" on this node, I tried to add the node from the engine web interface with the "Deploy" action, but the installation was unsuccessful until I had installed ovirt-hosted-engine-ha on the node. I do not see in the documentation that this is needed before installing new hosts, but I mention it for information and checking. After installing ovirt-hosted-engine-ha the node was installed with Hosted Engine support, but the main issue did not change. Thanks in advance for your help. BR, Alexandr

Hi! How can I check the network? Everything is the same on the two nodes except the IP addresses. Pings and other checks work fine. Here is also the broker.log from srv00; you can see the moment when the Hosted Engine migrated to the other host:

Thread-2::INFO::2018-09-18 12:48:07,531::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-3::INFO::2018-09-18 12:48:07,767::mem_free::51::mem_free.MemFree::(action) memFree: 12774
Thread-1::INFO::2018-09-18 12:48:07,901::ping::60::ping.Ping::(action) Successfully pinged 192.168.2.248
Thread-2::INFO::2018-09-18 12:48:17,555::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-3::INFO::2018-09-18 12:48:17,812::mem_free::51::mem_free.MemFree::(action) memFree: 12766
Thread-3::INFO::2018-09-18 12:48:26,852::mem_free::51::mem_free.MemFree::(action) memFree: 12757
Thread-1::INFO::2018-09-18 12:48:27,453::ping::60::ping.Ping::(action) Successfully pinged 192.168.2.248
Thread-2::INFO::2018-09-18 12:48:27,587::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-5::WARNING::2018-09-18 12:48:30,495::engine_health::233::engine_health.EngineHealth::(_result_from_stats) bad health status: Hosted Engine is not up!
Thread-3::INFO::2018-09-18 12:48:36,894::mem_free::51::mem_free.MemFree::(action) memFree: 12759
Thread-2::INFO::2018-09-18 12:48:37,619::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-1::INFO::2018-09-18 12:48:37,727::ping::60::ping.Ping::(action) Successfully pinged 192.168.2.248
Thread-3::INFO::2018-09-18 12:48:46,944::mem_free::51::mem_free.MemFree::(action) memFree: 12762
Thread-2::INFO::2018-09-18 12:48:47,651::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-1::INFO::2018-09-18 12:48:48,006::ping::60::ping.Ping::(action) Successfully pinged 192.168.2.248
Thread-5::WARNING::2018-09-18 12:48:50,603::engine_health::233::engine_health.EngineHealth::(_result_from_stats) bad health status: Hosted Engine is not up!
Thread-3::INFO::2018-09-18 12:48:57,021::mem_free::51::mem_free.MemFree::(action) memFree: 12736
Thread-1::INFO::2018-09-18 12:48:57,285::ping::60::ping.Ping::(action) Successfully pinged 192.168.2.248
Thread-2::INFO::2018-09-18 12:48:57,679::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-4::INFO::2018-09-18 12:49:04,920::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.0401, engine=0.0030, non-engine=0.0371
Thread-3::INFO::2018-09-18 12:49:07,064::mem_free::51::mem_free.MemFree::(action) memFree: 12740
Thread-1::INFO::2018-09-18 12:49:07,561::ping::60::ping.Ping::(action) Successfully pinged 192.168.2.248
Thread-2::INFO::2018-09-18 12:49:07,760::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-5::WARNING::2018-09-18 12:49:10,715::engine_health::233::engine_health.EngineHealth::(_result_from_stats) bad health status: Hosted Engine is not up!
Thread-5::WARNING::2018-09-18 12:49:10,823::engine_health::233::engine_health.EngineHealth::(_result_from_stats) bad health status: Hosted Engine is not up!
Thread-7::WARNING::2018-09-18 12:49:12,961::engine_health::233::engine_health.EngineHealth::(_result_from_stats) bad health status: Hosted Engine is not up!
Thread-3::INFO::2018-09-18 12:49:17,114::mem_free::51::mem_free.MemFree::(action) memFree: 12739
Thread-1::INFO::2018-09-18 12:49:17,817::ping::60::ping.Ping::(action) Successfully pinged 192.168.2.248
Thread-2::INFO::2018-09-18 12:49:17,888::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-5::WARNING::2018-09-18 12:49:19,945::engine_health::233::engine_health.EngineHealth::(_result_from_stats) bad health status: Hosted Engine is not up!
Thread-7::INFO::2018-09-18 12:49:25,650::engine_health::191::engine_health.EngineHealth::(_result_from_stats) VM successfully migrated away from this host.
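Besides ping, the check that is actually failing here is the HTTP one against the engine portal, so reproducing it by hand from srv00 is the most direct test. A minimal sketch, where engine.example.com is a placeholder for the real engine FQDN and the health URL is the one polled by recent 4.2 hosted-engine-ha releases (adjust it if your version differs):

    # what this host resolves for the engine FQDN (this includes /etc/hosts)
    getent hosts engine.example.com
    # the liveliness check itself: the health page should answer quickly with a short status line
    curl --max-time 10 http://engine.example.com/ovirt-engine/services/health

If the curl hangs, times out, or reaches a different machine while plain ping works, the problem is almost certainly name resolution or routing on that host rather than the engine VM itself.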

Hi! You were right. The problem was due to an error in the hosts file: on this host the engine FQDN was mapped to a different IP, left over from a previous installation. Thank you very much. Please also help me with my other question; I know that you can help me.
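For anyone hitting the same symptom, a quick way to spot this kind of stale hosts-file entry is to compare what each source returns for the engine FQDN on every HA host. A sketch, again with engine.example.com as a placeholder for the real engine FQDN:

    # entry (if any) in the local hosts file
    grep -i engine.example.com /etc/hosts
    # what the resolver library actually returns (the hosts file wins by default)
    getent hosts engine.example.com
    # what DNS alone returns
    dig +short engine.example.com

When the last two disagree, the host is using a leftover static entry; removing or correcting it in /etc/hosts fixes the liveliness check without touching anything else.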

Hello, we are probably experiencing the same situation, and it is a serious failure of oVirt! After the last upgrade of the CentOS 7 hosts and the hosted engine (see versions below), the hosted engine does not come up, making the whole oVirt cluster USELESS. I can't connect to the engine console in any way. I even tried a clean install of one CentOS 7 host with the hosted engine on NFS storage, with the same result: the engine comes up but fails the liveliness check.

Here are the versions from the clean install, which is also failing:

ovirt-engine-appliance.noarch        4.2-20180903.1.el7    @ovirt-4.2
ovirt-engine-sdk-python.noarch       3.6.9.1-1.el7         @ovirt-4.2
ovirt-host.x86_64                    4.2.3-1.el7           @ovirt-4.2
ovirt-host-dependencies.x86_64       4.2.3-1.el7           @ovirt-4.2
ovirt-host-deploy.noarch             1.7.4-1.el7           @ovirt-4.2
ovirt-hosted-engine-ha.noarch        2.2.16-1.el7          @ovirt-4.2
ovirt-hosted-engine-setup.noarch     2.2.26-1.el7          @ovirt-4.2
ovirt-imageio-common.x86_64          1.4.4-0.el7           @ovirt-4.2
ovirt-imageio-daemon.noarch          1.4.4-0.el7           @ovirt-4.2
ovirt-provider-ovn-driver.noarch     1.2.14-1.el7          @ovirt-4.2
ovirt-release42.noarch               4.2.6.1-1.el7         installed
ovirt-setup-lib.noarch               1.1.5-1.el7           @ovirt-4.2
ovirt-vmconsole.noarch               1.0.5-4.el7.centos    @ovirt-4.2
ovirt-vmconsole-host.noarch          1.0.5-4.el7.centos    @ovirt-4.2
python-ovirt-engine-sdk4.x86_64      4.2.8-2.el7           @ovirt-4.2

This is a total failure of a system that is "intended for production use". Even more, when I try to restore a regular engine backup to a new engine before running engine-setup on a new, clean CentOS 7 host, that fails as well.

What logs would you need to get this analyzed? I can supply them right away.

Regards, Pavel
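For the backup-and-restore path mentioned above, the usual order of operations on a clean machine is roughly the following. This is only a sketch: the file names are placeholders and the exact provisioning flags can differ between 4.2 minor releases, so check engine-backup --help on your version:

    # on the old engine: take a full backup
    engine-backup --mode=backup --file=engine-backup.tar.gz --log=backup.log
    # on the new, clean engine machine: restore it before engine-setup,
    # letting engine-backup provision the databases
    engine-backup --mode=restore --file=engine-backup.tar.gz --log=restore.log \
                  --provision-db --provision-dwh-db --restore-permissions
    # only then run engine-setup
    engine-setup

If the restore itself fails, the details end up in the --log file, which is worth attaching along with the other logs.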

On Wed, Sep 19, 2018 at 3:06 PM Pavel Stržínek <pavel.strzinek@gmail.com> wrote:
Hello, we are probably experiencing the same situation, and it is a serious failure of oVirt! After the last upgrade of the CentOS 7 hosts and the hosted engine (see versions below), the hosted engine does not come up, making the whole oVirt cluster USELESS. I can't connect to the engine console in any way. I even tried a clean install of one CentOS 7 host with the hosted engine on NFS storage, with the same result: the engine comes up but fails the liveliness check.
Here are the versions from the clean install, which is also failing:
ovirt-engine-appliance.noarch        4.2-20180903.1.el7    @ovirt-4.2
ovirt-engine-sdk-python.noarch       3.6.9.1-1.el7         @ovirt-4.2
ovirt-host.x86_64                    4.2.3-1.el7           @ovirt-4.2
ovirt-host-dependencies.x86_64       4.2.3-1.el7           @ovirt-4.2
ovirt-host-deploy.noarch             1.7.4-1.el7           @ovirt-4.2
ovirt-hosted-engine-ha.noarch        2.2.16-1.el7          @ovirt-4.2
ovirt-hosted-engine-setup.noarch     2.2.26-1.el7          @ovirt-4.2
ovirt-imageio-common.x86_64          1.4.4-0.el7           @ovirt-4.2
ovirt-imageio-daemon.noarch          1.4.4-0.el7           @ovirt-4.2
ovirt-provider-ovn-driver.noarch     1.2.14-1.el7          @ovirt-4.2
ovirt-release42.noarch               4.2.6.1-1.el7         installed
ovirt-setup-lib.noarch               1.1.5-1.el7           @ovirt-4.2
ovirt-vmconsole.noarch               1.0.5-4.el7.centos    @ovirt-4.2
ovirt-vmconsole-host.noarch          1.0.5-4.el7.centos    @ovirt-4.2
python-ovirt-engine-sdk4.x86_64      4.2.8-2.el7           @ovirt-4.2
This is a total failure of a system that is "intended for production use". Even more, when I try to restore a regular engine backup to a new engine before running engine-setup on a new, clean CentOS 7 host, that fails as well.
What logs would you need to get this analyzed? I can supply them right away.
vdsm.log from the host trying to start the engine VM. Did you try to connect to the engine VM over VNC?
Regards, Pavel
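Regarding the VNC suggestion above, one way to reach the engine VM's console from the host it is currently running on is sketched below; it relies on the hosted-engine tool, so check hosted-engine --help for the options available in your version:

    # set a temporary console password for the HostedEngine VM
    hosted-engine --add-console-password
    # then point a VNC client at the host running the VM (the port is usually 5900, adjust if needed)
    remote-viewer vnc://srv00.local:5900
    # alternatively, attach to the serial console
    hosted-engine --console

Even when the liveliness check fails, the console usually shows whether the engine OS booted and whether the ovirt-engine service itself started.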
participants (3)
- asm@pioner.kz
- Pavel Stržínek
- Simone Tiraboschi