OST: HE vm does not restart on HC setup

Hi all, On the HC setup, the HE VM is not restarted. The agent.log has MainThread::INFO::2017-02-21 22:09:58,022::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {} MainThread::INFO::2017-02-21 22:09:58,023::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'failed to getVmStats', 'health': 'unknown', 'vm': 'unknown', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 4079.0, 'maintenance': False, 'cpu-load': 0.0491, 'gateway': True} ... MainThread::INFO::2017-02-21 22:10:29,219::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken MainThread::INFO::2017-02-21 22:10:29,219::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1487733029.22 type=state_transition detail=ReinitializeFSM-UnknownLocalVmState hostname='lago-hc-basic-suite-master-host0' MainThread::INFO::2017-02-21 22:10:29,317::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-UnknownLocalVmState) sent? ignored and the vdsm.log 2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Changed state to Down: User shut down from within the guest (code=7) (vm:1269) 2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Stopping connection (guestagent:429) 2017-02-21 22:09:29,727-0500 ERROR (jsonrpc/4) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} What should I be looking for to identify the issue? The logs are at http://jenkins.ovirt.org/job/ovirt_master_hc-system-tests/lastCompletedBuild... thanks sahina

Adding Lev On Wed, Feb 22, 2017 at 12:59 PM, Sahina Bose <sabose@redhat.com> wrote:
Hi all,
On the HC setup, the HE VM is not restarted. The agent.log has MainThread::INFO::2017-02-21 22:09:58,022::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {} MainThread::INFO::2017-02-21 22:09:58,023::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'failed to getVmStats', 'health': 'unknown', 'vm': 'unknown', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 4079.0, 'maintenance': False, 'cpu-load': 0.0491, 'gateway': True} ... MainThread::INFO::2017-02-21 22:10:29,219::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken MainThread::INFO::2017-02-21 22:10:29,219::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1487733029.22 type=state_transition detail=ReinitializeFSM-UnknownLocalVmState hostname='lago-hc-basic-suite-master-host0' MainThread::INFO::2017-02-21 22:10:29,317::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-UnknownLocalVmState) sent? ignored
and the vdsm.log
2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Changed state to Down: User shut down from within the guest (code=7) (vm:1269) 2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Stopping connection (guestagent:429)
2017-02-21 22:09:29,727-0500 ERROR (jsonrpc/4) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
What should I be looking for to identify the issue?
The logs are at http://jenkins.ovirt.org/job/ovirt_master_hc-system-tests/lastCompletedBuild...
thanks
sahina
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
-- Sandro Bonazzola Better technology. Faster innovation. Powered by community collaboration. See how it works at redhat.com

When ovirt-ha-agent checks the status of the engine VM we get: 2017-02-21 22:21:14,738-0500 ERROR (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} While in ovirt-ha-agent logs we have: MainThread::INFO::2017-02-21 22:21:18,583::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state UnknownLocalVmState (score: 3400) ... MainThread::INFO::2017-02-21 22:21:31,199::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken Probably it's a bug or a regression somewhere on master. On Wed, Feb 22, 2017 at 1:02 PM, Sandro Bonazzola <sbonazzo@redhat.com> wrote:
Adding Lev
On Wed, Feb 22, 2017 at 12:59 PM, Sahina Bose <sabose@redhat.com> wrote:
Hi all,
On the HC setup, the HE VM is not restarted. The agent.log has MainThread::INFO::2017-02-21 22:09:58,022::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {} MainThread::INFO::2017-02-21 22:09:58,023::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'failed to getVmStats', 'health': 'unknown', 'vm': 'unknown', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 4079.0, 'maintenance': False, 'cpu-load': 0.0491, 'gateway': True} ... MainThread::INFO::2017-02-21 22:10:29,219::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken MainThread::INFO::2017-02-21 22:10:29,219::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1487733029.22 type=state_transition detail=ReinitializeFSM-UnknownLocalVmState hostname='lago-hc-basic-suite-master-host0' MainThread::INFO::2017-02-21 22:10:29,317::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-UnknownLocalVmState) sent? ignored
and the vdsm.log
2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Changed state to Down: User shut down from within the guest (code=7) (vm:1269) 2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Stopping connection (guestagent:429)
2017-02-21 22:09:29,727-0500 ERROR (jsonrpc/4) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
What should I be looking for to identify the issue?
The logs are at http://jenkins.ovirt.org/job/ovirt_master_hc-system-tests/lastCompletedBuild...
thanks
sahina
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
-- Sandro Bonazzola Better technology. Faster innovation. Powered by community collaboration. See how it works at redhat.com

On Wed, Feb 22, 2017 at 1:33 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:
When ovirt-ha-agent checks the status of the engine VM we get:
2017-02-21 22:21:14,738-0500 ERROR (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
While in ovirt-ha-agent logs we have:
MainThread::INFO::2017-02-21 22:21:18,583::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state UnknownLocalVmState (score: 3400)
...
MainThread::INFO::2017-02-21 22:21:31,199::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken
Probably it's a bug or a regression somewhere on master.
On ovirt-ha-broker side the detection is based on a strict string match on the error message that is expected to be exactly 'Virtual machine does not exist' to set down status otherwise we set unknown status as in this case: https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_... Adding Francesco here to understand if something has recently changed there on vdsm side.
On Wed, Feb 22, 2017 at 1:02 PM, Sandro Bonazzola <sbonazzo@redhat.com> wrote:
Adding Lev
On Wed, Feb 22, 2017 at 12:59 PM, Sahina Bose <sabose@redhat.com> wrote:
Hi all,
On the HC setup, the HE VM is not restarted. The agent.log has MainThread::INFO::2017-02-21 22:09:58,022::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {} MainThread::INFO::2017-02-21 22:09:58,023::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'failed to getVmStats', 'health': 'unknown', 'vm': 'unknown', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 4079.0, 'maintenance': False, 'cpu-load': 0.0491, 'gateway': True} ... MainThread::INFO::2017-02-21 22:10:29,219::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken MainThread::INFO::2017-02-21 22:10:29,219::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1487733029.22 type=state_transition detail=ReinitializeFSM-UnknownLocalVmState hostname='lago-hc-basic-suite-master-host0' MainThread::INFO::2017-02-21 22:10:29,317::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-UnknownLocalVmState) sent? ignored
and the vdsm.log
2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Changed state to Down: User shut down from within the guest (code=7) (vm:1269) 2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Stopping connection (guestagent:429)
2017-02-21 22:09:29,727-0500 ERROR (jsonrpc/4) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
What should I be looking for to identify the issue?
The logs are at http://jenkins.ovirt.org/job/ovirt_master_hc-system-tests/lastCompletedBuild...
thanks
sahina
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
-- Sandro Bonazzola Better technology. Faster innovation. Powered by community collaboration. See how it works at redhat.com

On 22 Feb 2017, at 13:53, Simone Tiraboschi <stirabos@redhat.com> wrote:
On Wed, Feb 22, 2017 at 1:33 PM, Simone Tiraboschi <stirabos@redhat.com <mailto:stirabos@redhat.com>> wrote: When ovirt-ha-agent checks the status of the engine VM we get: 2017-02-21 22:21:14,738-0500 ERROR (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
While in ovirt-ha-agent logs we have: MainThread::INFO::2017-02-21 22:21:18,583::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state UnknownLocalVmState (score: 3400) ... MainThread::INFO::2017-02-21 22:21:31,199::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken Probably it's a bug or a regression somewhere on master.
On ovirt-ha-broker side the detection is based on a strict string match on the error message that is expected to be exactly 'Virtual machine does not exist' to set down status otherwise we set unknown status as in this case: https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_... <https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/broker/submonitors/engine_health.py;h=d633cb860b811e84021221771bf706a9a4ac1d63;hb=refs/heads/master#l54>
Adding Francesco here to understand if something has recently changed there on vdsm side.
That’s not a very robust code handling. Yes, the text changed, the vm id was added. And yes, it may change again any time I guess
On Wed, Feb 22, 2017 at 1:02 PM, Sandro Bonazzola <sbonazzo@redhat.com <mailto:sbonazzo@redhat.com>> wrote: Adding Lev
On Wed, Feb 22, 2017 at 12:59 PM, Sahina Bose <sabose@redhat.com <mailto:sabose@redhat.com>> wrote: Hi all,
On the HC setup, the HE VM is not restarted. The agent.log has MainThread::INFO::2017-02-21 22:09:58,022::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {} MainThread::INFO::2017-02-21 22:09:58,023::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'failed to getVmStats', 'health': 'unknown', 'vm': 'unknown', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 4079.0, 'maintenance': False, 'cpu-load': 0.0491, 'gateway': True} ... MainThread::INFO::2017-02-21 22:10:29,219::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken MainThread::INFO::2017-02-21 22:10:29,219::brokerlink::111::ovirt_hosted_engine_ha.lib.br <http://ovirt_hosted_engine_ha.lib.br/>okerlink.BrokerLink::(notify) Trying: notify time=1487733029.22 type=state_transition detail=ReinitializeFSM-UnknownLocalVmState hostname='lago-hc-basic-suite-master-host0' MainThread::INFO::2017-02-21 22:10:29,317::brokerlink::121::ovirt_hosted_engine_ha.lib.br <http://ovirt_hosted_engine_ha.lib.br/>okerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-UnknownLocalVmState) sent? ignored and the vdsm.log
2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Changed state to Down: User shut down from within the guest (code=7) (vm:1269) 2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Stopping connection (guestagent:429)
2017-02-21 22:09:29,727-0500 ERROR (jsonrpc/4) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
What should I be looking for to identify the issue?
The logs are at http://jenkins.ovirt.org/job/ovirt_master_hc-system-tests/lastCompletedBuild... <http://jenkins.ovirt.org/job/ovirt_master_hc-system-tests/lastCompletedBuild/artifact/exported-artifacts/test_logs/hc-basic-suite-master/post-002_bootstrap.py/lago-hc-basic-suite-master-host0>
thanks sahina
_______________________________________________ Devel mailing list Devel@ovirt.org <mailto:Devel@ovirt.org> http://lists.ovirt.org/mailman/listinfo/devel <http://lists.ovirt.org/mailman/listinfo/devel>
-- Sandro Bonazzola Better technology. Faster innovation. Powered by community collaboration. See how it works at redhat.com <http://redhat.com/>
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On Wed, Feb 22, 2017 at 3:22 PM, Michal Skrivanek < michal.skrivanek@redhat.com> wrote:
On 22 Feb 2017, at 13:53, Simone Tiraboschi <stirabos@redhat.com> wrote:
On Wed, Feb 22, 2017 at 1:33 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:
When ovirt-ha-agent checks the status of the engine VM we get:
2017-02-21 22:21:14,738-0500 ERROR (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
While in ovirt-ha-agent logs we have:
MainThread::INFO::2017-02-21 22:21:18,583::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state UnknownLocalVmState (score: 3400)
...
MainThread::INFO::2017-02-21 22:21:31,199::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken
Probably it's a bug or a regression somewhere on master.
On ovirt-ha-broker side the detection is based on a strict string match on the error message that is expected to be exactly 'Virtual machine does not exist' to set down status otherwise we set unknown status as in this case: https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine- ha.git;a=blob;f=ovirt_hosted_engine_ha/broker/submonitors/ engine_health.py;h=d633cb860b811e84021221771bf706a9a4ac1d63;hb=refs/heads/ master#l54
Adding Francesco here to understand if something has recently changed there on vdsm side.
That’s not a very robust code handling. Yes, the text changed, the vm id was added. And yes, it may change again any time I guess
I agree, we are going to move to code check: https://gerrit.ovirt.org/#/c/72891
On Wed, Feb 22, 2017 at 1:02 PM, Sandro Bonazzola <sbonazzo@redhat.com> wrote:
Adding Lev
On Wed, Feb 22, 2017 at 12:59 PM, Sahina Bose <sabose@redhat.com> wrote:
Hi all,
On the HC setup, the HE VM is not restarted. The agent.log has MainThread::INFO::2017-02-21 22:09:58,022::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {} MainThread::INFO::2017-02-21 22:09:58,023::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'failed to getVmStats', 'health': 'unknown', 'vm': 'unknown', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 4079.0, 'maintenance': False, 'cpu-load': 0.0491, 'gateway': True} ... MainThread::INFO::2017-02-21 22:10:29,219::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken MainThread::INFO::2017-02-21 22:10:29,219::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1487733029.22 type=state_transition detail=ReinitializeFSM-UnknownLocalVmState hostname='lago-hc-basic-suite-master-host0' MainThread::INFO::2017-02-21 22:10:29,317::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-UnknownLocalVmState) sent? ignored
and the vdsm.log
2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Changed state to Down: User shut down from within the guest (code=7) (vm:1269) 2017-02-21 22:09:11,962-0500 INFO (libvirt/events) [virt.vm] (vmId='2ccc0ef0-cc31-45b8-8e91-a78fa4cad671') Stopping connection (guestagent:429)
2017-02-21 22:09:29,727-0500 ERROR (jsonrpc/4) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
What should I be looking for to identify the issue?
The logs are at http://jenkins.ovirt.org/job/ovirt_master_hc-system-tests/lastCompletedBuild...
thanks
sahina
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
-- Sandro Bonazzola Better technology. Faster innovation. Powered by community collaboration. See how it works at redhat.com
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On 02/22/2017 01:53 PM, Simone Tiraboschi wrote:
On Wed, Feb 22, 2017 at 1:33 PM, Simone Tiraboschi <stirabos@redhat.com <mailto:stirabos@redhat.com>> wrote:
When ovirt-ha-agent checks the status of the engine VM we get:
2017-02-21 22:21:14,738-0500 ERROR (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
While in ovirt-ha-agent logs we have:
MainThread::INFO::2017-02-21 22:21:18,583::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state UnknownLocalVmState (score: 3400)
...
MainThread::INFO::2017-02-21 22:21:31,199::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken
Probably it's a bug or a regression somewhere on master.
On ovirt-ha-broker side the detection is based on a strict string match on the error message that is expected to be exactly 'Virtual machine does not exist' to set down status otherwise we set unknown status as in this case: https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_...
Adding Francesco here to understand if something has recently changed there on vdsm side.
It has changed indeed; we had a series of changes which added context to some exceptions. I believe the straw who broke the camel's back was I32ec3f86f8d53f8412f4c0526fc85e2a42e30ea5 It is unfortunate that this change broke HA. Could you perhaps fixing it checking that the message *begins* with that string, and/or checking the error code. bests, -- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani

broke HA. Could you perhaps fixing it checking that the message *begins* with that string, and/or checking the error code. bests,
Already done: https://gerrit.ovirt.org/#/c/72791/ Martin On Wed, Feb 22, 2017 at 3:32 PM, Francesco Romani <fromani@redhat.com> wrote:
On 02/22/2017 01:53 PM, Simone Tiraboschi wrote:
On Wed, Feb 22, 2017 at 1:33 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:
When ovirt-ha-agent checks the status of the engine VM we get:
2017-02-21 22:21:14,738-0500 ERROR (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
While in ovirt-ha-agent logs we have:
MainThread::INFO::2017-02-21 22:21:18,583::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state UnknownLocalVmState (score: 3400)
...
MainThread::INFO::2017-02-21 22:21:31,199::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken
Probably it's a bug or a regression somewhere on master.
On ovirt-ha-broker side the detection is based on a strict string match on the error message that is expected to be exactly 'Virtual machine does not exist' to set down status otherwise we set unknown status as in this case: https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_...
Adding Francesco here to understand if something has recently changed there on vdsm side.
It has changed indeed; we had a series of changes which added context to some exceptions. I believe the straw who broke the camel's back was I32ec3f86f8d53f8412f4c0526fc85e2a42e30ea5 It is unfortunate that this change broke HA. Could you perhaps fixing it checking that the message *begins* with that string, and/or checking the error code. bests,
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

Btw, this whole thing is probably the oldest untouched piece of hosted engine we have. It worked till now :) Martin On Wed, Feb 22, 2017 at 3:38 PM, Martin Sivak <msivak@redhat.com> wrote:
broke HA. Could you perhaps fixing it checking that the message *begins* with that string, and/or checking the error code. bests,
Already done:
https://gerrit.ovirt.org/#/c/72791/
Martin
On Wed, Feb 22, 2017 at 3:32 PM, Francesco Romani <fromani@redhat.com> wrote:
On 02/22/2017 01:53 PM, Simone Tiraboschi wrote:
On Wed, Feb 22, 2017 at 1:33 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:
When ovirt-ha-agent checks the status of the engine VM we get:
2017-02-21 22:21:14,738-0500 ERROR (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
While in ovirt-ha-agent logs we have:
MainThread::INFO::2017-02-21 22:21:18,583::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state UnknownLocalVmState (score: 3400)
...
MainThread::INFO::2017-02-21 22:21:31,199::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken
Probably it's a bug or a regression somewhere on master.
On ovirt-ha-broker side the detection is based on a strict string match on the error message that is expected to be exactly 'Virtual machine does not exist' to set down status otherwise we set unknown status as in this case: https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_...
Adding Francesco here to understand if something has recently changed there on vdsm side.
It has changed indeed; we had a series of changes which added context to some exceptions. I believe the straw who broke the camel's back was I32ec3f86f8d53f8412f4c0526fc85e2a42e30ea5 It is unfortunate that this change broke HA. Could you perhaps fixing it checking that the message *begins* with that string, and/or checking the error code. bests,
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On Wed, Feb 22, 2017 at 4:32 PM Francesco Romani <fromani@redhat.com> wrote:
On 02/22/2017 01:53 PM, Simone Tiraboschi wrote:
On Wed, Feb 22, 2017 at 1:33 PM, Simone Tiraboschi <stirabos@redhat.com> wrote:
When ovirt-ha-agent checks the status of the engine VM we get:
2017-02-21 22:21:14,738-0500 ERROR (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
While in ovirt-ha-agent logs we have:
MainThread::INFO::2017-02-21 22:21:18,583::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state UnknownLocalVmState (score: 3400)
...
MainThread::INFO::2017-02-21 22:21:31,199::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken
Probably it's a bug or a regression somewhere on master.
On ovirt-ha-broker side the detection is based on a strict string match on the error message that is expected to be exactly 'Virtual machine does not exist' to set down status otherwise we set unknown status as in this case:
https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_...
Adding Francesco here to understand if something has recently changed there on vdsm side.
It has changed indeed; we had a series of changes which added context to some exceptions. I believe the straw who broke the camel's back was I32ec3f86f8d53f8412f4c0526fc85e2a42e30ea5 It is unfortunate that this change broke HA. Could you perhaps fixing it checking that the message *begins* with that string, and/or checking the error code. bests,
On the bright side, this is exactly why we need o-s-t running Hosted-Engine - though we probably need to exercise more HE flows (global and local maint., for example). On the downside, how come I32ec3f86f8d53f8412f4c0526fc85e2a42e30ea5 was merged on Jan1st, and we only saw the regression now? Is there another bug that hid this one until now? Y.
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On 02/22/2017 03:42 PM, Yaniv Kaul wrote:
On Wed, Feb 22, 2017 at 4:32 PM Francesco Romani <fromani@redhat.com <mailto:fromani@redhat.com>> wrote:
On 02/22/2017 01:53 PM, Simone Tiraboschi wrote:
On Wed, Feb 22, 2017 at 1:33 PM, Simone Tiraboschi <stirabos@redhat.com <mailto:stirabos@redhat.com>> wrote:
When ovirt-ha-agent checks the status of the engine VM we get:
2017-02-21 22:21:14,738-0500 ERROR (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'} (api:69) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 67, in method ret = func(*args, **kwargs) File "/usr/share/vdsm/API.py", line 335, in getStats vm = self.vm File "/usr/share/vdsm/API.py", line 130, in vm raise exception.NoSuchVM(vmId=self._UUID) NoSuchVM: Virtual machine does not exist: {'vmId': u'2ccc0ef0-cc31-45b8-8e91-a78fa4cad671'}
While in ovirt-ha-agent logs we have:
MainThread::INFO::2017-02-21 22:21:18,583::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state UnknownLocalVmState (score: 3400)
...
MainThread::INFO::2017-02-21 22:21:31,199::state_decorators::25::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Unknown local engine vm status no actions taken
Probably it's a bug or a regression somewhere on master.
On ovirt-ha-broker side the detection is based on a strict string match on the error message that is expected to be exactly 'Virtual machine does not exist' to set down status otherwise we set unknown status as in this case: https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_...
Adding Francesco here to understand if something has recently changed there on vdsm side.
It has changed indeed; we had a series of changes which added context to some exceptions. I believe the straw who broke the camel's back was I32ec3f86f8d53f8412f4c0526fc85e2a42e30ea5 It is unfortunate that this change broke HA. Could you perhaps fixing it checking that the message *begins* with that string, and/or checking the error code. bests,
On the bright side, this is exactly why we need o-s-t running Hosted-Engine - though we probably need to exercise more HE flows (global and local maint., for example). On the downside, how come I32ec3f86f8d53f8412f4c0526fc85e2a42e30ea5 was merged on Jan1st, and we only saw the regression now? Is there another bug that hid this one until now? Y.
It was merged on Jan 29 on master, backported on Feb 8 on 4.1 branch (because it was part of the vmleases feature, needed on 4.1.z). Bests, -- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
participants (7)
-
Francesco Romani
-
Martin Sivak
-
Michal Skrivanek
-
Sahina Bose
-
Sandro Bonazzola
-
Simone Tiraboschi
-
Yaniv Kaul