Self-hosted engine won't start

Hi All, I have created a lab with 2 hypervisors and a self-hosted engine. Today I followed the upgrade instructions as described in http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I didn't really do an upgrade but simply wanted to test what would happen when the engine was rebooted.

When the engine didn't restart I re-ran hosted-engine --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't restarted, so I then tried rebooting both hypervisors. After an hour there was still no sign of the engine starting. The agent logs don't help me much. The following bits are repeated over and over.

ovirt1 (192.168.19.20):

MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net'
MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.21 (id: 2, score: 2400)

ovirt2 (192.168.19.21):

MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net'
MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.20 (id: 1, score: 2400)
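For anyone wading through these logs, the state and score can be pulled out of the start_monitoring lines with a few lines of Python (a throwaway illustrative helper, not part of oVirt; the regex only targets the exact log format quoted above):

```python
import re

# Matches the start_monitoring lines above, e.g.
# "... Current state EngineDown (score: 2400)"
STATE_RE = re.compile(r"Current state (\w+) \(score: (\d+)\)")

def parse_states(log_text):
    """Return a (state, score) tuple for every monitoring line found."""
    return [(m.group(1), int(m.group(2))) for m in STATE_RE.finditer(log_text)]

sample = ("MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::"
          "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
          "(start_monitoring) Current state EngineDown (score: 2400)")
print(parse_states(sample))  # → [('EngineDown', 2400)]
```

Feeding a whole agent.log through this makes it obvious at a glance whether either host's score ever drops below 2400.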
From the above information I decided to simply shut down one hypervisor and see what happens. The engine did start back up again a few minutes later.
The interesting part is that each hypervisor seems to think the other is a better host. The two machines are identical, so there's no reason I can see for this odd behaviour. In a lab environment this is little more than an annoying inconvenience. In a production environment it would be completely unacceptable.

May I suggest that this issue be looked into and some means found to eliminate this kind of mutual stand-off? e.g. after a few minutes of such a deadlock, one hypervisor could be randomly given a slightly higher weighting, which should result in it being chosen to start the engine.

regards,
John
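To make the suggestion concrete, the tie-breaker could look something like the following Python sketch. This is not oVirt's actual scoring code; the 2400 base score comes from the logs above, while the grace period and jitter range are made-up values for illustration:

```python
import random

BASE_SCORE = 2400      # healthy-host score, as seen in the agent logs above
DEADLOCK_GRACE = 300   # seconds of mutual EngineDown before jitter kicks in (hypothetical)

def effective_score(base_score, seconds_deadlocked, rng=random):
    """Return the host's score, adding a small random bonus once every
    host has been stuck reporting EngineDown for longer than the grace
    period, so that exactly one host becomes the best candidate."""
    if seconds_deadlocked <= DEADLOCK_GRACE:
        return base_score
    # The jitter is tiny relative to the score, so hosts with meaningfully
    # different base scores are unaffected; it mainly breaks exact ties.
    return base_score + rng.randint(1, 50)
```

With two identical hosts, after the grace period the effective scores diverge, and whichever host drew the larger bonus is unambiguously chosen to start the engine.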

----- Original Message -----
From: "John Gardeniers" <jgardeniers@objectmastery.com>
To: "users" <users@ovirt.org>
Sent: Wednesday, July 23, 2014 4:29:45 PM
Subject: [ovirt-users] Self-hosted engine won't start
From the above information I decided to simply shut down one hypervisor and see what happens. The engine did start back up again a few minutes later.
I've seen this behavior, too.

Jason
regards, John

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users

Hello John,

On Mi, 2014-07-23 at 19:47 -0400, Jason Brooks wrote:

The interesting part is that each hypervisor seems to think the other is a better host.

Where do you get this from? From the line: 'Best remote host 192.168.19.20 (id: 1, score: 2400)'?

I assume this is not the case; the HA broker is just looking for the best remote candidate.

But I also have trouble with this behaviour, especially when I had the cluster in global maintenance. I resolve this by starting the hosted engine manually while still in global maintenance, waiting for {"health": "good", "vm": "up", "detail": "up"}, and disabling global maintenance afterwards.

I found the HA feature is indeed working - I tested it by manually stopping the engine service (service hosted-engine stop). IIRC this should trigger a failover and reboot of the engine.
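The wait-for-health step described above can be scripted rather than watched by hand. A minimal sketch follows; the assumptions are that the status text comes from `hosted-engine --vm-status`, whose exact output format varies between versions, so the plain substring check is illustrative only:

```python
import subprocess
import time

# The health markers to wait for before leaving global maintenance.
TARGET = {"health": "good", "vm": "up", "detail": "up"}

def engine_healthy(status_text):
    """True when every target key/value pair appears in the status output."""
    return all('"%s": "%s"' % (k, v) in status_text for k, v in TARGET.items())

def wait_for_engine(timeout=600, interval=10):
    """Poll `hosted-engine --vm-status` until the engine VM reports healthy."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(["hosted-engine", "--vm-status"],
                             capture_output=True, text=True).stdout
        if engine_healthy(out):
            return True
        time.sleep(interval)
    return False
```

Only once wait_for_engine() returns True would you run hosted-engine --set-maintenance=none to leave global maintenance.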
Cheers,
Daniel

--
Daniel Helgenberger
m box bewegtbild GmbH
P: +49/30/2408781-22 F: +49/30/2408781-10
ACKERSTR. 19
D-10115 BERLIN
www.m-box.de www.monkeymen.tv
Geschäftsführer: Martin Retschitzegger / Michaela Göllner
Handelsregister: Amtsgericht Charlottenburg / HRB 112767

Hi Daniel,

As per my original post, each host believed the *other* is a better candidate, with the result that neither would start the engine. As you may have read by now, the bug has been confirmed and a fix has been proposed.

Your claim that HA is working is incorrect. A system that requires manual intervention when something goes wrong is not HA.

regards,
John

On 18/08/14 19:18, Daniel Helgenberger wrote:
The interesting part is that each hypervisor seems to think the other is a better host.

Where do you get this from? From the line: 'Best remote host 192.168.19.20 (id: 1, score: 2400)'?

I assume this is not the case; the HA broker is just looking for the best remote candidate.

But I also have trouble with this behaviour, especially when I had the cluster in global maintenance. I resolve this by starting the hosted engine manually while still in global maintenance, waiting for {"health": "good", "vm": "up", "detail": "up"}, and disabling global maintenance afterwards.

I found the HA feature is indeed working - I tested it by manually stopping the engine service (service hosted-engine stop). IIRC this should trigger a failover and reboot of the engine.
Cheers, Daniel

On Di, 2014-08-19 at 07:42 +1000, John Gardeniers wrote:

Hi Daniel,

As per my original post, each host believed the *other* is a better candidate, with the result that neither would start the engine. As you may have read by now, the bug has been confirmed and a fix has been proposed.

Indeed! I ran into this bug also, and I applied Jiri's fix too. However, for some reason one of my hosts showed a score of 2000; this is why it was working for me, it seems.
Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users =20 Cheers,=20 Daniel =20 =20
--
Daniel Helgenberger
m box bewegtbild GmbH

P: +49/30/2408781-22
F: +49/30/2408781-10

ACKERSTR. 19
D-10115 BERLIN

www.m-box.de www.monkeymen.tv

Geschäftsführer: Martin Retschitzegger / Michaela Göllner
Handelsregister: Amtsgericht Charlottenburg / HRB 112767

Hi,

please provide the exact version of ovirt-hosted-engine-ha and all logs from /var/log/ovirt-hosted-engine-ha/.

Thank you,
Jirka

On 07/24/2014 01:29 AM, John Gardeniers wrote:
Hi All,
I have created a lab with 2 hypervisors and a self-hosted engine. Today I followed the upgrade instructions as described in http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I didn't really do an upgrade but simply wanted to test what would happen when the engine was rebooted.
When the engine didn't restart I re-ran hosted-engine --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't restarted, so I then tried rebooting both hypervisors. After an hour there was still no sign of the engine starting. The agent logs don't help me much. The following bits are repeated over and over.
ovirt1 (192.168.19.20):
MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net' MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400) MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.21 (id: 2, score: 2400)
ovirt2 (192.168.19.21):
MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net' MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400) MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.20 (id: 1, score: 2400)
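As an aside, the repeated state/score lines can be pulled out of a large agent.log with a short script. This is a throwaway illustrative sketch, assuming exactly the line format quoted above:

```python
import re

# Matches the two start_monitoring lines quoted above, e.g.
# "Current state EngineDown (score: 2400)" and
# "Best remote host 192.168.19.21 (id: 2, score: 2400)"
STATE_RE = re.compile(r"Current state (\w+) \(score: (\d+)\)")
REMOTE_RE = re.compile(r"Best remote host ([\d.]+) \(id: (\d+), score: (\d+)\)")

def summarize(lines):
    """Return (state, own_score, remote_host, remote_score) tuples."""
    state = score = None
    out = []
    for line in lines:
        m = STATE_RE.search(line)
        if m:
            state, score = m.group(1), int(m.group(2))
            continue
        m = REMOTE_RE.search(line)
        if m and state is not None:
            out.append((state, score, m.group(1), int(m.group(3))))
    return out

sample = [
    "...HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)",
    "...HostedEngine::(start_monitoring) Best remote host 192.168.19.21 (id: 2, score: 2400)",
]
print(summarize(sample))  # [('EngineDown', 2400, '192.168.19.21', 2400)]
```

Running it over both hosts' logs makes the symmetry obvious at a glance: each side is EngineDown with score 2400 and sees a remote host that is no better.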
From the above information I decided to simply shut down one hypervisor and see what happens. The engine did start back up again a few minutes later.
The interesting part is that each hypervisor seems to think the other is a better host. The two machines are identical, so there's no reason I can see for this odd behaviour. In a lab environment this is little more than an annoying inconvenience. In a production environment it would be completely unacceptable.
May I suggest that this issue be looked into and some means found to eliminate this kind of mutual exclusion? e.g. After a few minutes of such an issue one hypervisor could be randomly given a slightly higher weighting, which should result in it being chosen to start the engine.
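The standoff and the random-weighting suggestion can both be sketched in a few lines of Python. This is purely illustrative, not the actual ovirt-ha-agent scoring code:

```python
import random

def would_start(own_score, best_remote_score):
    # Toy rule: a host starts the engine VM only if it looks strictly
    # better than the best remote host. Two identical hosts both
    # report 2400, so each defers to the other forever.
    return own_score > best_remote_score

def tie_break(scores, seed=None):
    # The suggestion above, sketched: when all scores are equal, give
    # one host a small random bump so a single winner emerges.
    rng = random.Random(seed)
    scores = dict(scores)
    if len(set(scores.values())) == 1:
        lucky = rng.choice(sorted(scores))
        scores[lucky] += 1
    return max(sorted(scores), key=lambda h: scores[h])

hosts = {"ovirt1": 2400, "ovirt2": 2400}
print(would_start(2400, 2400))   # False on both hosts: the deadlock
print(tie_break(hosts, seed=1))  # exactly one host wins
```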
regards, John

Hi Jiri,

Perhaps you can tell me how to determine the exact version of ovirt-hosted-engine-ha. As for the logs, I am not going to attach 60MB of logs to an email, nor can I see any imaginable reason for you wanting to see them all, as the bulk is historical. I have already included the *relevant* sections. However, if you think there may be some other section that may help you, feel free to be more explicit about what you are looking for. Right now I fail to understand what you might hope to see in logs from several weeks ago that you can't get from the last day or so.

regards,
John

On 24/07/14 19:10, Jiri Moskovcak wrote:
Hi, please provide the exact version of ovirt-hosted-engine-ha and all logs from /var/log/ovirt-hosted-engine-ha/
Thank you, Jirka
On 07/24/2014 01:29 AM, John Gardeniers wrote:
Hi All,
I have created a lab with 2 hypervisors and a self-hosted engine. Today I followed the upgrade instructions as described in http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I didn't really do an upgrade but simply wanted to test what would happen when the engine was rebooted.
When the engine didn't restart I re-ran hosted-engine --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't restarted, so I then tried rebooting both hypervisors. After an hour there was still no sign of the engine starting. The agent logs don't help me much. The following bits are repeated over and over.
ovirt1 (192.168.19.20):
MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net' MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400) MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host 192.168.19.21 (id: 2, score: 2400)
ovirt2 (192.168.19.21):
MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net' MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400) MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host 192.168.19.20 (id: 1, score: 2400)
From the above information I decided to simply shut down one hypervisor and see what happens. The engine did start back up again a few minutes later.
The interesting part is that each hypervisor seems to think the other is a better host. The two machines are identical, so there's no reason I can see for this odd behaviour. In a lab environment this is little more than an annoying inconvenience. In a production environment it would be completely unacceptable.
May I suggest that this issue be looked into and some means found to eliminate this kind of mutual exclusion? e.g. After a few minutes of such an issue one hypervisor could be randomly given a slightly higher weighting, which should result in it being chosen to start the engine.
regards, John
______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________

On 07/24/2014 11:37 PM, John Gardeniers wrote:
Hi Jiri,
Perhaps you can tell me how to determine the exact version of ovirt-hosted-engine-ha.
CentOS/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
As for the logs, I am not going to attach 60MB of logs to an email,
- there are other ways to share the logs
nor can I see any imaginable reason for you wanting to see them all, as the bulk is historical. I have already included the *relevant* sections. However, if you think there may be some other section that may help you, feel free to be more explicit about what you are looking for. Right now I fail to understand what you might hope to see in logs from several weeks ago that you can't get from the last day or so.
It's the standard way; people tend to think they know which part of a log is relevant, but in many cases they fail. Asking for the whole logs has proven faster than trying to find the relevant part through the user. And you're right, I don't need the logs from last week, just the logs since the last start of the services when you observed the problem.

Regards,
Jirka
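Trimming agent.log down to entries after a given point (e.g. the last service restart) can be done with a short script. This is a sketch; it assumes the "MainThread::INFO::<timestamp>::..." line format shown earlier:

```python
from datetime import datetime

CUTOFF = datetime(2014, 7, 24, 9, 0, 0)  # e.g. when the services were last restarted

def after_cutoff(line):
    # agent.log lines look like:
    #   MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::...
    # so the timestamp is the third '::'-separated field.
    parts = line.split("::")
    if len(parts) < 3:
        return False
    try:
        ts = datetime.strptime(parts[2].split(",")[0], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    return ts >= CUTOFF

lines = [
    "MainThread::INFO::2014-07-20 01:00:00,000::old::1::(x) stale entry",
    "MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::(notify) recent entry",
]
recent = [l for l in lines if after_cutoff(l)]
print(recent)  # only the 2014-07-24 entry survives
```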
regards, John
On 24/07/14 19:10, Jiri Moskovcak wrote:
Hi, please provide the exact version of ovirt-hosted-engine-ha and all logs from /var/log/ovirt-hosted-engine-ha/
Thank you, Jirka
On 07/24/2014 01:29 AM, John Gardeniers wrote:
Hi All,
I have created a lab with 2 hypervisors and a self-hosted engine. Today I followed the upgrade instructions as described in http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I didn't really do an upgrade but simply wanted to test what would happen when the engine was rebooted.
When the engine didn't restart I re-ran hosted-engine --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't restarted, so I then tried rebooting both hypervisors. After an hour there was still no sign of the engine starting. The agent logs don't help me much. The following bits are repeated over and over.
ovirt1 (192.168.19.20):
MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net' MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400) MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host 192.168.19.21 (id: 2, score: 2400)
ovirt2 (192.168.19.21):
MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net' MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400) MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host 192.168.19.20 (id: 1, score: 2400)
From the above information I decided to simply shut down one hypervisor and see what happens. The engine did start back up again a few minutes later.
The interesting part is that each hypervisor seems to think the other is a better host. The two machines are identical, so there's no reason I can see for this odd behaviour. In a lab environment this is little more than an annoying inconvenience. In a production environment it would be completely unacceptable.
May I suggest that this issue be looked into and some means found to eliminate this kind of mutual exclusion? e.g. After a few minutes of such an issue one hypervisor could be randomly given a slightly higher weighting, which should result in it being chosen to start the engine.
regards, John

Hi Jirka,

Version: ovirt-hosted-engine-ha-1.1.5-1.el6.noarch

Attached are the logs. Thanks for looking.

Regards,
John

On 25/07/14 17:47, Jiri Moskovcak wrote:
On 07/24/2014 11:37 PM, John Gardeniers wrote:
Hi Jiri,
Perhaps you can tell me how to determine the exact version of ovirt-hosted-engine-ha.
CentOS/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
As for the logs, I am not going to attach 60MB of logs to an email,
- there are other ways to share the logs
nor can I see any imaginable reason for you wanting to see them all, as the bulk is historical. I have already included the *relevant* sections. However, if you think there may be some other section that may help you, feel free to be more explicit about what you are looking for. Right now I fail to understand what you might hope to see in logs from several weeks ago that you can't get from the last day or so.
It's the standard way; people tend to think they know which part of a log is relevant, but in many cases they fail. Asking for the whole logs has proven faster than trying to find the relevant part through the user. And you're right, I don't need the logs from last week, just the logs since the last start of the services when you observed the problem.
Regards, Jirka
regards, John
On 24/07/14 19:10, Jiri Moskovcak wrote:
Hi, please provide the exact version of ovirt-hosted-engine-ha and all logs from /var/log/ovirt-hosted-engine-ha/
Thank you, Jirka
On 07/24/2014 01:29 AM, John Gardeniers wrote:
Hi All,
I have created a lab with 2 hypervisors and a self-hosted engine. Today I followed the upgrade instructions as described in http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I didn't really do an upgrade but simply wanted to test what would happen when the engine was rebooted.
When the engine didn't restart I re-ran hosted-engine --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't restarted, so I then tried rebooting both hypervisors. After an hour there was still no sign of the engine starting. The agent logs don't help me much. The following bits are repeated over and over.
ovirt1 (192.168.19.20):
MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net' MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400) MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host 192.168.19.21 (id: 2, score: 2400)
ovirt2 (192.168.19.21):
MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net' MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400) MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host 192.168.19.20 (id: 1, score: 2400)
From the above information I decided to simply shut down one hypervisor and see what happens. The engine did start back up again a few minutes later.
The interesting part is that each hypervisor seems to think the other is a better host. The two machines are identical, so there's no reason I can see for this odd behaviour. In a lab environment this is little more than an annoying inconvenience. In a production environment it would be completely unacceptable.
May I suggest that this issue be looked into and some means found to eliminate this kind of mutual exclusion? e.g. After a few minutes of such an issue one hypervisor could be randomly given a slightly higher weighting, which should result in it being chosen to start the engine.
regards, John

Hi John,

thanks for the logs. It seems the engine was running on host2, which decided that it didn't have the best score and shut the engine down; after that, neither host wanted to start the VM until you restarted host2. Unfortunately the logs don't contain the part from host1 from 2014-07-24 09:XX, which I'd like to investigate because it might show why host1 refused to start the VM when host2 killed it.

Regards,
Jirka

On 07/28/2014 02:57 AM, John Gardeniers wrote:
Hi Jirka,
Version: ovirt-hosted-engine-ha-1.1.5-1.el6.noarch
Attached are the logs. Thanks for looking.
Regards, John
On 25/07/14 17:47, Jiri Moskovcak wrote:
On 07/24/2014 11:37 PM, John Gardeniers wrote:
Hi Jiri,
Perhaps you can tell me how to determine the exact version of ovirt-hosted-engine-ha.
CentOS/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
As for the logs, I am not going to attach 60MB of logs to an email,
- there are other ways to share the logs
nor can I see any imaginable reason for you wanting to see them all, as the bulk is historical. I have already included the *relevant* sections. However, if you think there may be some other section that may help you, feel free to be more explicit about what you are looking for. Right now I fail to understand what you might hope to see in logs from several weeks ago that you can't get from the last day or so.
It's the standard way; people tend to think they know which part of a log is relevant, but in many cases they fail. Asking for the whole logs has proven faster than trying to find the relevant part through the user. And you're right, I don't need the logs from last week, just the logs since the last start of the services when you observed the problem.
Regards, Jirka
regards, John
On 24/07/14 19:10, Jiri Moskovcak wrote:
Hi, please provide the exact version of ovirt-hosted-engine-ha and all logs from /var/log/ovirt-hosted-engine-ha/
Thank you, Jirka
On 07/24/2014 01:29 AM, John Gardeniers wrote:
Hi All,
I have created a lab with 2 hypervisors and a self-hosted engine. Today I followed the upgrade instructions as described in http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I didn't really do an upgrade but simply wanted to test what would happen when the engine was rebooted.
When the engine didn't restart I re-ran hosted-engine --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't restarted, so I then tried rebooting both hypervisors. After an hour there was still no sign of the engine starting. The agent logs don't help me much. The following bits are repeated over and over.
ovirt1 (192.168.19.20):
MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net' MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400) MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host 192.168.19.21 (id: 2, score: 2400)
ovirt2 (192.168.19.21):
MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net'
MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host 192.168.19.20 (id: 1, score: 2400)
From the above information I decided to simply shut down one hypervisor and see what happens. The engine did start back up again a few minutes later.
The interesting part is that each hypervisor seems to think the other is a better host. The two machines are identical, so there's no reason I can see for this odd behaviour. In a lab environment this is little more than an annoying inconvenience. In a production environment it would be completely unacceptable.
May I suggest that this issue be looked into and some means found to eliminate this kind of deadlock? e.g. after a few minutes of such a standoff one hypervisor could be randomly given a slightly higher weighting, which should result in it being chosen to start the engine.
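The standoff above, and the randomized-weighting suggestion, can be sketched as follows. This is illustrative Python only, not oVirt's actual state machine; the host names and the 2400 scores come from the logs above, everything else (the function names, the jitter range) is an assumption:

```python
import random

def pick_starter(scores):
    """A host starts the engine only if it has the strictly best
    score; a tie means every host defers (the standoff seen above)."""
    best = max(scores.values())
    leaders = [h for h, s in scores.items() if s == best]
    return leaders[0] if len(leaders) == 1 else None

scores = {"ovirt1": 2400, "ovirt2": 2400}
assert pick_starter(scores) is None  # identical scores: nobody starts

# Suggested fix: after a grace period, add a small *distinct* random
# weighting per host so exactly one host ends up strictly ahead.
jitter = dict(zip(scores, random.sample(range(1, 11), len(scores))))
nudged = {h: s + jitter[h] for h, s in scores.items()}
assert pick_starter(nudged) in scores  # a unique winner now exists
```

random.sample guarantees distinct jitter values, so equal base scores always resolve to a single winner.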
regards, John
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users

Hi Jiri,

Sorry, I can't supply the log because the hosts have been recycled, but I'm sure it would have contained exactly the same information that you already have from host2. It's a classic deadlock situation that should never be allowed to happen. A simple and time-proven solution was in my original post.

The reason for recycling the hosts is that I discovered yesterday that although the engine was still running it could not be accessed in any way. Upon further finding that there was no way to get it restarted, I decided to abandon the whole idea of self-hosting until such time as I see an indication that it's production ready.

regards, John

On 29/07/14 22:52, Jiri Moskovcak wrote:
Hi John, thanks for the logs. It seems the engine is running on host2, which decides that it doesn't have the best score and shuts the engine down, and then neither of them wants to start the vm until you restart host2. Unfortunately the logs don't contain the part from host1 from 2014-07-24 09:XX which I'd like to investigate, because it might contain the information why host1 refused to start the vm when host2 killed it.
Regards, Jirka
On 07/28/2014 02:57 AM, John Gardeniers wrote:
Hi Jirka,
Version: ovirt-hosted-engine-ha-1.1.5-1.el6.noarch
Attached are the logs. Thanks for looking.
Regards, John
On 25/07/14 17:47, Jiri Moskovcak wrote:
On 07/24/2014 11:37 PM, John Gardeniers wrote:
Hi Jiri,
Perhaps you can tell me how to determine the exact version of ovirt-hosted-engine-ha.
Centos/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
As for the logs, I am not going to attach 60MB of logs to an email,
- there are other ways to share the logs
nor can I see any imaginable reason for you wanting to see them all, as the bulk is historical. I have already included the *relevant* sections. However, if you think there may be some other section that may help you, feel free to be more explicit about what you are looking for. Right now I fail to understand what you might hope to see in logs from several weeks ago that you can't get from the last day or so.
It's standard practice: people tend to think they know which part of a log is relevant, but in many cases they're wrong. Asking for the whole logs has proven to be faster than trying to find the relevant part through the user. And you're right, I don't need the logs from last week, just the logs since the last start of the services when you observed the problem.
Regards, Jirka
regards, John
On 24/07/14 19:10, Jiri Moskovcak wrote:
Hi, please provide the exact versions of ovirt-hosted-engine-ha and all logs from /var/log/ovirt-hosted-engine-ha/
Thank you, Jirka
On 07/24/2014 01:29 AM, John Gardeniers wrote:
Hi All,
[...]
regards, John

Hi John, after a deeper look I realized that you're probably facing [1]. The patch is ready and I will also backport it to 3.4 branch.

--Jirka

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1093638

On 07/29/2014 11:41 PM, John Gardeniers wrote:
Hi Jiri,
Sorry, I can't supply the log because the hosts have been recycled but I'm sure it would have contained exactly the same information that you already have from host2. It's a classic deadlock situation that should never be allowed to happen. A simple and time proven solution was in my original post.
The reason for recycling the hosts is that I discovered yesterday that although the engine was still running it could not be accessed in any way. Upon further finding that there was no way to get it restarted I decided to abandon the whole idea of self-hosting until such time as I see an indication that it's production ready.
regards, John
On 29/07/14 22:52, Jiri Moskovcak wrote:
Hi John, thanks for the logs. It seems the engine is running on host2, which decides that it doesn't have the best score and shuts the engine down, and then neither of them wants to start the vm until you restart host2. Unfortunately the logs don't contain the part from host1 from 2014-07-24 09:XX which I'd like to investigate, because it might contain the information why host1 refused to start the vm when host2 killed it.
Regards, Jirka
On 07/28/2014 02:57 AM, John Gardeniers wrote:
Hi Jirka,
Version: ovirt-hosted-engine-ha-1.1.5-1.el6.noarch
Attached are the logs. Thanks for looking.
Regards, John
On 25/07/14 17:47, Jiri Moskovcak wrote:
On 07/24/2014 11:37 PM, John Gardeniers wrote:
Hi Jiri,
Perhaps you can tell me how to determine the exact version of ovirt-hosted-engine-ha.
Centos/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
As for the logs, I am not going to attach 60MB of logs to an email,
- there are other ways to share the logs
nor can I see any imaginable reason for you wanting to see them all, as the bulk is historical. I have already included the *relevant* sections. However, if you think there may be some other section that may help you, feel free to be more explicit about what you are looking for. Right now I fail to understand what you might hope to see in logs from several weeks ago that you can't get from the last day or so.
It's standard practice: people tend to think they know which part of a log is relevant, but in many cases they're wrong. Asking for the whole logs has proven to be faster than trying to find the relevant part through the user. And you're right, I don't need the logs from last week, just the logs since the last start of the services when you observed the problem.
Regards, Jirka
regards, John
On 24/07/14 19:10, Jiri Moskovcak wrote:
Hi, please provide the exact versions of ovirt-hosted-engine-ha and all logs from /var/log/ovirt-hosted-engine-ha/
Thank you, Jirka
On 07/24/2014 01:29 AM, John Gardeniers wrote:
> [...]

Hi Jirka,

Thanks for the update. It sounds like the same bug but with a few extra issues thrown in. e.g. Comment 9 seems to me to be a completely separate bug, although it may affect the issue I reported.

I can't see any mention of how the problem is being resolved, which I am interested in, but will keep an eye on it. I'll try the patched version when I get the time and enthusiasm to give it another crack.

regards, John

On 14/08/14 22:57, Jiri Moskovcak wrote:
Hi John, after a deeper look I realized that you're probably facing [1]. The patch is ready and I will also backport it to 3.4 branch.
--Jirka
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1093638
On 07/29/2014 11:41 PM, John Gardeniers wrote:
Hi Jiri,
Sorry, I can't supply the log because the hosts have been recycled but I'm sure it would have contained exactly the same information that you already have from host2. It's a classic deadlock situation that should never be allowed to happen. A simple and time proven solution was in my original post.
The reason for recycling the hosts is that I discovered yesterday that although the engine was still running it could not be accessed in any way. Upon further finding that there was no way to get it restarted I decided to abandon the whole idea of self-hosting until such time as I see an indication that it's production ready.
regards, John
On 29/07/14 22:52, Jiri Moskovcak wrote:
Hi John, thanks for the logs. It seems the engine is running on host2, which decides that it doesn't have the best score and shuts the engine down, and then neither of them wants to start the vm until you restart host2. Unfortunately the logs don't contain the part from host1 from 2014-07-24 09:XX which I'd like to investigate, because it might contain the information why host1 refused to start the vm when host2 killed it.
Regards, Jirka
On 07/28/2014 02:57 AM, John Gardeniers wrote:
Hi Jirka,
Version: ovirt-hosted-engine-ha-1.1.5-1.el6.noarch
Attached are the logs. Thanks for looking.
Regards, John
On 25/07/14 17:47, Jiri Moskovcak wrote:
On 07/24/2014 11:37 PM, John Gardeniers wrote:
Hi Jiri,
Perhaps you can tell me how to determine the exact version of ovirt-hosted-engine-ha.
Centos/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
As for the logs, I am not going to attach 60MB of logs to an email,
- there are other ways to share the logs
nor can I see any imaginable reason for you wanting to see them all, as the bulk is historical. I have already included the *relevant* sections. However, if you think there may be some other section that may help you, feel free to be more explicit about what you are looking for. Right now I fail to understand what you might hope to see in logs from several weeks ago that you can't get from the last day or so.
It's standard practice: people tend to think they know which part of a log is relevant, but in many cases they're wrong. Asking for the whole logs has proven to be faster than trying to find the relevant part through the user. And you're right, I don't need the logs from last week, just the logs since the last start of the services when you observed the problem.
Regards, Jirka
regards, John
On 24/07/14 19:10, Jiri Moskovcak wrote:
> [...]

Hi John, this is the patch fixing your problem [1]. It can be found at the top of that bz page. It's really a simple change, so if you want you can just change it manually on your system without waiting for a patched version.

--Jirka

[1] http://gerrit.ovirt.org/#/c/31510/2/ovirt_hosted_engine_ha/agent/states.py

On 08/18/2014 12:17 AM, John Gardeniers wrote:
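For readers without the gerrit page at hand, the general shape of such a tie-breaking fix can be sketched as below. This is an assumption for illustration only, not the actual contents of the states.py patch; the rule of breaking equal scores by lowest host id is invented here:

```python
def should_start_engine(local_id, local_score, remote_scores):
    """Decide whether this host may start the engine VM.

    Sketch of a deterministic tiebreak: a host starts if no remote
    strictly beats its score, and among equally scored hosts the
    lowest host id wins (assumed rule, not the real patch)."""
    best_remote = max(remote_scores.values(), default=0)
    if local_score > best_remote:
        return True
    if local_score == best_remote:
        tied = [i for i, s in remote_scores.items() if s == best_remote]
        return local_id < min(tied)  # deterministic winner among ties
    return False

# Both hosts at 2400: host 1 starts, host 2 defers -- no standoff.
assert should_start_engine(1, 2400, {2: 2400}) is True
assert should_start_engine(2, 2400, {1: 2400}) is False
```

Because the tiebreak is a pure function of host ids, both agents reach the same decision independently, with no extra coordination needed.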
Hi Jirka,
Thanks for the update. It sounds like the same bug but with a few extra issues thrown in. e.g. Comment 9 seems to me to be a completely separate bug, although it may affect the issue I reported.
I can't see any mention of how the problem is being resolved, which I am interested in, but will keep an eye on it.
I'll try the patched version when I get the time and enthusiasm to give it another crack.
regards, John
On 14/08/14 22:57, Jiri Moskovcak wrote:
Hi John, after a deeper look I realized that you're probably facing [1]. The patch is ready and I will also backport it to 3.4 branch.
--Jirka
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1093638
On 07/29/2014 11:41 PM, John Gardeniers wrote:
Hi Jiri,
Sorry, I can't supply the log because the hosts have been recycled but I'm sure it would have contained exactly the same information that you already have from host2. It's a classic deadlock situation that should never be allowed to happen. A simple and time proven solution was in my original post.
The reason for recycling the hosts is that I discovered yesterday that although the engine was still running it could not be accessed in any way. Upon further finding that there was no way to get it restarted I decided to abandon the whole idea of self-hosting until such time as I see an indication that it's production ready.
regards, John
On 29/07/14 22:52, Jiri Moskovcak wrote:
Hi John, thanks for the logs. It seems the engine is running on host2, which decides that it doesn't have the best score and shuts the engine down, and then neither of them wants to start the vm until you restart host2. Unfortunately the logs don't contain the part from host1 from 2014-07-24 09:XX which I'd like to investigate, because it might contain the information why host1 refused to start the vm when host2 killed it.
Regards, Jirka
On 07/28/2014 02:57 AM, John Gardeniers wrote:
Hi Jirka,
Version: ovirt-hosted-engine-ha-1.1.5-1.el6.noarch
Attached are the logs. Thanks for looking.
Regards, John
On 25/07/14 17:47, Jiri Moskovcak wrote:
On 07/24/2014 11:37 PM, John Gardeniers wrote:
> Hi Jiri,
>
> Perhaps you can tell me how to determine the exact version of ovirt-hosted-engine-ha.
Centos/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
> As for the logs, I am not going to attach 60MB of logs to an email,
- there are other ways to share the logs
> nor can I see any imaginable reason for you wanting to see them all, as the bulk is historical. I have already included the *relevant* sections. However, if you think there may be some other section that may help you, feel free to be more explicit about what you are looking for. Right now I fail to understand what you might hope to see in logs from several weeks ago that you can't get from the last day or so.
It's standard practice: people tend to think they know which part of a log is relevant, but in many cases they're wrong. Asking for the whole logs has proven to be faster than trying to find the relevant part through the user. And you're right, I don't need the logs from last week, just the logs since the last start of the services when you observed the problem.
Regards, Jirka
> regards,
> John
>
> On 24/07/14 19:10, Jiri Moskovcak wrote:
> [...]
participants (4)
- Daniel Helgenberger
- Jason Brooks
- Jiri Moskovcak
- John Gardeniers