oVirt 3.4 - Hosted Engine: Cluster Reboot procedure

Hello ovirt-users,

after playing around with my oVirt 3.4 hosted engine two-node HA cluster, I have devised a procedure for restarting the whole cluster after a power loss or a normal shutdown. It assumes all HA nodes have been taken offline. Parts of it also apply to individually rebooted HA nodes.

Please feel free to ask questions and/or suggest improvements. Most of this should be made obsolete by future updates anyway.

Note 1: The problem, IMHO, seems to be the NFS storage domain not being connected, which makes the HA agent crash or hang. The ha-broker service should be up and running at all times. Please check this.
Note 2: My setup consists of two nodes; 'all nodes' means the task has to be performed on every HA node in the cluster.
Note 3: By 'login' I mean SSH or local access.

Part A: SHUTDOWN THE CLUSTER
Prerequisite: oVirt HE cluster running; it is to be taken offline for maintenance.
1. In oVirt, shut down all VMs except HostedEngine.
2. Log in to one cluster node and run 'hosted-engine --set-maintenance --mode=global' to put the cluster into global maintenance.
3. Log in to the oVirt engine VM and shut it down with 'shutdown -h now'.
4. Log in to one cluster node and run 'hosted-engine --vm-status' to check that the engine is really down.
5. Then shut down all HA nodes.

Part B: STARTING THE CLUSTER
Prerequisite: oVirt HE cluster down, NFS storage server running and exporting the vdsm share.
1. Start all nodes and wait for them to boot up.
2. Log in to one cluster node. Check the status of the following services: vdsm, ovirt-ha-agent, ovirt-ha-broker. All of them should be running except ovirt-ha-agent, which is down and shown in a 'locked' state.
3. Check 'hosted-engine --vm-status'; this should result in a Python stack trace (crash).
4. On all cluster nodes, connect the storage pool: 'hosted-engine --connect-storage'. Now 'hosted-engine --vm-status' runs and reports 'up to date: False' and 'unknown-stale-data' for all nodes.
5. On all cluster nodes, start the 'ovirt-ha-agent' service: 'service ovirt-ha-agent start'
6. Wait a few minutes for the ha-broker and the agent to collect the cluster state.
7. Log in to one cluster node. Check 'hosted-engine --vm-status' until the cluster nodes report 'status-up-to-date: True' and 'score: 2400'.
8. If you shut the cluster down yourself and it is in global maintenance, remove the maintenance mode with 'hosted-engine --set-maintenance --mode=none'. The system should now reinitialize its FSM and start the HostedEngine by itself.¹ If it was not in maintenance (e.g. after a power failure), the engine should be started as soon as one host reaches a score of 2400.

Part C: STARTING A SINGLE NODE
Prerequisite: oVirt HE cluster up, HostedEngine running. One HA node was taken offline via local maintenance in oVirt and rebooted.
1. Follow steps 1-5 of Part B.
2. In oVirt, navigate to Cluster, Hosts and activate the node that was in maintenance.

---
¹ I observed the following:
* If you use the command 'hosted-engine --vm-shutdown' instead of logging in to the oVirt HE VM and shutting it down locally, the Default Data Center is set to Non Responsive and is Contending after the reboot. I strongly suspect that the command causes an unclean shutdown. In addition, it waits about two minutes before shutting down.
* If you run 'hosted-engine --vm-start' on a cluster in global maintenance, wait for a successful start ({'health': 'good', 'vm': 'up', 'detail': 'up'}) and then remove the maintenance status, the engine gets restarted once. If you remove the maintenance first and let ha-agent do the work, the engine is not restarted.

Cheers,
Daniel

--
Daniel Helgenberger
m box bewegtbild GmbH
P: +49/30/2408781-22
F: +49/30/2408781-10
ACKERSTR. 19
D-10115 BERLIN
www.m-box.de www.monkeymen.tv
Managing directors: Martin Retschitzegger / Michaela Göllner
Commercial register: Amtsgericht Charlottenburg / HRB 112767
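The waiting part of Part B (steps 6 and 7) in the message above can be sketched as a small shell helper. This is only a sketch under assumptions, not a tested procedure: the function names `cluster_ready` and `wait_until_ready` and the `STATUS_CMD` variable are made up for illustration, and the patterns assume the status output contains fields that look like 'up-to-date : True' and 'Score : 2400', as quoted in the message. Note that the check is loose: it only looks for both strings somewhere in the output, not on the same host.

```shell
#!/bin/sh
# Hypothetical helper for Part B, steps 6-7. Defines functions only.
# STATUS_CMD can be overridden to try the logic without a cluster.
STATUS_CMD=${STATUS_CMD:-"hosted-engine --vm-status"}

# Succeeds when the status text shows an up-to-date entry and a score of 2400.
# ASSUMPTION: the field layout matches the strings quoted in the message.
cluster_ready() {
    # A non-zero exit (e.g. the stack trace from step 3) counts as "not ready".
    status=$($STATUS_CMD 2>/dev/null) || return 1
    printf '%s\n' "$status" |
        grep -Eiq 'up-to-date[[:space:]]*:[[:space:]]*True' &&
    printf '%s\n' "$status" |
        grep -Eiq 'score[[:space:]]*:[[:space:]]*2400([^0-9]|$)'
}

# Poll cluster_ready at most $1 times, sleeping $2 seconds between tries.
wait_until_ready() {
    tries=$1
    delay=$2
    while [ "$tries" -gt 0 ]; do
        cluster_ready && return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

On each node, after 'hosted-engine --connect-storage' and 'service ovirt-ha-agent start', one could then call `wait_until_ready 30 10` (both numbers are arbitrary) and only run 'hosted-engine --set-maintenance --mode=none' once it succeeds.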

Hi, thanks, great work! Why not put this on a wiki page?
In reply to the message from Daniel Helgenberger <daniel.helgenberger@m-box.de>, 2014-04-25 21:08 GMT+08:00
-- Independence of thought, freedom of spirit. -- Chen Yinke (陈寅恪)
participants (2)
- Daniel Helgenberger
- 适兕