<div dir="ltr">Hi,<div>   Thanks. </div><div>   Great work.</div><div><br></div><div>   Why not add this to the wiki?</div><div><br></div><div> </div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2014-04-25 21:08 GMT+08:00 Daniel Helgenberger <span dir="ltr">&lt;<a href="mailto:daniel.helgenberger@m-box.de" target="_blank">daniel.helgenberger@m-box.de</a>&gt;</span>:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello ovirt-users,<br>

<br>

after playing around with my ovirt 3.4 hosted engine two node HA cluster<br>

I have devised a procedure on how to restart the whole cluster after a<br>

power loss / normal shutdown. This assumes all HA-Nodes have been taken<br>

offline. This also applies partly to rebooted HA nodes.<br>

<br>

Please feel free to ask questions and/or suggest improvements. Most<br>

of this should be made obsolete by future updates anyway.<br>

<br>

Note 1:<br>

The problem, IMHO, seems to be the NFS storage domain not being connected,<br>

which causes the HA agent to crash or hang. The ha-broker service should be<br>

up and running at all times; please check this.<br>

<br>

Note 2:<br>

My setup consists of two nodes; &#39;all nodes&#39; means the task has to be<br>

performed on every HA node in the cluster.<br>

<br>

Note 3:<br>

By &#39;Login&#39; I mean SSH or local access.<br>

<br>

<br>

Part A: SHUTDOWN THE CLUSTER<br>

Prerequisite: oVirt HE cluster running, to be taken offline for<br>

maintenance:<br>

     1. In oVirt, shut down all VMs except HostedEngine.<br>

     2. Login to one cluster node and run &#39;hosted-engine<br>

        --set-maintenance --mode=global&#39; to put the cluster into global<br>

        maintenance<br>

     3. Login to the oVirt engine VM and shut it down with &#39;shutdown -h now&#39;<br>

     4. Login to one cluster node and run &#39;hosted-engine --vm-status&#39; to<br>

        check if the engine is really down.<br>

     5. Finally, shut down all HA nodes.<br>
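<br>
The commands of Part A can be laid out as an ordered plan, which makes it<br>
easy to see which step runs where. This is only a sketch: the host names<br>
(node1, node2, engine) and the function name are hypothetical, and using<br>
&#39;shutdown -h now&#39; to power off the nodes in step 5 is an assumption, since<br>
the procedure does not name a command for that step. Step 1 is done in the<br>
oVirt UI and is therefore not part of the plan.<br>
<pre>
```python
def shutdown_plan(nodes, engine_host):
    """Return Part A as ordered (host, command) pairs; nothing is executed."""
    first = nodes[0]  # steps 2 and 4 only need to run on one cluster node
    plan = [
        (first, "hosted-engine --set-maintenance --mode=global"),  # step 2
        (engine_host, "shutdown -h now"),                          # step 3
        (first, "hosted-engine --vm-status"),                      # step 4
    ]
    # Step 5: power off every HA node (the command itself is an assumption).
    plan.extend((node, "shutdown -h now") for node in nodes)
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for host, command in shutdown_plan(["node1", "node2"], "engine"):
        print(f"{host}: {command}")
```
</pre>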

<br>

<br>

Part B: STARTING THE CLUSTER<br>

Prerequisite: oVirt HE cluster down, NFS storage server running and<br>

exporting the vdsm share.<br>

     1. Start all nodes and wait for them to boot up.<br>

     2. Login to one cluster node. Check the status of the following<br>

        services: vdsm, ovirt-ha-agent, ovirt-ha-broker. All of them<br>

        should be running except ovirt-ha-agent, which is down and in<br>

        the &#39;locked&#39; state.<br>

     3. Check &#39;hosted-engine --vm-status&#39;; at this stage it should fail<br>

        with a Python stack trace (crash).<br>

     4. On all cluster nodes, connect the storage pool: &#39;hosted-engine<br>

        --connect-storage&#39;. Now, &#39;hosted-engine --vm-status&#39; runs and<br>

        reports &#39;up to date: False&#39; and &#39;unknown-stale-data&#39; for all<br>

        nodes.<br>

     5. On all cluster nodes, start the &#39;ovirt-ha-agent&#39; service:<br>

        &#39;service ovirt-ha-agent start&#39;<br>

     6. Wait a few minutes for the ha-broker and the agent to collect<br>

        the cluster state.<br>

     7. Login to one cluster node. Check &#39;hosted-engine --vm-status&#39;<br>

        until you have cluster nodes &#39;status-up-to-date: True&#39; and<br>

        &#39;score: 2400&#39;<br>

     8. If you shut the cluster down yourself and it is still in global<br>

        maintenance, remove the maintenance mode with &#39;hosted-engine<br>

        --set-maintenance --mode=none&#39;. The system should then<br>

        reinitialize its FSM and start the HostedEngine by itself.¹ If it<br>

        was not in maintenance (e.g. after a power failure), the engine<br>

        should be started as soon as one host reaches a score of 2400.<br>
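<br>
The waiting in steps 6 and 7 could be automated by polling the status<br>
output until every host reports up-to-date data and a score of 2400. The<br>
following is a minimal sketch, not a tested tool: the function names are<br>
hypothetical, and the exact layout of the &#39;hosted-engine --vm-status&#39; output<br>
is an assumption, so the parser only looks for a label ending in<br>
&#39;up-to-date&#39; and a &#39;score&#39; label, ignoring case and punctuation.<br>
<pre>
```python
import time


def parse_status(text):
    """Collect the up-to-date flags and scores found in status output."""
    flags, scores = [], []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # Normalise the label so 'Status up-to-date' and
        # 'status-up-to-date' compare equal.
        key = "".join(ch for ch in key.lower() if ch.isalnum())
        value = value.strip()
        if key.endswith("uptodate"):
            flags.append(value == "True")
        elif key == "score" and value.isdigit():
            scores.append(int(value))
    return flags, scores


def cluster_ready(text, target_score=2400):
    """True if at least one host is listed and all are up to date at target score."""
    flags, scores = parse_status(text)
    return (bool(flags) and all(flags)
            and bool(scores) and all(s == target_score for s in scores))


def wait_until_ready(read_status, attempts=30, delay=10, pause=time.sleep):
    """Poll read_status() until cluster_ready() holds or attempts run out."""
    for _ in range(attempts):
        if cluster_ready(read_status()):
            return True
        pause(delay)
    return False
```
</pre>
Here read_status is any callable that returns the status text; on a node it<br>
could, for example, wrap subprocess.run(["hosted-engine", "--vm-status"],<br>
capture_output=True, text=True).stdout.<br>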

<br>

<br>

Part C: STARTING A SINGLE NODE<br>

Prerequisite: oVirt HE cluster up, HostedEngine running. One HA node was<br>

put into local maintenance in oVirt and rebooted.<br>

     1. Follow steps 1-5 of Part B<br>

     2. In oVirt, navigate to Cluster, Hosts and activate the node<br>

        previously in maintenance.<br>
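<br>
For a single rebooted node, the two state-changing commands from steps 4<br>
and 5 of Part B can be chained so that the agent is only started once the<br>
storage is connected. A minimal sketch, assuming both commands signal<br>
failure through a non-zero exit status (the procedure above does not<br>
confirm this); rejoin_node and REJOIN_COMMANDS are hypothetical names.<br>
<pre>
```python
import subprocess

REJOIN_COMMANDS = (
    ["hosted-engine", "--connect-storage"],   # Part B, step 4
    ["service", "ovirt-ha-agent", "start"],   # Part B, step 5
)


def rejoin_node(run=subprocess.call):
    """Run the storage-connect and agent-start commands in order.

    Stops at the first command whose exit status is not zero and
    returns the list of commands that completed successfully.
    """
    done = []
    for command in REJOIN_COMMANDS:
        if run(command) != 0:
            break
        done.append(command)
    return done
```
</pre>
Passing a different run callable allows a dry run without touching the<br>
node; activating the host in oVirt (step 2 above) remains a manual step.<br>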

<br>

---<br>

¹ I observed the following:<br>

      * If you use the command &#39;hosted-engine --vm-shutdown&#39; instead of<br>

        logging in to the oVirt HE and doing a local shutdown, the Default<br>

        Data Center is set to non-responsive and is Contending after<br>

        the reboot. I strongly suspect the command causes an unclean<br>

        shutdown. It also waits about two minutes before shutting down.<br>

      * If you use the command &#39;hosted-engine --vm-start&#39; on a cluster<br>

        in global maintenance, wait for a successful start ({&#39;health&#39;:<br>

        &#39;good&#39;, &#39;vm&#39;: &#39;up&#39;, &#39;detail&#39;: &#39;up&#39;}) and then remove the maintenance<br>

        status, the engine gets restarted once. If you remove the<br>

        maintenance first and let ha-agent do the work, the engine<br>

        is not restarted.<br>
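<br>
The &#39;successful start&#39; check mentioned above can be expressed as a small<br>
helper that inspects the engine status dictionary. This is a sketch only:<br>
engine_is_up is a hypothetical name, and it assumes the status is available<br>
as text in the dictionary form quoted above.<br>
<pre>
```python
import ast


def engine_is_up(status_text):
    """Return True if the status text describes a healthy, running engine VM."""
    try:
        status = ast.literal_eval(status_text)
    except (ValueError, SyntaxError):
        return False  # not a parseable literal
    if not isinstance(status, dict):
        return False
    return status.get("health") == "good" and status.get("vm") == "up"
```
</pre>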

<br>

<br>

Cheers,<br>

Daniel<br>

--<br>

<br>

Daniel Helgenberger<br>

m box bewegtbild GmbH<br>

<br>

P: +49/30/2408781-22<br>

F: +49/30/2408781-10<br>

<br>

ACKERSTR. 19<br>

D-10115 BERLIN<br>

<br>

<br>

<a href="http://www.m-box.de" target="_blank">www.m-box.de</a>  <a href="http://www.monkeymen.tv" target="_blank">www.monkeymen.tv</a><br>

<br>

Managing directors: Martin Retschitzegger / Michaela Göllner<br>

Commercial register: Amtsgericht Charlottenburg / HRB 112767<br>


<br></blockquote></div><br><br clear="all"><div><br></div>-- <br>Independence of thought, freedom of spirit.<br>                        -- Chen Yinke

</div>