On Wed, Sep 12, 2018 at 3:49 PM Gianluca Cecchi <gianluca.cecchi(a)gmail.com>
wrote:
On Wed, Sep 12, 2018 at 10:03 AM Simone Tiraboschi
<stirabos(a)redhat.com>
wrote:
>
>
>> Does it mean that I have to run the ansible-playbook command from an
>> external server and use as host in inventory the engine server, or does it
>> mean that the ansible-playbook command is to be run from within the server
>> where the ovirt-engine service is running and so keep intact the lines
>> inside the sample YAML file:
>> "
>> - name: oVirt shutdown environment
>> hosts: localhost
>> connection: local
>> "
>>
>>
> Both options are valid.
>
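> For the remote case the playbook would simply target the engine machine
> from the inventory instead of localhost; just as a rough sketch (the
> engine_* variable names here are my assumption, please check the role
> documentation for the exact ones):
> "
> - name: oVirt shutdown environment
>   hosts: ovmgr42
>   roles:
>     - oVirt.shutdown-env
>   vars:
>     engine_url: https://ovmgr42/ovirt-engine/api          # assumed engine API URL
>     engine_user: admin@internal
>     engine_password: "{{ vault_engine_password }}"        # assumed, e.g. kept in ansible-vault
> "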
Good! It seems it worked OK in shutdown mode (the default one) in a test
hosted-engine based 4.2.6 environment, where I have 2 hosts (both are
hosted-engine hosts), the hosted engine VM and 3 other VMs.
Initially ovnode2 is both the SPM and the host running the HostedEngine VM.
If I run the playbook from inside ovmgr42:
[root@ovmgr42 tests]# ansible-playbook test.yml
[WARNING]: provided hosts list is empty, only localhost is available.
Note that the implicit
localhost does not match 'all'
PLAY [oVirt shutdown environment]
******************************************************************
TASK [oVirt.shutdown-env : Populate service facts]
*************************************************
ok: [localhost]
TASK [oVirt.shutdown-env : Enforce ovirt-engine machine]
*******************************************
skipping: [localhost]
TASK [oVirt.shutdown-env : Enforce ovirt-engine status]
********************************************
skipping: [localhost]
TASK [oVirt.shutdown-env : Login to oVirt]
*********************************************************
ok: [localhost]
TASK [oVirt.shutdown-env : Get hosts]
**************************************************************
ok: [localhost]
TASK [oVirt.shutdown-env : set_fact]
***************************************************************
ok: [localhost]
TASK [oVirt.shutdown-env : Enforce global maintenance mode]
****************************************
skipping: [localhost]
TASK [oVirt.shutdown-env : Warn about HE global maintenace mode]
***********************************
ok: [localhost] => {
"msg": "HE global maintenance mode has been set; you have to exit it
to get the engine VM started when needed\n"
}
TASK [oVirt.shutdown-env : Shutdown of HE hosts]
***************************************************
changed: [localhost] => (item= . . . u'name': u'ovnode1', . . .
u'spm':
{u'priority': 5, u'status': u'none'}})
changed: [localhost] => (item= . . . u'name': u'ovnode2', . . .
u'spm':
{u'priority': 5, u'status': u'spm'}})
TASK [oVirt.shutdown-env : Shutdown engine host/VM]
************************************************
Connection to ovmgr42 closed by remote host.
Connection to ovmgr42 closed.
[g.cecchi@ope46 ~]$
At the end the 2 hosts (HP blades) are in power off state, as expected.
ILO event log of ovnode1:
Last Update Initial Update Count Description
09/12/2018 10:13 09/12/2018 10:13 1 Server power removed.
ILO event log of ovnode2:
Last Update Initial Update Count Description
09/12/2018 10:14 09/12/2018 10:14 1 Server power removed.
Actually, due to time settings, these should be read as 11:13 and 11:14 in my
local time.
In /var/log/libvirt/qemu/HostedEngine.log of node ovnode2
2018-09-11 17:04:16.388+0000: starting up libvirt version: 3.9.0, . . .
hostname: ovnode2
...
2018-09-12 09:11:29.641+0000: shutting down, reason=shutdown
Actually that is 11:11 in my local time (the log timestamp is UTC).
For now I have manually restarted the whole environment.
I began with ovnode2 (which was the SPM and was running the HostedEngine VM at
shutdown time), keeping ovnode1 powered off, and it took some time because I
got some messages like these (to be read bottom up):
Host ovnode1 failed to recover. 9/12/18 2:30:21 PM
Host ovnode1 is non responsive. 9/12/18 2:30:21 PM
...
Host ovnode1 is not responding. It will stay in Connecting state for a
grace period of 60 seconds and after that an attempt to fence the host will
be issued. 9/12/18 2:27:40 PM
Failed to Reconstruct Master Domain for Data Center MYDC42. 9/12/18
2:27:34 PM
VDSM ovnode2 command ConnectStoragePoolVDS failed: Cannot find master
domain: u'spUUID=5af30d59-004c-02f2-01c9-0000000000b8,
sdUUID=cbc308db-5468-4e6d-aabb-f9d133d05de2' 9/12/18 2:27:33 PM
Invalid status on Data Center MYDC42. Setting status to Non Responsive.
9/12/18 2:27:27 PM
...
ETL Service Started 9/12/18 2:26:27 PM
With ovnode1 still powered off, if I try to power it on from the GUI, I get
this in the events pane:
Host ovnode1 became non responsive. Fence operation skipped as the system
is still initializing and this is not a host where hosted engine was
running on previously. 9/12/18 2:30:21 PM
and as popup I get this "operation canceled" window:
https://drive.google.com/file/d/1IWXASJHRylZR6ePWtGUcKiLbYjg__eNS/view?us...
What does this mean?
In the phrase "the system is still initializing and this is not a host
where hosted engine was running", which host does "this" refer to?
We are tracing and discussing it here:
https://bugzilla.redhat.com/show_bug.cgi?id=1609029
As you noticed, after a few minutes everything comes back to Up status, but
the startup phase is really confusing.
We are working on a patch to provide a smoother startup experience, although I
don't see any concrete drawback in the current code.
After some minutes I automatically get (to be read bottom up):
Host ovnode1 power management was verified successfully. 9/12/18 2:40:47 PM
Status of host ovnode1 was set to Up. 9/12/18 2:40:47 PM
..
No faulty multipath paths on host ovnode1 9/12/18 2:40:46 PM
Storage Pool Manager runs on Host ovnode2 (Address: ovnode2), Data Center
MYDC42. 9/12/18 2:37:55 PM
Reconstruct Master Domain for Data Center MYDC42 completed. 9/12/18 2:37:49 PM
..
Host ovnode1 was started by SYSTEM. 9/12/18 2:32:37 PM
Power management start of Host ovnode1 succeeded. 9/12/18 2:32:37 PM
Executing power management status on Host ovnode1 using Proxy Host ovnode2
and Fence Agent ipmilan:172.16.1.52. 9/12/18 2:32:26 PM
Power management start of Host ovnode1 initiated. 9/12/18 2:32:26 PM
Auto fence for host ovnode1 was started. 9/12/18 2:32:26 PM
Storage Domain ISCSI_2TB (Data Center MYDC42) was deactivated by system
because it's not visible by any of the hosts. 9/12/18 2:32:22 PM
..
Executing power management status on Host ovnode1 using Proxy Host ovnode2
and Fence Agent ipmilan:172.16.1.52. 9/12/18 2:32:19 PM
Power management stop of Host ovnode1 initiated. 9/12/18 2:32:17 PM
Executing power management status on Host ovnode1 using Proxy Host ovnode2
and Fence Agent ipmilan:172.16.1.52. 9/12/18 2:32:16 PM
...
Host ovnode1 failed to recover. 9/12/18 2:30:21 PM
Host ovnode1 is non responsive. 9/12/18 2:30:21 PM
My questions are:
- What if for some reason ovnode1 was not available during restart? Would
the system have started the services anyway after some time, or could it
have been a problem?
ovnode1 will stay in a non-operational state until it becomes available again.
In the meantime the engine could elect a different SPM host and so on.
- If I want to try to start the environment through the ansible playbook,
I see that I apparently have to use the "startup" tag, but is it not fully
automated?
As you can see from the playbook, the role requires access to the engine
host or VM but not to each managed host.
This is needed to fetch the hosts list from the engine and to use its power
management capabilities, credentials and so on.
No host details are required for playbook execution.
"
A startup mode is also available:
in the startup mode the role will bring up all the IPMI configured hosts
and it
will unset the global maintenance mode if on an hosted-engine environment.
The startup mode will be executed only if the 'startup' tag is applied;
shutdown mode is the default.
The startup mode requires the engine to be already up.
"
Does the last sentence refer to a non-hosted-engine environment?
No, the engine host should be manually powered on if physical, or at
least one HE host (2 for the hyper-converged case) should be powered on.
Exiting global maintenance mode is up to the user as well.
Otherwise I don't understand "will unset the global maintenance mode if on
an hosted-engine environment".
You can also manually power on the engine VM with hosted-engine --vm-start
on a specific host while still in global maintenance mode.
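So on one of the HE hosts you could run, for instance:
  hosted-engine --vm-start
and later, once you want the HA agents to manage the engine VM again:
  hosted-engine --set-maintenance --mode=none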
Also, with IPMI do you mean the power management feature in general (in my
case I have iLO and not ipmilan), or something else?
yes, power management in general, sorry for the confusion.
Where does it get the facts about the hosts in a hosted-engine environment,
given that the engine is necessarily down if the hosted-engine hosts are
powered down?
That's why the engine should be up.
Thanks in advance for your time
Gianluca