
Hello, I'm testing the single-node HCI with the ovirt-node-ng 4.3.9 ISO. Very nice, with many improvements since the last time I tried it. Good! I have a question about the shutdown procedure for the server. Here are my steps:
- Shut down all VMs (except the engine)
- Put the data and vmstore storage domains into maintenance
- Enable global HA maintenance
- Shut down the engine
- Shut down the hypervisor
It seems that the last step never completes, and I had to forcibly power off the hypervisor. Here is a screenshot of the endlessly repeating failure to unmount /gluster_bricks/engine:
https://drive.google.com/file/d/1ee0HG21XmYVA0t7LYo5hcFx1iLxZdZ-E/view?usp=s...
What would be the right steps to take before the final shutdown of the hypervisor? Thanks, Gianluca
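(For reference, the last three steps above map roughly to these host-side commands; a sketch only, since the storage-domain maintenance step is done from the engine web admin UI:)

# Enable global HA maintenance so the HA agent does not restart the engine VM
hosted-engine --set-maintenance --mode=global
# Ask the engine VM to shut down, then check until it is reported as down
hosted-engine --vm-shutdown
hosted-engine --vm-status
# Finally, shut down the hypervisor itself
shutdown -h now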

On March 24, 2020 1:28:37 PM GMT+02:00, Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
Hello, I'm testing the single node HCI with ovirt-node-ng 4.3.9 iso. Very nice and many improvements over the last time I tried it. Good!
I have a doubt related to shutdown procedure of the server. Here below my steps: - Shutdown all VMs (except engine) - Put into maintenance data and vmstore domains - Enable Global HA Maintenance - Shutdown engine - Shutdown hypervisor
It seems that the last step doesn't end and I had to brutally power off the hypervisor. Here the screenshot regarding infinite failure in unmounting /gluster_bricks/engine
https://drive.google.com/file/d/1ee0HG21XmYVA0t7LYo5hcFx1iLxZdZ-E/view?usp=s...
What would be the right step to do before the final shutdown of hypervisor? Thanks, Gianluca
You can kill gluster via:
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
Of course, you can create a systemd service to do that on power-off.
Best Regards,
Strahil Nikolov
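A minimal sketch of such a unit, assuming the stock script path shipped by the glusterfs package; the unit name and the ordering are assumptions and this is untested here:

# /etc/systemd/system/stop-gluster-on-shutdown.service (hypothetical name)
[Unit]
Description=Stop all Gluster processes cleanly at shutdown
# Started after glusterd, therefore stopped before it on the way down;
# brick mount ordering may need extra After= entries depending on the setup.
After=glusterd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
ExecStop=/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh

[Install]
WantedBy=multi-user.target

# then activate it:
systemctl daemon-reload
systemctl enable --now stop-gluster-on-shutdown.service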

On Tue, Mar 24, 2020 at 7:02 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
What would be the right step to do before the final shutdown of hypervisor? Thanks, Gianluca
You can kill gluster via: /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
Of course, you can create a systemd service to do that on power off
Best Regards, Strahil Nikolov
I will try, thanks Strahil. BTW: I found a similar question on the Gluster mailing list and the same script was referenced... But why hasn't this kind of flow been integrated yet? It seems quite natural to shut down the Gluster processes when you shut down the server... Gianluca

On Tue, Mar 24, 2020 at 1:39 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
Hello, I'm testing the single node HCI with ovirt-node-ng 4.3.9 iso. Very nice and many improvements over the last time I tried it. Good!
I have a doubt related to shutdown procedure of the server. Here below my steps: - Shutdown all VMs (except engine) - Put into maintenance data and vmstore domains - Enable Global HA Maintenance - Shutdown engine
I think the missing part here is stopping the SPM (if it is running on this host) and disconnecting from storage. Both are done when you put a host into maintenance, but in a hosted engine environment this is not possible from the engine, since the engine runs on the storage you want to disconnect.
- Shutdown hypervisor
It seems that the last step doesn't end and I had to brutally power off the hypervisor. Here the screenshot regarding infinite failure in unmounting /gluster_bricks/engine
https://drive.google.com/file/d/1ee0HG21XmYVA0t7LYo5hcFx1iLxZdZ-E/view?usp=s...
What would be the right step to do before the final shutdown of hypervisor?
I think there is an ansible script to do what you need, or some other script. Simone, do you know where the clean shutdown script for an HCI env is? Nir

On Wed, Mar 25, 2020 at 12:36 AM Nir Soffer <nsoffer@redhat.com> wrote:
On Tue, Mar 24, 2020 at 1:39 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
Hello, I'm testing the single node HCI with ovirt-node-ng 4.3.9 iso. Very nice and many improvements over the last time I tried it. Good!
I have a doubt related to shutdown procedure of the server. Here below my steps: - Shutdown all VMs (except engine) - Put into maintenance data and vmstore domains - Enable Global HA Maintenance - Shutdown engine
I think the missing part here is stopping the SPM (if running on this host), and disconnecting from storage.
Yes, it is of course the SPM, because this is a single node HCI environment
Both are done when you put a host to maintenance, but in hosted engine environment this is not possible from engine since engine runs on the storage you want to disconnect.
Indeed. Hence my question and doubts.
- Shutdown hypervisor
It seems that the last step doesn't end and I had to brutally power off the hypervisor. Here the screenshot regarding infinite failure in unmounting /gluster_bricks/engine
https://drive.google.com/file/d/1ee0HG21XmYVA0t7LYo5hcFx1iLxZdZ-E/view?usp=s...
What would be the right step to do before the final shutdown of
hypervisor?
I think there is an ansible script to do what you need, or some other script.
Simone, do you know where the clean shutdown script for HCI env?
Nir
Let's say that in a "standard production" HCI environment with, say, 3 nodes, you have planned maintenance and have to shut down all three nodes: the same would apply when shutting down the last node, but I imagine you also have to do something with Gluster when you shut down the second node, because you no longer have quorum, correct? Thanks, Gianluca

On Wed, Mar 25, 2020 at 1:49 AM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
On Wed, Mar 25, 2020 at 12:36 AM Nir Soffer <nsoffer@redhat.com> wrote:
On Tue, Mar 24, 2020 at 1:39 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
Hello, I'm testing the single node HCI with ovirt-node-ng 4.3.9 iso. Very nice and many improvements over the last time I tried it. Good!
I have a doubt related to shutdown procedure of the server. Here below my steps: - Shutdown all VMs (except engine) - Put into maintenance data and vmstore domains - Enable Global HA Maintenance - Shutdown engine
I think the missing part here is stopping the SPM (if running on this host), and disconnecting from storage.
Yes, it is of course the SPM, because this is a single node HCI environment
Both are done when you put a host to maintenance, but in hosted engine environment this is not possible from engine since engine runs on the storage you want to disconnect.
In fact. From here my question and doubts
Note that stopping the SPM and disconnecting from storage is not the same as stopping the vdsm service. You need to use the vdsm API to do this. This can be done with the vdsm-tool command or with the vdsm client library.
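With the vdsm-client command line this would look roughly like the sketch below. The verb names are taken from the vdsm schema as recalled here, and the exact parameter spelling should be verified with "vdsm-client StoragePool --help" before running anything; treat this as an assumption, not a recipe:

# Find the storage pool (data center) this host is connected to
vdsm-client Host getConnectedStoragePools

# Release the SPM role for that pool (UUID taken from the previous output)
vdsm-client StoragePool spmStop storagepoolID=<pool-uuid>

# Disconnecting from the pool and from the storage servers is done with further
# StoragePool verbs (disconnect, disconnectStorageServer); they take additional
# parameters, so check "vdsm-client StoragePool --help" for the exact invocation.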
- Shutdown hypervisor
It seems that the last step doesn't end and I had to brutally power off the hypervisor. Here the screenshot regarding infinite failure in unmounting /gluster_bricks/engine
https://drive.google.com/file/d/1ee0HG21XmYVA0t7LYo5hcFx1iLxZdZ-E/view?usp=s...
What would be the right step to do before the final shutdown of hypervisor?
I think there is an ansible script to do what you need, or some other script.
Simone, do you know where the clean shutdown script for HCI env?
Nir
Let's say that in a "standard production" HCI environment, with suppose 3 nodes, you have a planned maintenance and you have to shutdown all three nodes: the same applies when you have to shutdown the last node, but I imagine you have also to do something with gluster when you shutdown the second node because you have not quorum anymore, correct?
The flow should be:
1. Put host 1 into maintenance
2. Put host 2 into maintenance
At this point only host 3 is still connected to storage.
3. Stop the SPM
4. Disconnect from storage
At this point there is no gluster mount on any host, so there is no quorum issue. You should be able to shut down the hosts at this point. I guess the gluster services handle shutdown gracefully, like any service should.
Nir
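For reference, steps 1 and 2 can be driven through the engine REST API before the engine itself is shut down, while steps 3 and 4 correspond to the vdsm-client calls sketched earlier in the thread. A sketch with placeholder FQDN, credentials and host id:

# Put a host into maintenance via the engine API ("123" is a placeholder host id;
# replace the FQDN, user and password with real values, or point --cacert at the
# engine CA instead of using --insecure)
curl -s --insecure -u admin@internal:password \
     -H "Content-Type: application/xml" \
     -d "<action/>" \
     https://engine.example.net/ovirt-engine/api/hosts/123/deactivate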

On Wed, Mar 25, 2020 at 1:36 AM Nir Soffer <nsoffer@redhat.com> wrote:
On Tue, Mar 24, 2020 at 1:39 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
Hello, I'm testing the single node HCI with ovirt-node-ng 4.3.9 iso. Very nice and many improvements over the last time I tried it. Good!
I have a doubt related to shutdown procedure of the server. Here below my steps: - Shutdown all VMs (except engine) - Put into maintenance data and vmstore domains - Enable Global HA Maintenance - Shutdown engine
I think the missing part here is stopping the SPM (if running on this host), and disconnecting from storage.
Both are done when you put a host to maintenance, but in hosted engine environment this is not possible from engine since engine runs on the storage you want to disconnect.
- Shutdown hypervisor
It seems that the last step doesn't end and I had to brutally power off the hypervisor. Here the screenshot regarding infinite failure in unmounting /gluster_bricks/engine
https://drive.google.com/file/d/1ee0HG21XmYVA0t7LYo5hcFx1iLxZdZ-E/view?usp=s...
What would be the right step to do before the final shutdown of hypervisor?
I think there is an ansible script to do what you need, or some other script.
Simone, do you know where the clean shutdown script for HCI env?
OK, found it - this issue is https://bugzilla.redhat.com/1609029
Simone provided this to solve the issue:
https://github.com/oVirt/ovirt-ansible-shutdown-env/blob/master/README.md
Nir

On Wed, Mar 25, 2020 at 1:16 AM Nir Soffer <nsoffer@redhat.com> wrote:
OK, found it - this issue is https://bugzilla.redhat.com/1609029
Simone provided this to solve the issue: https://github.com/oVirt/ovirt-ansible-shutdown-env/blob/master/README.md
Nir
Ok, I will try the role provided by Simone and Sandro with my 4.3.9 single HCI host and report back. Thanks again, Nir.
Gianluca
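For anyone else trying the same, a minimal playbook wrapping the role would look roughly like the following; the variable names are taken from memory of the role README linked above, so treat them as assumptions and check the README for the exact ones:

---
- name: Clean shutdown of the oVirt environment
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    engine_url: https://ovengine.example.net/ovirt-engine/api   # placeholder FQDN
    engine_user: admin@internal
    engine_password: "{{ vault_engine_password }}"              # placeholder, e.g. from vault
    engine_cafile: /etc/pki/ovirt-engine/ca.pem
  roles:
    - ovirt.shutdown_env

# run it on the engine VM:
# ansible-playbook shutdown_env.yml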

On Wed, Mar 25, 2020 at 2:49 AM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
On Wed, Mar 25, 2020 at 1:16 AM Nir Soffer <nsoffer@redhat.com> wrote:
OK, found it - this issue is https://bugzilla.redhat.com/1609029
Simone provided this to solve the issue: https://github.com/oVirt/ovirt-ansible-shutdown-env/blob/master/README.md
Nir
Ok, I will try the role provided by Simone and Sandro with my 4.3.9 single HCI host and report.
Looking at the bug comments, I'm not sure this ansible script addresses the issues you reported. Please file a bug if you still see these issues when using the script. We may need to solve this in vdsm-tool, adding an easy way to stop the SPM and disconnect from storage cleanly. When we have such a way, the ansible script can use it. Nir

On Fri, Mar 27, 2020 at 9:07 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Wed, Mar 25, 2020 at 2:49 AM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
On Wed, Mar 25, 2020 at 1:16 AM Nir Soffer <nsoffer@redhat.com> wrote:
OK, found it - this issue is https://bugzilla.redhat.com/1609029
Simone provided this to solve the issue:
https://github.com/oVirt/ovirt-ansible-shutdown-env/blob/master/README.md
Nir
Ok, I will try the role provided by Simone and Sandro with my 4.3.9 single HCI host and report.
Looking at the bug comments, I'm not sure this ansible script address the issues you reported. Please file a bug if you still see these issues when using the script.
We may need to solve this in vdsm-tool, adding an easy way to stop the spm and disconnect from storage cleanly. When we have such way the ansible script can use it.
Nir
I would like to come back to this item. I tried it both on a physical environment and on a nested one, both of them single-host HCI 4.3.9 with Gluster. The playbook has to be executed on the engine.
The high-level steps of the role (some of which don't apply to my single-host environment) are:
- clean shutdown of all VMs except the engine (any errors are ignored)
- forced shutdown of all VMs with status != down, except the engine
- shutdown (fencing) of the non-hosted-engine hosts that have power management configured, one by one
- shutdown (in parallel, asynchronously, fire and forget) of the remaining non-hosted-engine hosts via "ssh shutdown -h now", once no qemu-kvm process is present on them any more (any errors are ignored)
- set global maintenance. Note: this task is run multiple times because it is run on every hosted engine host, which shouldn't be necessary... but it doesn't hurt (any errors are ignored, but the result is registered to be checked afterwards)
- fail the job and stop if none of the "set global maintenance" commands executed on the hosted engine hosts succeeded
- shutdown (in parallel, asynchronously, fire and forget) of the hosted engine hosts via ssh. For the hosts without the engine VM running, the command is, without any further waiting: sanlock client shutdown -f 1 ; shutdown -h now. For the host with the engine VM running on it, the command is the same, but it first waits until the engine VM status is no longer up (using the "hosted-engine --vm-status" command)
- shutdown -h now on the engine (in async mode, fire and forget)

Snippet for the hosted engine hosts shutdown:

- name: Shutdown of HE hosts
  command: >-
    ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no
    -i /etc/pki/ovirt-engine/keys/engine_id_rsa -p {{ item.ssh.port }}
    -t root@{{ item.address }}
    '{{ he_shutdown_cmd }}'
  async: 1000
  poll: 0
  with_items:
    - "{{ he_hosts }}"

where the he_shutdown_cmd var is defined as:

he_shutdown_cmd: >-
  while hosted-engine --vm-status | grep "\"vm\": \"up\"" >/dev/null;
    do sleep 5;
  done;
  sanlock client shutdown -f 1;
  shutdown -h now
Snippet for the engine VM shutdown:

- name: Shutdown engine host/VM
  command: shutdown -h now
  async: 1000
  poll: 0

In my case, running the playbook, I get this in the final lines of output:

TASK [ovirt.shutdown_env : Shutdown engine host/VM] ***************************************************************************
changed: [localhost]

TASK [ovirt.shutdown_env : Power-on IPMI configured hosts] ********************************************************************

TASK [ovirt.shutdown_env : Unset global maintenance mode] *********************************************************************
Connection to novengine.example.net closed by remote host.
Connection to novengine.example.net closed.

In general the engine VM completes its shutdown, and then the host is able to complete its own shutdown once the engine has powered off. Anyway, I see these gluster umount errors on the host console:
https://drive.google.com/file/d/1uMGkPa6eyJy7RVNC2K5G66zCpyzeeuBo/view?usp=s...
I have tried several times on the physical and nested environments, and at least once it failed on both: the engine VM was always able to complete its shutdown, while the host remained powered on, as if it didn't get the shutdown command, or the "sanlock client shutdown -f 1" command didn't complete...
Here below is an example where the engine stops at 13:00 and I see this in the host's /var/log/messages, with gluster error notifications every 5 minutes (13:05, 13:10, ...):

Apr 6 13:00:22 ovirt systemd-machined: Machine qemu-1-HostedEngine terminated.
Apr 6 13:00:22 ovirt firewalld[8536]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-out -m physdev --physdev-is-bridged --physdev-out vnet0 -g FP-vnet0' failed: iptables v1.4.21: goto 'FP-vnet0' is not a chain#012#012Try `iptables -h' or 'iptables --help' for more information.
...
Apr 6 13:00:22 ovirt firewalld[8536]: WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist.
Apr 6 13:00:22 ovirt vdsm[11038]: WARN File: /var/lib/libvirt/qemu/channels/e4da7514-f020-4c26-b088-f870fd66f4e5.ovirt-guest-agent.0 already removed
Apr 6 13:00:22 ovirt vdsm[11038]: WARN Attempting to remove a non existing network: ovirtmgmt/e4da7514-f020-4c26-b088-f870fd66f4e5
Apr 6 13:00:22 ovirt vdsm[11038]: WARN Attempting to remove a non existing net user: ovirtmgmt/e4da7514-f020-4c26-b088-f870fd66f4e5
Apr 6 13:00:22 ovirt vdsm[11038]: WARN File: /var/lib/libvirt/qemu/channels/e4da7514-f020-4c26-b088-f870fd66f4e5.org.qemu.guest_agent.0 already removed
Apr 6 13:00:22 ovirt vdsm[11038]: WARN File: /var/run/ovirt-vmconsole-console/e4da7514-f020-4c26-b088-f870fd66f4e5.sock already removed
Apr 6 13:01:01 ovirt systemd: Started Session 10 of user root.
Apr 6 13:05:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06 11:05:06.817078] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-engine-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:05:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06 11:05:06.817688] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-engine-server: 58066: WRITEV 3 (f45e0558-11d6-46f0-a8cb-b44a8aa41cf6), client: CTX_ID:f6d514db-475f-43fe-93c5-0092ead0cf6e-GRAPH_ID:0-PID:11359-HOST:ovirt.mydomain.local-PC_NAME:engine-client-0-RECON_NO:-0, error-xlator: engine-posix [Invalid argument]
Apr 6 13:05:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06 11:05:19.418662] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-vmstore-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:05:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06 11:05:19.418715] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-vmstore-server: 42559: WRITEV 2 (82008e82-b4cf-4ad1-869c-5dd63b12d8a5), client: CTX_ID:c86a0177-7d35-4ff3-96e3-16e680b23256-GRAPH_ID:0-PID:18259-HOST:ovirt.mydomain.local-PC_NAME:vmstore-client-0-RECON_NO:-0, error-xlator: vmstore-posix [Invalid argument]
Apr 6 13:05:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06 11:05:19.423491] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-data-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:05:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06 11:05:19.423532] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-data-server: 1497: WRITEV 1 (8ef85381-ad7b-4aa3-a845-f285a3faa0c8), client: CTX_ID:7e82fb0d-6faf-47b4-a41c-f52cdd4cb667-GRAPH_ID:0-PID:18387-HOST:ovirt.mydomain.local-PC_NAME:data-client-0-RECON_NO:-0, error-xlator: data-posix [Invalid argument]
Apr 6 13:10:01 ovirt systemd: Started Session 11 of user root.
Apr 6 13:10:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06 11:10:06.954644] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-engine-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:10:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06 11:10:06.954865] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-engine-server: 61538: WRITEV 3 (eb89ab39-5d73-4589-b4e1-5b76f5d3d16f), client: CTX_ID:f6d514db-475f-43fe-93c5-0092ead0cf6e-GRAPH_ID:0-PID:11359-HOST:ovirt.mydomain.local-PC_NAME:engine-client-0-RECON_NO:-0, error-xlator: engine-posix [Invalid argument]
Apr 6 13:10:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06 11:10:19.493158] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-data-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:10:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06 11:10:19.493214] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-data-server: 1974: WRITEV 1 (e3df4047-cd60-433a-872a-18771da260a0), client: CTX_ID:7e82fb0d-6faf-47b4-a41c-f52cdd4cb667-GRAPH_ID:0-PID:18387-HOST:ovirt.mydomain.local-PC_NAME:data-client-0-RECON_NO:-0, error-xlator: data-posix [Invalid argument]
Apr 6 13:10:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06 11:10:19.493870] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-vmstore-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:10:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06 11:10:19.493917] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-vmstore-server: 43039: WRITEV 2 (3487143b-0c49-4c63-8564-18ce2fe8cf82), client: CTX_ID:c86a0177-7d35-4ff3-96e3-16e680b23256-GRAPH_ID:0-PID:18259-HOST:ovirt.mydomain.local-PC_NAME:vmstore-client-0-RECON_NO:-0, error-xlator: vmstore-posix [Invalid argument]

Could it be useful to insert the /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh command, as suggested by Strahil, after the sanlock one, in the case of a GlusterFS domain?

Also, one further note: the role uses the ovirt_host_facts module. I get this when I use it and debug its content:

{
  "msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts",
  "version": "2.13"
}

So perhaps it should be considered to switch to ovirt_host_info instead? Any plans?
Thanks for reading.
Gianluca
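For reference, the renamed module would be used roughly like this (a sketch; ovirt_host_info registers its result instead of setting ansible_facts, which is the change the role would have to absorb; the auth variable is assumed to come from the ovirt_auth module, however the role actually obtains it):

- name: Collect host information
  ovirt_host_info:
    auth: "{{ ovirt_auth }}"
    pattern: "status=up"          # placeholder search pattern
  register: host_info

- name: Show the hosts that were found
  debug:
    var: host_info.ovirt_hosts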

On Wed, Apr 15, 2020 at 4:52 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote: [snip]
Snippet for hosted engine hosts shutdown:
- name: Shutdown of HE hosts
  command: >-
    ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no
    -i /etc/pki/ovirt-engine/keys/engine_id_rsa -p {{ item.ssh.port }}
    -t root@{{ item.address }}
    '{{ he_shutdown_cmd }}'
  async: 1000
  poll: 0
  with_items:
    - "{{ he_hosts }}"
where the he_shutdown_cmd var is defined as:
he_shutdown_cmd: >-
  while hosted-engine --vm-status | grep "\"vm\": \"up\"" >/dev/null;
    do sleep 5;
  done;
  sanlock client shutdown -f 1;
  shutdown -h now
Snippet for the Engine VM shutdown
- name: Shutdown engine host/VM
  command: shutdown -h now
  async: 1000
  poll: 0
[snip]
Could it be useful to insert the /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh command, as suggested by Strahil, after the sanlock one, in case of GlusterFS domain?
Also one further note: inside the role it is used the ovirt_host_facts module. I get this when I use it and debug its content: { "msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts", "version": "2.13" }
So perhaps it should be considered to change and use ovirt_host_info instead? Any plan?
Thanks for reading.
Gianluca
Hello, I would like to follow up on this to get a better experience. The environment is a physical 4.3.10 single-host HCI that shows the same problems as above. So I modified the role file, adding the gluster stop script after the sanlock shutdown:

[root@ovengine tasks]# pwd
/root/roles/ovirt.shutdown_env/tasks
[root@ovengine tasks]# diff main.yml main.yml.orig
79d78
< /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
[root@ovengine tasks]#

Now the poweroff completes, even if I get these errors about stopping the swap and gluster brick filesystems:
https://drive.google.com/file/d/1oh0sNC3ta5qP0KAcibTdDc5N_lpil8pS/view?usp=s...
When I later power the server on again, the environment starts OK in global maintenance, and when I exit it, the engine and the gluster volumes start OK. There is, however, a 2-3 minute delay (I already opened a thread about this) between the moment the storage domains appear up in the web admin GUI and the moment they are truly up (the file systems under /rhev/data-center/mnt/glusterSD/... mounted). So if in the meantime you try to start a VM, you get an error because its disks are not found...
Comments?
Gianluca
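Assuming, as the diff above suggests, that the he_shutdown_cmd definition lives in tasks/main.yml, the modified variable presumably ends up looking roughly like this (a reconstruction, not the literal file content):

he_shutdown_cmd: >-
  while hosted-engine --vm-status | grep "\"vm\": \"up\"" >/dev/null;
    do sleep 5;
  done;
  sanlock client shutdown -f 1;
  /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh;
  shutdown -h now
# Note: if the folded scalar is joined into a one-liner, the added line needs the
# trailing semicolon shown here, which the quoted diff output does not display.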
participants (3)
- Gianluca Cecchi
- Nir Soffer
- Strahil Nikolov