On Fri, Mar 27, 2020 at 9:07 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Wed, Mar 25, 2020 at 2:49 AM Gianluca Cecchi
<gianluca.cecchi@gmail.com> wrote:
>
> On Wed, Mar 25, 2020 at 1:16 AM Nir Soffer <nsoffer@redhat.com> wrote:
>>
>>
>>
>> OK, found it - this issue is
>> https://bugzilla.redhat.com/1609029
>>
>> Simone provided this to solve the issue:
>> https://github.com/oVirt/ovirt-ansible-shutdown-env/blob/master/README.md
>>
>> Nir
>>
>
> Ok, I will try the role provided by Simone and Sandro with my 4.3.9 single HCI host and report.

Looking at the bug comments, I'm not sure this ansible script addresses
the issues you reported. Please file a bug if you still see these issues
when using the script.

We may need to solve this in vdsm-tool, adding an easy way to stop the
SPM and disconnect from storage cleanly. When we have such a way, the
ansible script can use it.

Nir


I would like to come back to this item.
I tried it both on a physical environment and on a nested one, both consisting of a single-host HCI 4.3.9 setup with Gluster.

The playbook has to be executed on the engine.
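That is, it is invoked from a playbook run on the engine machine itself, along these lines (just a sketch of the invocation; the engine connection variables are documented in the role's README):

        ---
        - name: oVirt shutdown environment
          hosts: localhost
          connection: local
          # engine connection variables go here, as per the role's README
          roles:
            - ovirt.shutdown_env
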
The high-level steps of the role (some of which don't apply to my single-host environment) are:

- clean shutdown of all VMs except the engine (any errors are ignored)
- forced shutdown of all VMs whose status is != down, except the engine
- shutdown (fence) of the non hosted-engine hosts configured with power management, one by one
- shutdown (in parallel, asynchronously, fire and forget) of the remaining non hosted-engine hosts via "ssh shutdown -h now", once no qemu-kvm process is present on them any more (any errors are ignored)
- set global maintenance (see the sketch right after this list)
Note: this task is run multiple times, because it is executed on every hosted-engine host, which shouldn't be necessary... but it doesn't hurt
(any errors are ignored, but the result is registered so it can be checked afterwards)
- fail the job and stop if none of the "set global maintenance" commands executed on the hosted-engine hosts succeeded
- shutdown (in parallel, asynchronously, fire and forget) of the hosted-engine hosts via ssh
For the hosts not running the engine VM the command is, without any further waiting:
sanlock client shutdown -f 1 ; shutdown -h now
For the host running the engine VM the command is the same, but it first waits until the engine VM status is no longer "up" (checked with the "hosted-engine --vm-status" command)
- shutdown -h now on the engine (in async mode, fire and forget)
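
For reference, I believe the global maintenance step boils down to running something like the following on each hosted-engine host (just a sketch of my understanding, not a verbatim copy of the role; the task name and register variable are my own):

        - name: Set global maintenance mode
          command: hosted-engine --set-maintenance --mode=global
          ignore_errors: true
          register: globalmaintenance_result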

Snippet for hosted engine hosts shutdown:

        - name: Shutdown of HE hosts
          command: >-
            ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no
            -i /etc/pki/ovirt-engine/keys/engine_id_rsa -p {{ item.ssh.port }}
            -t root@{{ item.address }} '{{ he_shutdown_cmd }}'
          async: 1000
          poll: 0
          with_items:
            - "{{ he_hosts }}"

where the he_shutdown_cmd var is defined as:

        he_shutdown_cmd: >-
          while hosted-engine --vm-status | grep "\"vm\": \"up\"" >/dev/null;
          do sleep 5;
          done;
          sanlock client shutdown -f 1;
          shutdown -h now

Snippet for the engine VM shutdown:

        - name: Shutdown engine host/VM
          command: shutdown -h now
          async: 1000
          poll: 0

In my case, running the playbook, I get this in the final lines of the output:

TASK [ovirt.shutdown_env : Shutdown engine host/VM] ***************************************************************************
changed: [localhost]

TASK [ovirt.shutdown_env : Power-on IPMI configured hosts] ********************************************************************

TASK [ovirt.shutdown_env : Unset global maintenance mode] *********************************************************************
Connection to novengine.example.net closed by remote host.
Connection to novengine.example.net closed.

In general the engine VM completes its shutdown and then the host is able to complete its own shutdown once the engine has powered off.
Anyway, I see these Gluster umount errors on the host's console:
https://drive.google.com/file/d/1uMGkPa6eyJy7RVNC2K5G66zCpyzeeuBo/view?usp=sharing

I have tried several times on both the physical and the nested environment, and at least once it failed on each: the engine VM was always able to complete its shutdown, while the host remained powered on, as if it never received the shutdown command, or as if the "sanlock client shutdown -f 1" command didn't complete...

Below is an example where the engine stops at 13:00 and I then see this in the host's /var/log/messages, with Gluster errors repeating every 5 minutes (13:05, 13:10, ...):

Apr  6 13:00:22 ovirt systemd-machined: Machine qemu-1-HostedEngine terminated.
Apr  6 13:00:22 ovirt firewalld[8536]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-out -m physdev --physdev-is-bridged --physdev-out vnet0 -g FP-vnet0' failed: iptables v1.4.21: goto 'FP-vnet0' is not a chain#012#012Try `iptables -h' or 'iptables --help' for more information.
...
Apr  6 13:00:22 ovirt firewalld[8536]: WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist.
Apr  6 13:00:22 ovirt vdsm[11038]: WARN File: /var/lib/libvirt/qemu/channels/e4da7514-f020-4c26-b088-f870fd66f4e5.ovirt-guest-agent.0 already removed
Apr  6 13:00:22 ovirt vdsm[11038]: WARN Attempting to remove a non existing network: ovirtmgmt/e4da7514-f020-4c26-b088-f870fd66f4e5
Apr  6 13:00:22 ovirt vdsm[11038]: WARN Attempting to remove a non existing net user: ovirtmgmt/e4da7514-f020-4c26-b088-f870fd66f4e5
Apr  6 13:00:22 ovirt vdsm[11038]: WARN File: /var/lib/libvirt/qemu/channels/e4da7514-f020-4c26-b088-f870fd66f4e5.org.qemu.guest_agent.0 already removed
Apr  6 13:00:22 ovirt vdsm[11038]: WARN File: /var/run/ovirt-vmconsole-console/e4da7514-f020-4c26-b088-f870fd66f4e5.sock already removed
Apr  6 13:01:01 ovirt systemd: Started Session 10 of user root.
Apr  6 13:05:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06 11:05:06.817078] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-engine-posix: write failed: offset 0, [Invalid argument]
Apr  6 13:05:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06 11:05:06.817688] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-engine-server: 58066: WRITEV 3 (f45e0558-11d6-46f0-a8cb-b44a8aa41cf6), client: CTX_ID:f6d514db-475f-43fe-93c5-0092ead0cf6e-GRAPH_ID:0-PID:11359-HOST:ovirt.mydomain.local-PC_NAME:engine-client-0-RECON_NO:-0, error-xlator: engine-posix [Invalid argument]
Apr  6 13:05:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06 11:05:19.418662] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-vmstore-posix: write failed: offset 0, [Invalid argument]
Apr  6 13:05:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06 11:05:19.418715] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-vmstore-server: 42559: WRITEV 2 (82008e82-b4cf-4ad1-869c-5dd63b12d8a5), client: CTX_ID:c86a0177-7d35-4ff3-96e3-16e680b23256-GRAPH_ID:0-PID:18259-HOST:ovirt.mydomain.local-PC_NAME:vmstore-client-0-RECON_NO:-0, error-xlator: vmstore-posix [Invalid argument]
Apr  6 13:05:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06 11:05:19.423491] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-data-posix: write failed: offset 0, [Invalid argument]
Apr  6 13:05:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06 11:05:19.423532] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-data-server: 1497: WRITEV 1 (8ef85381-ad7b-4aa3-a845-f285a3faa0c8), client: CTX_ID:7e82fb0d-6faf-47b4-a41c-f52cdd4cb667-GRAPH_ID:0-PID:18387-HOST:ovirt.mydomain.local-PC_NAME:data-client-0-RECON_NO:-0, error-xlator: data-posix [Invalid argument]
Apr  6 13:10:01 ovirt systemd: Started Session 11 of user root.
Apr  6 13:10:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06 11:10:06.954644] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-engine-posix: write failed: offset 0, [Invalid argument]
Apr  6 13:10:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06 11:10:06.954865] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-engine-server: 61538: WRITEV 3 (eb89ab39-5d73-4589-b4e1-5b76f5d3d16f), client: CTX_ID:f6d514db-475f-43fe-93c5-0092ead0cf6e-GRAPH_ID:0-PID:11359-HOST:ovirt.mydomain.local-PC_NAME:engine-client-0-RECON_NO:-0, error-xlator: engine-posix [Invalid argument]
Apr  6 13:10:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06 11:10:19.493158] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-data-posix: write failed: offset 0, [Invalid argument]
Apr  6 13:10:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06 11:10:19.493214] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-data-server: 1974: WRITEV 1 (e3df4047-cd60-433a-872a-18771da260a0), client: CTX_ID:7e82fb0d-6faf-47b4-a41c-f52cdd4cb667-GRAPH_ID:0-PID:18387-HOST:ovirt.mydomain.local-PC_NAME:data-client-0-RECON_NO:-0, error-xlator: data-posix [Invalid argument]
Apr  6 13:10:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06 11:10:19.493870] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-vmstore-posix: write failed: offset 0, [Invalid argument]
Apr  6 13:10:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06 11:10:19.493917] E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-vmstore-server: 43039: WRITEV 2 (3487143b-0c49-4c63-8564-18ce2fe8cf82), client: CTX_ID:c86a0177-7d35-4ff3-96e3-16e680b23256-GRAPH_ID:0-PID:18259-HOST:ovirt.mydomain.local-PC_NAME:vmstore-client-0-RECON_NO:-0, error-xlator: vmstore-posix [Invalid argument]

Could it be useful, in case of a GlusterFS domain, to insert the /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh command, as suggested by Strahil, after the sanlock one?
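
Something like this, I mean, for the hosted-engine hosts command (just a rough, untested sketch of the idea, only relevant when the storage is Gluster):

        he_shutdown_cmd: >-
          while hosted-engine --vm-status | grep "\"vm\": \"up\"" >/dev/null;
          do sleep 5;
          done;
          sanlock client shutdown -f 1;
          /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh;
          shutdown -h now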

One further note:
the role uses the ovirt_host_facts module. This is what I get when I use it and debug its content:
            {
                "msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts",
                "version": "2.13"
            }

So perhaps it should be considered to switch to ovirt_host_info instead? Any plans?
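
For illustration, a possible replacement could look like the following (just a sketch based on the module documentation; the search pattern is only an example and the task/variable names are mine):

        - name: Get hosts
          ovirt_host_info:
            auth: "{{ ovirt_auth }}"
            pattern: "status=up"   # example filter only
          register: host_info

        - name: Print hosts
          debug:
            var: host_info.ovirt_hosts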

Thanks for reading.

Gianluca