On Fri, Mar 27, 2020 at 9:07 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
> On Wed, Mar 25, 2020 at 2:49 AM Gianluca Cecchi
> <gianluca.cecchi(a)gmail.com> wrote:
> >
> > On Wed, Mar 25, 2020 at 1:16 AM Nir Soffer <nsoffer(a)redhat.com> wrote:
> >>
> >> OK, found it - this issue is
> >> https://bugzilla.redhat.com/1609029
> >>
> >> Simone provided this to solve the issue:
> >> https://github.com/oVirt/ovirt-ansible-shutdown-env/blob/master/README.md
> >>
> >> Nir
> >
> > Ok, I will try the role provided by Simone and Sandro with my 4.3.9
> > single HCI host and report.
>
> Looking at the bug comments, I'm not sure this ansible script addresses
> the issues you reported. Please file a bug if you still see these issues
> when using the script.
>
> We may need to solve this in vdsm-tool, adding an easy way to stop the
> SPM and disconnect from storage cleanly. When we have such a way the
> ansible script can use it.
>
> Nir
I would like to come back to this item.
I tried the role both on a physical environment and on a nested one, both
of them a single-host HCI 4.3.9 setup with Gluster.
The playbook has to be executed on the engine.
The high-level steps of the role (some of which don't apply to my
single-host environment) are:
- clean shutdown of all VMs except the engine (any errors are ignored)
- forced shutdown of all VMs still in a status != down, except the engine
- shutdown (fence) of the non hosted-engine hosts configured with power
mgmt, one by one
- shutdown (in parallel, asynchronously, fire and forget) of the remaining
non hosted-engine hosts via "ssh shutdown -h now", once no qemu-kvm process
is present on them any more (any errors are ignored)
- set global maintenance (a sketch of such a task follows this list)
Note: this task is run on every hosted-engine host, so it executes multiple
times even though that shouldn't be necessary... but it doesn't hurt...
(any errors are ignored, but the result is registered to be checked
afterwards)
- fail the job and stop if none of the set-global-maintenance commands
executed on the hosted-engine hosts succeeded
- shutdown (in parallel, asynchronously, fire and forget) of the
hosted-engine hosts via ssh.
For the hosts not running the engine VM the command is executed without
any further waiting:
sanlock client shutdown -f 1 ; shutdown -h now
For the host running the engine VM the command is the same, but it first
waits until the engine VM status is no longer up (checked with the
"hosted-engine --vm-status" command).
- shutdown -h now on the engine (in async mode, fire and forget)
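
For reference, a minimal sketch of how the global maintenance step could be
expressed (my own illustration of the description above, not the role's
actual code; the task name and register variable are made up):

  - name: Set global maintenance mode
    command: hosted-engine --set-maintenance --mode=global
    ignore_errors: true
    register: global_maintenance_result
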
Snippet for hosted engine hosts shutdown:

  - name: Shutdown of HE hosts
    command: >-
      ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no
      -i /etc/pki/ovirt-engine/keys/engine_id_rsa -p {{ item.ssh.port }}
      -t root@{{ item.address }} '{{ he_shutdown_cmd }}'
    async: 1000
    poll: 0
    with_items:
      - "{{ he_hosts }}"
where the he_shutdown_cmd var is defined as:

  he_shutdown_cmd: >-
    while hosted-engine --vm-status | grep "\"vm\": \"up\"" >/dev/null;
    do sleep 5;
    done;
    sanlock client shutdown -f 1;
    shutdown -h now
Snippet for the engine VM shutdown:

  - name: Shutdown engine host/VM
    command: shutdown -h now
    async: 1000
    poll: 0
In my case, running the playbook, I get this in the final lines of output:
TASK [ovirt.shutdown_env : Shutdown engine host/VM] ***************************
changed: [localhost]

TASK [ovirt.shutdown_env : Power-on IPMI configured hosts] ********************

TASK [ovirt.shutdown_env : Unset global maintenance mode] *********************

Connection to novengine.example.net closed by remote host.
Connection to novengine.example.net closed.
In general the engine VM completes its shutdown and then the host is able
to complete its own shutdown once the engine has powered off.
Anyway, I see these Gluster umount errors on the host console:
https://drive.google.com/file/d/1uMGkPa6eyJy7RVNC2K5G66zCpyzeeuBo/view?us...
I have tried several times on both the physical and the nested environment,
and at least once it failed on each of them: the engine VM was always able
to complete its shutdown, while the host remained powered on, as if it
never received the shutdown command, or the "sanlock client shutdown -f 1"
command didn't complete...
Below is an example where the engine stops at 13:00 and I then see this in
the host's /var/log/messages, with Gluster error notifications every 5
minutes (13:05, 13:10, ...):
Apr 6 13:00:22 ovirt systemd-machined: Machine qemu-1-HostedEngine
terminated.
Apr 6 13:00:22 ovirt firewalld[8536]: WARNING: COMMAND_FAILED:
'/usr/sbin/iptables -w10 -w -D libvirt-out -m physdev --physdev-is-bridged
--physdev-out vnet0 -g FP-vnet0' failed: iptables v1.4.21: goto 'FP-vnet0'
is not a chain#012#012Try `iptables -h' or 'iptables --help' for more
information.
...
Apr 6 13:00:22 ovirt firewalld[8536]: WARNING: COMMAND_FAILED:
'/usr/sbin/ebtables --concurrent -t nat -X libvirt-O-vnet0' failed: Chain
'libvirt-O-vnet0' doesn't exist.
Apr 6 13:00:22 ovirt vdsm[11038]: WARN File:
/var/lib/libvirt/qemu/channels/e4da7514-f020-4c26-b088-f870fd66f4e5.ovirt-guest-agent.0
already removed
Apr 6 13:00:22 ovirt vdsm[11038]: WARN Attempting to remove a non existing
network: ovirtmgmt/e4da7514-f020-4c26-b088-f870fd66f4e5
Apr 6 13:00:22 ovirt vdsm[11038]: WARN Attempting to remove a non existing
net user: ovirtmgmt/e4da7514-f020-4c26-b088-f870fd66f4e5
Apr 6 13:00:22 ovirt vdsm[11038]: WARN File:
/var/lib/libvirt/qemu/channels/e4da7514-f020-4c26-b088-f870fd66f4e5.org.qemu.guest_agent.0
already removed
Apr 6 13:00:22 ovirt vdsm[11038]: WARN File:
/var/run/ovirt-vmconsole-console/e4da7514-f020-4c26-b088-f870fd66f4e5.sock
already removed
Apr 6 13:01:01 ovirt systemd: Started Session 10 of user root.
Apr 6 13:05:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06
11:05:06.817078] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev]
0-engine-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:05:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06
11:05:06.817688] E [MSGID: 115067]
[server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-engine-server: 58066:
WRITEV 3 (f45e0558-11d6-46f0-a8cb-b44a8aa41cf6), client:
CTX_ID:f6d514db-475f-43fe-93c5-0092ead0cf6e-GRAPH_ID:0-PID:11359-HOST:ovirt.mydomain.local-PC_NAME:engine-client-0-RECON_NO:-0,
error-xlator: engine-posix [Invalid argument]
Apr 6 13:05:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06
11:05:19.418662] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev]
0-vmstore-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:05:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06
11:05:19.418715] E [MSGID: 115067]
[server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-vmstore-server: 42559:
WRITEV 2 (82008e82-b4cf-4ad1-869c-5dd63b12d8a5), client:
CTX_ID:c86a0177-7d35-4ff3-96e3-16e680b23256-GRAPH_ID:0-PID:18259-HOST:ovirt.mydomain.local-PC_NAME:vmstore-client-0-RECON_NO:-0,
error-xlator: vmstore-posix [Invalid argument]
Apr 6 13:05:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06
11:05:19.423491] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev]
0-data-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:05:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06
11:05:19.423532] E [MSGID: 115067]
[server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-data-server: 1497: WRITEV
1 (8ef85381-ad7b-4aa3-a845-f285a3faa0c8), client:
CTX_ID:7e82fb0d-6faf-47b4-a41c-f52cdd4cb667-GRAPH_ID:0-PID:18387-HOST:ovirt.mydomain.local-PC_NAME:data-client-0-RECON_NO:-0,
error-xlator: data-posix [Invalid argument]
Apr 6 13:10:01 ovirt systemd: Started Session 11 of user root.
Apr 6 13:10:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06
11:10:06.954644] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev]
0-engine-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:10:06 ovirt gluster_bricks-engine-engine[10651]: [2020-04-06
11:10:06.954865] E [MSGID: 115067]
[server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-engine-server: 61538:
WRITEV 3 (eb89ab39-5d73-4589-b4e1-5b76f5d3d16f), client:
CTX_ID:f6d514db-475f-43fe-93c5-0092ead0cf6e-GRAPH_ID:0-PID:11359-HOST:ovirt.mydomain.local-PC_NAME:engine-client-0-RECON_NO:-0,
error-xlator: engine-posix [Invalid argument]
Apr 6 13:10:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06
11:10:19.493158] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev]
0-data-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:10:19 ovirt gluster_bricks-data-data[10620]: [2020-04-06
11:10:19.493214] E [MSGID: 115067]
[server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-data-server: 1974: WRITEV
1 (e3df4047-cd60-433a-872a-18771da260a0), client:
CTX_ID:7e82fb0d-6faf-47b4-a41c-f52cdd4cb667-GRAPH_ID:0-PID:18387-HOST:ovirt.mydomain.local-PC_NAME:data-client-0-RECON_NO:-0,
error-xlator: data-posix [Invalid argument]
Apr 6 13:10:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06
11:10:19.493870] E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev]
0-vmstore-posix: write failed: offset 0, [Invalid argument]
Apr 6 13:10:19 ovirt gluster_bricks-vmstore-vmstore[10687]: [2020-04-06
11:10:19.493917] E [MSGID: 115067]
[server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-vmstore-server: 43039:
WRITEV 2 (3487143b-0c49-4c63-8564-18ce2fe8cf82), client:
CTX_ID:c86a0177-7d35-4ff3-96e3-16e680b23256-GRAPH_ID:0-PID:18259-HOST:ovirt.mydomain.local-PC_NAME:vmstore-client-0-RECON_NO:-0,
error-xlator: vmstore-posix [Invalid argument]
Could it be useful to insert the
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh command, as
suggested by Strahil, after the sanlock one, in case of a GlusterFS-based
domain? Something like the sketch below.
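
To make the idea concrete, a possible variant of the he_shutdown_cmd
variable with the Gluster stop script added between the sanlock call and
the final shutdown (my own untested sketch, not part of the role):

  # untested variant: also stop the gluster processes before powering off
  he_shutdown_cmd: >-
    while hosted-engine --vm-status | grep "\"vm\": \"up\"" >/dev/null;
    do sleep 5;
    done;
    sanlock client shutdown -f 1;
    /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh;
    shutdown -h now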
Also one further note:
the role uses the ovirt_host_facts module. I get this deprecation message
when I use it and debug its content:

  {
      "msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts",
      "version": "2.13"
  }
So perhaps the role should be changed to use ovirt_host_info instead? Any
plan? A possible form of the replacement task is sketched below.
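
For example, something along these lines (only a rough sketch of how the
renamed module could be used; the pattern filter and variable names are
illustrative, and I haven't checked how the role consumes the data
downstream):

  - name: Collect host information with the renamed module
    ovirt_host_info:
      pattern: "status=up"
      auth: "{{ ovirt_auth }}"
    register: host_info

  # the host list would then be read from host_info.ovirt_hosts instead of
  # the ansible_facts previously returned by ovirt_host_facts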
Thanks for reading.
Gianluca