
On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal <mzamazal@redhat.com> wrote:
Hi,
Michal has observed occasional OST failures in test_vdsm_recovery last days, which hadn't been seen before. When `systemctl stop vdsmd' is called (via Ansible) there, vdsmd (almost?) never finishes its shutdown within the 10 seconds timeout and gets then killed with SIGKILL. If this action is accompanied by "Job for vdsmd.service canceled." message then the test fails; otherwise OST continues normally.
The situation is reproducible by running OST basic-suite-master and making it artificially failing after test_vdsm_recovery. Then running `systemctl stop vdsmd' manually on the given OST host (can be done repeatedly, so it provides a good opportunity to examine the problem).
There are two problems there:
- "Job for vdsmd.service canceled." message that sometimes occurs after `systemctl stop vdsmd' and then the test fails. I don't know what it means and I can't identify any difference in journal between when the message occurs and when it doesn't.
This is likely caused by hosted engine agent discovering that hosted engine storage domain was deactivated and panicking. systemd will try to restart the hosted engine agent, and since it requires vdsm, systemd may start vdsm while it is being shut down, and you will get this "canceled" message. Basically if you want to stop vdsm you must stop other services requiring it on the host. If we want to avoid this craziness, we need to fix other services not to require vdsm. It is not possible to manage a service when other services require it.
- The fact that Vdsm doesn't stop within the timeout and must be killed. This doesn't happen in my normal oVirt installation. It apparently blocks in self.irs.prepareForShutdown() call from clientIF.py.
This is normal in the last 7-8 years. Vdsm storage shutdown is broken and is likely to get stuck during shutdown. However, It never caused trouble in the past.
Journal says:
systemd[1]: Stopping Virtual Desktop Server Manager... systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing. systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL. systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL. systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL. systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL. systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL systemd[1]: vdsmd.service: Failed with result 'timeout'. systemd[1]: Stopped Virtual Desktop Server Manager.
And vdsm.log (from a different run, sorry): ... 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
This is shutdown flow - good...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
We stop the monitor with the shutdown flag - good...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
...
2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
We stopped checking the domain - good...
2021-11-12 07:09:32,867+0000 INFO (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
Tearing down the storage domain - bad! This is a bug that was hidden for several years since tearing down a storage domain did not really tear down the storage. Now that we really tear down the storage, it will deactivate logical volumes and remove device mapper devices.
2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
This an attempt to deactivate a logical volume that should not happen during shutdown...
2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
And it fails, since the logical volume is in use.
What is interesting here is that ioprocess is started here rather than being killed as on clean vdsmd shutdowns.
This is normal, when systemd kill ioprocess, it try to restart. We need to improve iopprocess retry mechanism to wait before restarting, so it can learn that we are during a shutdown. Basically this is just noise since systemd will kill these ioprocess processes anyway when vdsm terminates. This should fix the wrong teardown during shutdown: https://gerrit.ovirt.org/c/vdsm/+/117616 Didi: this may also be related to: https://bugzilla.redhat.com/1986732 If vdsm is stopped during that flow. During shutdown, it will teardown the hosted engine storage domain. Tearing down the storage domain should happen only when vdsm is asked to deactivate and disconnect from the storage, not during shutdown. If you know how to reproduce the issue, please try with this fix and report it it makes a difference. Thanks for the detailed report! Nir