[ovirt-devel] Re: OST: Vdsm: Occasional failures when stopping vdsmd

15 Nov 2021

      Nir Soffer <nsoffer@redhat.com> writes:
...
On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal <mzamazal@redhat.com> wrote:
...
Hi,
...
Michal has observed occasional OST failures in test_vdsm_recovery last
days, which hadn't been seen before.  When `systemctl stop vdsmd' is
called (via Ansible) there, vdsmd (almost?) never finishes its shutdown
within the 10 seconds timeout and gets then killed with SIGKILL.  If
this action is accompanied by "Job for vdsmd.service canceled." message
then the test fails; otherwise OST continues normally.
The situation is reproducible by running OST basic-suite-master and
making it artificially failing after test_vdsm_recovery.  Then running
`systemctl stop vdsmd' manually on the given OST host (can be done
repeatedly, so it provides a good opportunity to examine the problem).
There are two problems there:
- "Job for vdsmd.service canceled." message that sometimes occurs after
  `systemctl stop vdsmd' and then the test fails.  I don't know what it
  means and I can't identify any difference in journal between when the
  message occurs and when it doesn't.
This is likely caused by hosted engine agent discovering that hosted engine
storage domain was deactivated and panicking. systemd will try to restart
the hosted engine agent, and since it requires vdsm, systemd may start vdsm
while it is being shut down, and you will get this "canceled" message.
Basically if you want to stop vdsm you must stop other services requiring it
on the host.
If we want to avoid this craziness, we need to fix other services not to require
vdsm. It is not possible to manage a service when other services require it.
...
- The fact that Vdsm doesn't stop within the timeout and must be killed.
  This doesn't happen in my normal oVirt installation.  It apparently
  blocks in self.irs.prepareForShutdown() call from clientIF.py.
This is normal in the last 7-8 years. Vdsm storage shutdown is broken
and is likely
to get stuck during shutdown. However, It never caused trouble in the past.
...
Journal says:
systemd[1]: Stopping Virtual Desktop Server Manager...
    systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
    systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
    systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
    systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
    systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
    systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
    systemd[1]: vdsmd.service: Failed with result 'timeout'.
    systemd[1]: Stopped Virtual Desktop Server Manager.
And vdsm.log (from a different run, sorry):
...
    2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor]
Shutting down domain monitors (monitor:243)
This is shutdown flow - good...
...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor]
Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True)
(monitor:268)
We stop the monitor with the shutdown flag - good...
...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor]
Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True)
(monitor:268)
...
...
2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e)
[storage.check] Stop checking
'/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata'
(check:135)
We stopped checking the domain - good...
...
2021-11-12 07:09:32,867+0000 INFO (monitor/186180c)
[storage.blocksd] Tearing down domain
186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
Tearing down the storage domain - bad!
This is a bug that was hidden for several years since tearing down a
storage domain
did not really tear down the storage. Now that we really tear down the
storage, it will
deactivate logical volumes and remove device mapper devices.
...
2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c)
[common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n
/sbin/lvm vgchange --config 'devices {
preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1
write_cache_state=0 disable_after_error_count=3
filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|",
"r|.*|"] hints="none" obtain_device_list_from_udev=0 } global {
locking_type=1 prioritise_write_locks=1 wait_for_locks=1
use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0
}' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None)
(commands:154)
This an attempt to deactivate a logical volume that should not happen
during shutdown...
...
2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c)
[common.commands] FAILED: <err> = b' Logical volume
186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n Can\'t deactivate
volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open
logical volume(s)\n'; <rc> = 5 (commands:186)
And it fails, since the logical volume is in use.
...
What is interesting here is that ioprocess is started here rather than
  being killed as on clean vdsmd shutdowns.
This is normal, when systemd kill ioprocess, it try to restart. We need to
improve iopprocess retry mechanism to wait before restarting, so it can
learn that we are during a shutdown. Basically this is just noise since systemd
will kill these ioprocess processes anyway when vdsm terminates.
This should fix the wrong teardown during shutdown:
https://gerrit.ovirt.org/c/vdsm/+/117616
Didi: this may also be related to:
https://bugzilla.redhat.com/1986732
If vdsm is stopped during that flow. During shutdown, it will teardown
the hosted
engine storage domain. Tearing down the storage domain should happen only when
vdsm is asked to deactivate and disconnect from the storage, not
during shutdown.
Thank you for all the explanations, it makes sense now.
...
If you know how to reproduce the issue, please try with this fix
and report it it makes a difference.
vdsmd stops cleanly now in an environment where it previously didn't,
thank you for the fix.

It cannot be reliably tested with OST itself because ost-*-host-1, on
which the test usually runs, is a Node host preventing package upgrades
from normal repos and thus repo passed via --custom-repo is effectively
ignored :-O (I wonder what we actually test in gating on Jenkins).  The
best way to test whether the shutdown fix avoids the OST failures would
be to merge the patch, I think.
...
Thanks for the detailed report!
Nir