On Mon, Nov 15, 2021 at 1:01 PM Milan Zamazal <mzamazal(a)redhat.com> wrote:
Nir Soffer <nsoffer(a)redhat.com> writes:
> On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal <mzamazal(a)redhat.com> wrote:
>>
>> Hi,
>
>>
>> Michal has observed occasional OST failures in test_vdsm_recovery in
>> recent days, which hadn't been seen before. When `systemctl stop vdsmd'
>> is called there (via Ansible), vdsmd (almost?) never finishes its
>> shutdown within the 10 seconds timeout and is then killed with SIGKILL.
>> If this action is accompanied by a "Job for vdsmd.service canceled."
>> message, then the test fails; otherwise OST continues normally.
>>
>> The situation is reproducible by running OST basic-suite-master and
>> making it fail artificially after test_vdsm_recovery. Then `systemctl
>> stop vdsmd' can be run manually on the given OST host (repeatedly, so
>> it provides a good opportunity to examine the problem).
>>
>> There are two problems there:
>>
>> - "Job for vdsmd.service canceled." message that sometimes occurs
after
>> `systemctl stop vdsmd' and then the test fails. I don't know what it
>> means and I can't identify any difference in journal between when the
>> message occurs and when it doesn't.
>
> This is likely caused by the hosted engine agent discovering that the
> hosted engine storage domain was deactivated and panicking. systemd will
> try to restart the hosted engine agent, and since it requires vdsm,
> systemd may start vdsm while it is being shut down, and you will get
> this "canceled" message.
>
> Basically if you want to stop vdsm you must stop other services requiring it
> on the host.
>
> If we want to avoid this craziness, we need to fix other services not to require
> vdsm. It is not possible to manage a service when other services require it.
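>
> For illustration only (a sketch of the mechanism, not the actual
> ovirt-ha-agent unit file), a dependent service declared like this:
>
>     # hypothetical dependent.service
>     [Unit]
>     Requires=vdsmd.service
>     After=vdsmd.service
>
>     [Service]
>     Restart=on-failure
>
> is restarted by systemd when it fails, and that restart pulls
> vdsmd.service back into the transaction, cancelling a stop job that is
> still in progress. The units requiring vdsmd on a host can be listed
> with "systemctl list-dependencies --reverse vdsmd.service".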
>
>> - The fact that Vdsm doesn't stop within the timeout and must be killed.
>> This doesn't happen in my normal oVirt installation. It apparently
>> blocks in the self.irs.prepareForShutdown() call from clientIF.py.
>
> This has been normal for the last 7-8 years. Vdsm storage shutdown is
> broken and is likely to get stuck during shutdown. However, it never
> caused trouble in the past.
>
>> Journal says:
>>
>> systemd[1]: Stopping Virtual Desktop Server Manager...
>> systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
>> systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
>> systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
>> systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
>> systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
>> systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
>> systemd[1]: vdsmd.service: Failed with result 'timeout'.
>> systemd[1]: Stopped Virtual Desktop Server Manager.
>>
>> And vdsm.log (from a different run, sorry):
> ...
>> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor]
>> Shutting down domain monitors (monitor:243)
>
> This is shutdown flow - good...
>
>> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor]
>> Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True)
>> (monitor:268)
>
> We stop the monitor with the shutdown flag - good...
>
>> 2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor]
>> Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True)
>> (monitor:268)
> ...
>
>> 2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e)
>> [storage.check] Stop checking
>>
>> '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata'
>> (check:135)
>
> We stopped checking the domain - good...
>
>> 2021-11-12 07:09:32,867+0000 INFO (monitor/186180c)
>> [storage.blocksd] Tearing down domain
>> 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
>
> Tearing down the storage domain - bad!
>
> This is a bug that was hidden for several years, since tearing down a
> storage domain did not really tear down the storage. Now that we really
> tear down the storage, it will deactivate logical volumes and remove
> device mapper devices.
>
>> 2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c)
>> [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n
>> /sbin/lvm vgchange --config 'devices {
>> preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1
>> write_cache_state=0 disable_after_error_count=3
>> filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|",
>> "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global {
>> locking_type=1 prioritise_write_locks=1 wait_for_locks=1
>> use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0
>> }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None)
>> (commands:154)
>
> This is an attempt to deactivate a logical volume that should not
> happen during shutdown...
>
>> 2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c)
>> [common.commands] FAILED: <err> = b' Logical volume
>> 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n Can\'t deactivate
>> volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open
>> logical volume(s)\n'; <rc> = 5 (commands:186)
>
> And it fails, since the ids logical volume is still in use (it typically
> holds the sanlock lockspace), so the volume group cannot be deactivated.
>
>> What is interesting here is that ioprocess is started rather than being
>> killed, as it would be on a clean vdsmd shutdown.
>
> This is normal: when systemd kills ioprocess, vdsm tries to restart it.
> We need to improve the ioprocess retry mechanism to wait before
> restarting, so it can learn that we are in the middle of a shutdown.
> Basically this is just noise, since systemd will kill these ioprocess
> processes anyway when vdsm terminates.
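>
> A minimal sketch of that idea (hypothetical code, not the real ioprocess
> client; the names RESTART_DELAY, start_process and shutdown_event are
> made up for illustration):
>
>     RESTART_DELAY = 5  # seconds to wait before restarting a killed child
>
>     def run_child(start_process, shutdown_event):
>         """Keep restarting the child until shutdown_event (a
>         threading.Event set from the shutdown path) is set."""
>         while not shutdown_event.is_set():
>             proc = start_process()  # e.g. subprocess.Popen(...)
>             proc.wait()
>             # The child died; wait before restarting so that a shutdown
>             # requested in the meantime is noticed instead of spawning a
>             # new child that systemd will only kill again.
>             if shutdown_event.wait(RESTART_DELAY):
>                 break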
>
> This should fix the wrong teardown during shutdown:
>
> https://gerrit.ovirt.org/c/vdsm/+/117616
>
> Didi: this may also be related to:
>
> https://bugzilla.redhat.com/1986732
>
> If vdsm is stopped during that flow, then during shutdown it will tear
> down the hosted engine storage domain. Tearing down the storage domain
> should happen only when vdsm is asked to deactivate and disconnect from
> the storage, not during shutdown.
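>
> In other words, the shutdown flag the monitor already receives (the
> "Stop monitoring ... (shutdown=True)" lines above) should prevent the
> teardown. A rough sketch of that logic (illustrative only, not the
> actual patch; the class and method names are made up):
>
>     class DomainMonitor:
>         def __init__(self, domain, checker):
>             self._domain = domain
>             self._checker = checker
>
>         def stop(self, shutdown=False):
>             # Always stop the periodic domain checks.
>             self._checker.stop()
>             if shutdown:
>                 # vdsm is going down but the host stays connected to the
>                 # storage: leave logical volumes and device mapper
>                 # devices in place.
>                 return
>             # Explicit deactivate/disconnect: really release the storage.
>             self._domain.teardown()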
Thank you for all the explanations, it makes sense now.
> If you know how to reproduce the issue, please try with this fix
> and report if it makes a difference.
vdsmd stops cleanly now in an environment where it previously didn't,
thank you for the fix.
It cannot be reliably tested with OST itself because ost-*-host-1, on
which the test usually runs, is a Node host, which prevents package
upgrades from normal repos, so the repo passed via --custom-repo is
effectively ignored :-O (I wonder what we actually test in gating on
Jenkins). The best way to test whether the shutdown fix avoids the OST
failures would be to merge the patch, I think.
We need a bug for this; it is a regression in ~4.4.6 and needs to be
backported to 4.4.10.
Since you have all the details, can you file a bug?