OST: Vdsm: Occasional failures when stopping vdsmd

Hi,

Michal has observed occasional OST failures in test_vdsm_recovery last days, which hadn't been seen before. When `systemctl stop vdsmd' is called (via Ansible) there, vdsmd (almost?) never finishes its shutdown within the 10 seconds timeout and gets then killed with SIGKILL. If this action is accompanied by "Job for vdsmd.service canceled." message then the test fails; otherwise OST continues normally.

The situation is reproducible by running OST basic-suite-master and making it artificially failing after test_vdsm_recovery. Then running `systemctl stop vdsmd' manually on the given OST host (can be done repeatedly, so it provides a good opportunity to examine the problem).

There are two problems there:

- "Job for vdsmd.service canceled." message that sometimes occurs after `systemctl stop vdsmd' and then the test fails. I don't know what it means and I can't identify any difference in journal between when the message occurs and when it doesn't.

- The fact that Vdsm doesn't stop within the timeout and must be killed. This doesn't happen in my normal oVirt installation. It apparently blocks in self.irs.prepareForShutdown() call from clientIF.py. Journal says:

systemd[1]: Stopping Virtual Desktop Server Manager...
systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
systemd[1]: vdsmd.service: Failed with result 'timeout'.
systemd[1]: Stopped Virtual Desktop Server Manager.

And vdsm.log (from a different run, sorry):

2021-11-12 07:09:30,274+0000 INFO (MainThread) [vdsm.api] START prepareForShutdown() from=internal, task_id=21b12bbd-1d61-4217-b92d-641a53d5f7bb (api:48)
2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [vds] VM channels listener thread has ended. (vmchannels:214)
2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [root] FINISH thread <Thread(vmchannels, stopped daemon 140163095193344)> (concurrent:261)
2021-11-12 07:09:30,516+0000 DEBUG (mailbox-hsm/4) [root] FINISH thread <Thread(mailbox-hsm/4, stopped daemon 140162799089408)> (concurrent:261)
2021-11-12 07:09:30,517+0000 INFO (ioprocess/143197) [IOProcess] (3a729aa1-8e14-4ea0-8794-e3d67fbde542) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,521+0000 INFO (ioprocess/143199) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs_share1) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,535+0000 INFO (ioprocess/143193) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs_share2) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,679+0000 INFO (ioprocess/143195) [IOProcess] (0187cf2f-2344-48de-a2a0-dd007315399f) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,719+0000 INFO (ioprocess/143192) [IOProcess] (15fa3d6c-671b-46ef-af9a-00337011fa26) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,756+0000 INFO (ioprocess/143194) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs_exported) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,768+0000 INFO (ioprocess/143198) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs__he) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,774+0000 INFO (ioprocess/143196) [IOProcess] (a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,957+0000 DEBUG (mailbox-hsm/2) [root] FINISH thread <Thread(mailbox-hsm/2, stopped daemon 140162815874816)> (concurrent:261)
2021-11-12 07:09:31,629+0000 INFO (mailbox-hsm) [storage.mailbox.hsmmailmonitor] HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail (mailbox:500)
2021-11-12 07:09:31,629+0000 INFO (mailbox-hsm) [storage.mailbox.hsmmailmonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/f54c6052-437f-11ec-9094-54527d140533/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=2'] (mailbox:382)
2021-11-12 07:09:32,610+0000 DEBUG (mailbox-hsm/1) [root] FINISH thread <Thread(mailbox-hsm/1, stopped daemon 140162841052928)> (concurrent:261)
2021-11-12 07:09:32,792+0000 DEBUG (mailbox-hsm/3) [root] FINISH thread <Thread(mailbox-hsm/3, stopped daemon 140162807482112)> (concurrent:261)
2021-11-12 07:09:32,818+0000 DEBUG (mailbox-hsm/0) [root] FINISH thread <Thread(mailbox-hsm/0, stopped daemon 140162824267520)> (concurrent:261)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring 3a729aa1-8e14-4ea0-8794-e3d67fbde542 (shutdown=True) (monitor:268)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring 0187cf2f-2344-48de-a2a0-dd007315399f (shutdown=True) (monitor:268)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (shutdown=True) (monitor:268)
2021-11-12 07:09:32,831+0000 INFO (monitor/0187cf2) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_exported/0187cf2f-2344-48de-a2a0-dd007315399f/dom_md/metadata' (check:135)
2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
2021-11-12 07:09:32,844+0000 INFO (monitor/15fa3d6) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs__he/15fa3d6c-671b-46ef-af9a-00337011fa26/dom_md/metadata' (check:135)
2021-11-12 07:09:32,864+0000 INFO (monitor/3a729aa) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share2/3a729aa1-8e14-4ea0-8794-e3d67fbde542/dom_md/metadata' (check:135)
2021-11-12 07:09:32,865+0000 INFO (monitor/186180c) [storage.check] Stop checking '/dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/metadata' (check:135)
2021-11-12 07:09:32,867+0000 INFO (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
2021-11-12 07:09:38,066+0000 WARN (monitor/186180c) [storage.lvm] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'vgchange\', \'--config\', \'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }\', \'--available\', \'n\', \'186180cb-5cc5-4aa4-868a-9e1ed7965ddf\'] rc=5 out=[] err=[\' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\', \' Can\\\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\']' (lvm:532)
2021-11-12 07:09:38,067+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405d40a7e761def4ed09aac63282$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|^/dev/mapper/36001405ece10ae48e99470295edda1bb$|^/dev/mapper/36001405f269e8e9efb34e5f9170b349b$|^/dev/mapper/36001405f8c5fe6c239e46a0976b842df$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)

What is interesting here is that ioprocess is started here rather than being killed as on clean vdsmd shutdowns.

Any ideas?

Thanks, Milan
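For anyone who wants to poke at this outside OST, a minimal sketch of such a manual reproduction loop (not part of OST or vdsm; it assumes root on the OST host and that repeatedly restarting vdsmd there is acceptable):

# Throwaway reproduction sketch (not part of OST): time `systemctl stop vdsmd`
# and record whether the stop job was canceled. Run as root on the OST host.
import subprocess
import time

def stop_vdsmd_once():
    start = time.monotonic()
    proc = subprocess.run(["systemctl", "stop", "vdsmd"],
                          capture_output=True, text=True)
    elapsed = time.monotonic() - start
    # "Job for vdsmd.service canceled." is printed by systemctl when the stop job fails.
    canceled = "canceled" in (proc.stdout + proc.stderr)
    return elapsed, proc.returncode, canceled

for i in range(5):
    subprocess.run(["systemctl", "start", "vdsmd"], check=True)
    time.sleep(60)  # give vdsm time to recover and reconnect to storage
    elapsed, rc, canceled = stop_vdsmd_once()
    print(f"run {i}: stop took {elapsed:.1f}s, rc={rc}, canceled={canceled}")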

This needs to be fixed ASAP, it's been failing very frequently for 2+ days. What's also changing is the underlying OS, of course, so it doesn't necessarily have to be a Vdsm change.

On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal <mzamazal@redhat.com> wrote:
Hi,
Michal has observed occasional OST failures in test_vdsm_recovery last days, which hadn't been seen before. When `systemctl stop vdsmd' is called (via Ansible) there, vdsmd (almost?) never finishes its shutdown within the 10 seconds timeout and gets then killed with SIGKILL. If this action is accompanied by "Job for vdsmd.service canceled." message then the test fails; otherwise OST continues normally.
The situation is reproducible by running OST basic-suite-master and making it artificially failing after test_vdsm_recovery. Then running `systemctl stop vdsmd' manually on the given OST host (can be done repeatedly, so it provides a good opportunity to examine the problem).
There are two problems there:
- "Job for vdsmd.service canceled." message that sometimes occurs after `systemctl stop vdsmd' and then the test fails. I don't know what it means and I can't identify any difference in journal between when the message occurs and when it doesn't.
This is likely caused by the hosted engine agent discovering that the hosted engine storage domain was deactivated and panicking. systemd will then try to restart the hosted engine agent, and since it requires vdsm, systemd may start vdsm while it is being shut down, and you will get this "canceled" message.

Basically, if you want to stop vdsm, you must first stop the other services on the host that require it.

If we want to avoid this craziness, we need to fix the other services not to require vdsm. It is not possible to manage a service cleanly while other services require it.
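To make that concrete, a minimal sketch of stopping the dependent services first, assuming the hosted engine services are the ovirt-ha-agent and ovirt-ha-broker units (not something the test does today; adjust the names for the actual host):

# Sketch only: stop the units that require vdsmd before stopping vdsmd itself,
# so systemd does not restart vdsmd behind our back and cancel the stop job.
# The unit names are assumptions for a hosted engine host.
import subprocess

DEPENDENT_UNITS = ["ovirt-ha-agent", "ovirt-ha-broker"]  # assumed names

def stop_unit(unit):
    # systemctl returns non-zero if the stop job fails or is canceled.
    return subprocess.run(["systemctl", "stop", unit], check=False).returncode

def stop_vdsmd_without_interference():
    for unit in DEPENDENT_UNITS:
        print(f"stop {unit}: rc={stop_unit(unit)}")
    rc = stop_unit("vdsmd")
    print(f"stop vdsmd: rc={rc}")
    return rc == 0

if __name__ == "__main__":
    raise SystemExit(0 if stop_vdsmd_without_interference() else 1)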
- The fact that Vdsm doesn't stop within the timeout and must be killed. This doesn't happen in my normal oVirt installation. It apparently blocks in self.irs.prepareForShutdown() call from clientIF.py.
This has been normal for the last 7-8 years: Vdsm storage shutdown is broken and is likely to get stuck during shutdown. However, it never caused trouble in the past.
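As a toy illustration of this failure mode (not vdsm code): a shutdown path that joins worker threads without a timeout hangs as soon as one worker is stuck on unresponsive storage, which is essentially what prepareForShutdown() runs into; a bounded join at least lets the caller report who is still stuck.

# Toy illustration, not vdsm code: joining workers without a timeout hangs
# forever when one of them is blocked (e.g. on storage I/O); a bounded join
# lets the shutdown path give up and report which workers are still alive.
import threading
import time

def stuck_worker():
    time.sleep(3600)  # stand-in for a thread blocked on unresponsive storage

def shut_down(workers, timeout=None):
    deadline = None if timeout is None else time.monotonic() + timeout
    for w in workers:
        remaining = None if deadline is None else max(0.0, deadline - time.monotonic())
        w.join(remaining)
    return [w.name for w in workers if w.is_alive()]

workers = [threading.Thread(target=stuck_worker, name="monitor/186180c", daemon=True)]
for w in workers:
    w.start()

# shut_down(workers)              # unbounded: hangs, like the real shutdown
print(shut_down(workers, 2.0))    # bounded: prints ['monitor/186180c'] after ~2s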
Journal says:
systemd[1]: Stopping Virtual Desktop Server Manager...
systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
systemd[1]: vdsmd.service: Failed with result 'timeout'.
systemd[1]: Stopped Virtual Desktop Server Manager.
And vdsm.log (from a different run, sorry):

...

2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
This is shutdown flow - good...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
We stop the monitor with the shutdown flag - good...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
...
2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
We stopped checking the domain - good...
2021-11-12 07:09:32,867+0000 INFO (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
Tearing down the storage domain - bad! This is a bug that was hidden for several years since tearing down a storage domain did not really tear down the storage. Now that we really tear down the storage, it will deactivate logical volumes and remove device mapper devices.
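For readers who haven't looked at the block storage domain code, this is roughly what the teardown implies at the LVM level (a sketch of the equivalent command, not vdsm's actual code path, which goes through its own LVM wrapper and a per-domain device filter):

# Illustration, not vdsm's code: deactivating the domain's VG is what removes
# the /dev/mapper devices of its LVs, and it fails while any LV is still open
# (here the "ids" lease volume, presumably held by sanlock). Requires root;
# the VG name is the one from the log and is only an example.
import subprocess

def deactivate_vg(vg_name):
    return subprocess.run(["lvm", "vgchange", "--available", "n", vg_name],
                          capture_output=True, text=True)

result = deactivate_vg("186180cb-5cc5-4aa4-868a-9e1ed7965ddf")
print(result.returncode)
print(result.stderr.strip())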
2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
This is an attempt to deactivate a logical volume, which should not happen during shutdown...
2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
And it fails, since the logical volume is in use.
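An ad-hoc way to see which logical volume keeps the VG busy in such a case (not part of OST or vdsm; requires root, and the VG name is just the one from the log):

# Ad-hoc diagnostic: list the LVs of a VG whose device is open, i.e. the ones
# that block `vgchange --available n`.
import subprocess

def open_lvs(vg_name):
    out = subprocess.run(
        ["lvs", "--noheadings", "-o", "lv_name,lv_attr", vg_name],
        capture_output=True, text=True, check=True).stdout
    opened = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 2 and len(parts[1]) >= 6 and parts[1][5] == "o":
            opened.append(parts[0])  # 6th lv_attr character: (o)pen device
    return opened

print(open_lvs("186180cb-5cc5-4aa4-868a-9e1ed7965ddf"))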
What is interesting here is that ioprocess is started here rather than being killed as on clean vdsmd shutdowns.
This is normal: when systemd kills ioprocess, it tries to restart. We need to improve the ioprocess retry mechanism to wait before restarting, so it can learn that we are in the middle of a shutdown. Basically this is just noise, since systemd will kill these ioprocess processes anyway when vdsm terminates.

This should fix the wrong teardown during shutdown: https://gerrit.ovirt.org/c/vdsm/+/117616

Didi: this may also be related to https://bugzilla.redhat.com/1986732, if vdsm is stopped during that flow. During shutdown, it will tear down the hosted engine storage domain. Tearing down the storage domain should happen only when vdsm is asked to deactivate and disconnect from the storage, not during shutdown.

If you know how to reproduce the issue, please try with this fix and report if it makes a difference.

Thanks for the detailed report!

Nir
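For context, a minimal sketch of the kind of guard such a fix needs, with made-up names (the actual change is the gerrit patch linked above): the monitor already receives shutdown=True, so it can skip the teardown in that case.

# Minimal sketch of the idea, with made-up names; the real change is the
# gerrit patch above. The monitor is already told shutdown=True (see
# "Stop monitoring ... (shutdown=True)" in the log), so it can leave the
# storage alone when the whole daemon is going down and only tear down on an
# explicit deactivate/disconnect.
class DomainMonitor:
    def __init__(self, domain):
        self.domain = domain

    def stop(self, shutdown=False):
        self._stop_checking()
        if shutdown:
            # vdsm is shutting down: leave LVs and device-mapper devices alone.
            return
        # Explicit deactivation/disconnect requested: really tear down.
        self.domain.teardown()

    def _stop_checking(self):
        # Placeholder for stopping the path checkers (storage.check in vdsm).
        pass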

Nir Soffer <nsoffer@redhat.com> writes:
On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal <mzamazal@redhat.com> wrote:
Hi,
Michal has observed occasional OST failures in test_vdsm_recovery last days, which hadn't been seen before. When `systemctl stop vdsmd' is called (via Ansible) there, vdsmd (almost?) never finishes its shutdown within the 10 seconds timeout and gets then killed with SIGKILL. If this action is accompanied by "Job for vdsmd.service canceled." message then the test fails; otherwise OST continues normally.
The situation is reproducible by running OST basic-suite-master and making it artificially failing after test_vdsm_recovery. Then running `systemctl stop vdsmd' manually on the given OST host (can be done repeatedly, so it provides a good opportunity to examine the problem).
There are two problems there:
- "Job for vdsmd.service canceled." message that sometimes occurs after `systemctl stop vdsmd' and then the test fails. I don't know what it means and I can't identify any difference in journal between when the message occurs and when it doesn't.
This is likely caused by the hosted engine agent discovering that the hosted engine storage domain was deactivated and panicking. systemd will then try to restart the hosted engine agent, and since it requires vdsm, systemd may start vdsm while it is being shut down, and you will get this "canceled" message.
Basically, if you want to stop vdsm, you must first stop the other services on the host that require it.
If we want to avoid this craziness, we need to fix the other services not to require vdsm. It is not possible to manage a service cleanly while other services require it.
- The fact that Vdsm doesn't stop within the timeout and must be killed. This doesn't happen in my normal oVirt installation. It apparently blocks in self.irs.prepareForShutdown() call from clientIF.py.
This has been normal for the last 7-8 years: Vdsm storage shutdown is broken and is likely to get stuck during shutdown. However, it never caused trouble in the past.
Journal says:
systemd[1]: Stopping Virtual Desktop Server Manager...
systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
systemd[1]: vdsmd.service: Failed with result 'timeout'.
systemd[1]: Stopped Virtual Desktop Server Manager.
And vdsm.log (from a different run, sorry):

...

2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
This is shutdown flow - good...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
We stop the monitor with the shutdown flag - good...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
...
2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
We stopped checking the domain - good...
2021-11-12 07:09:32,867+0000 INFO (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
Tearing down the storage domain - bad!
This is a bug that was hidden for several years since tearing down a storage domain did not really tear down the storage. Now that we really tear down the storage, it will deactivate logical volumes and remove device mapper devices.
2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
This is an attempt to deactivate a logical volume, which should not happen during shutdown...
2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
And it fails, since the logical volume is in use.
What is interesting here is that ioprocess is started here rather than being killed as on clean vdsmd shutdowns.
This is normal: when systemd kills ioprocess, it tries to restart. We need to improve the ioprocess retry mechanism to wait before restarting, so it can learn that we are in the middle of a shutdown. Basically this is just noise, since systemd will kill these ioprocess processes anyway when vdsm terminates.
This should fix the wrong teardown during shutdown: https://gerrit.ovirt.org/c/vdsm/+/117616
Didi: this may also be related to: https://bugzilla.redhat.com/1986732
If vdsm is stopped during that flow, then during shutdown it will tear down the hosted engine storage domain. Tearing down the storage domain should happen only when vdsm is asked to deactivate and disconnect from the storage, not during shutdown.
Thank you for all the explanations, it makes sense now.
If you know how to reproduce the issue, please try with this fix and report if it makes a difference.
vdsmd stops cleanly now in an environment where it previously didn't, thank you for the fix.

It cannot be reliably tested with OST itself because ost-*-host-1, on which the test usually runs, is a Node host, which prevents package upgrades from normal repos, so the repo passed via --custom-repo is effectively ignored :-O (I wonder what we actually test in gating on Jenkins). The best way to test whether the shutdown fix avoids the OST failures would be to merge the patch, I think.
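A quick, ad-hoc way to check which vdsm build each OST host actually ended up with (the host names below are only examples following the OST naming pattern, and this is not OST code):

# Ad-hoc check: confirm the installed vdsm build on each host, since
# --custom-repo can be silently ineffective on Node hosts.
import subprocess

HOSTS = ["ost-basic-suite-master-host-0", "ost-basic-suite-master-host-1"]  # example names

for host in HOSTS:
    proc = subprocess.run(["ssh", f"root@{host}", "rpm", "-q", "vdsm"],
                          capture_output=True, text=True)
    print(host, (proc.stdout or proc.stderr).strip())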
Thanks for the detailed report!
Nir

On Mon, Nov 15, 2021 at 1:01 PM Milan Zamazal <mzamazal@redhat.com> wrote:
Nir Soffer <nsoffer@redhat.com> writes:
On Fri, Nov 12, 2021 at 9:15 PM Milan Zamazal <mzamazal@redhat.com> wrote:
Hi,
Michal has observed occasional OST failures in test_vdsm_recovery last days, which hadn't been seen before. When `systemctl stop vdsmd' is called (via Ansible) there, vdsmd (almost?) never finishes its shutdown within the 10 seconds timeout and gets then killed with SIGKILL. If this action is accompanied by "Job for vdsmd.service canceled." message then the test fails; otherwise OST continues normally.
The situation is reproducible by running OST basic-suite-master and making it artificially failing after test_vdsm_recovery. Then running `systemctl stop vdsmd' manually on the given OST host (can be done repeatedly, so it provides a good opportunity to examine the problem).
There are two problems there:
- "Job for vdsmd.service canceled." message that sometimes occurs after `systemctl stop vdsmd' and then the test fails. I don't know what it means and I can't identify any difference in journal between when the message occurs and when it doesn't.
This is likely caused by the hosted engine agent discovering that the hosted engine storage domain was deactivated and panicking. systemd will then try to restart the hosted engine agent, and since it requires vdsm, systemd may start vdsm while it is being shut down, and you will get this "canceled" message.
Basically, if you want to stop vdsm, you must first stop the other services on the host that require it.
If we want to avoid this craziness, we need to fix the other services not to require vdsm. It is not possible to manage a service cleanly while other services require it.
- The fact that Vdsm doesn't stop within the timeout and must be killed. This doesn't happen in my normal oVirt installation. It apparently blocks in self.irs.prepareForShutdown() call from clientIF.py.
This has been normal for the last 7-8 years: Vdsm storage shutdown is broken and is likely to get stuck during shutdown. However, it never caused trouble in the past.
Journal says:
systemd[1]: Stopping Virtual Desktop Server Manager...
systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
systemd[1]: vdsmd.service: Failed with result 'timeout'.
systemd[1]: Stopped Virtual Desktop Server Manager.
And vdsm.log (from a different run, sorry):

...

2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
This is shutdown flow - good...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
We stop the monitor with the shutdown flag - good...
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
...
2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
We stopped checking the domain - good...
2021-11-12 07:09:32,867+0000 INFO (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
Tearing down the storage domain - bad!
This is a bug that was hidden for several years since tearing down a storage domain did not really tear down the storage. Now that we really tear down the storage, it will deactivate logical volumes and remove device mapper devices.
2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
This is an attempt to deactivate a logical volume, which should not happen during shutdown...
2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
And it fails, since the logical volume is in use.
What is interesting here is that ioprocess is started here rather than being killed as on clean vdsmd shutdowns.
This is normal: when systemd kills ioprocess, it tries to restart. We need to improve the ioprocess retry mechanism to wait before restarting, so it can learn that we are in the middle of a shutdown. Basically this is just noise, since systemd will kill these ioprocess processes anyway when vdsm terminates.
This should fix the wrong teardown during shutdown: https://gerrit.ovirt.org/c/vdsm/+/117616
Didi: this may also be related to: https://bugzilla.redhat.com/1986732
If vdsm is stopped during that flow, then during shutdown it will tear down the hosted engine storage domain. Tearing down the storage domain should happen only when vdsm is asked to deactivate and disconnect from the storage, not during shutdown.
Thank you for all the explanations, it makes sense now.
If you know how to reproduce the issue, please try with this fix and report if it makes a difference.
vdsmd stops cleanly now in an environment where it previously didn't, thank you for the fix.
It cannot be reliably tested with OST itself because ost-*-host-1, on which the test usually runs, is a Node host, which prevents package upgrades from normal repos, so the repo passed via --custom-repo is effectively ignored :-O (I wonder what we actually test in gating on Jenkins). The best way to test whether the shutdown fix avoids the OST failures would be to merge the patch, I think.
We need a bug for this; it is a regression in ~4.4.6 and needs to be backported to 4.4.10. Since you have all the details, can you file a bug?

Nir Soffer <nsoffer@redhat.com> writes:
We need a bug for this; it is a regression in ~4.4.6 and needs to be backported to 4.4.10.
Since you have all the details, can you file a bug?
participants (3):
- Michal Skrivanek
- Milan Zamazal
- Nir Soffer