This needs to be fixed ASAP; it's been failing very frequently for 2+ days.
What's also changing is the underlying OS, of course, so it doesn't
necessarily have to be a Vdsm change.
On 12. 11. 2021, at 20:15, Milan Zamazal <mzamazal(a)redhat.com>
wrote:
Hi,
Michal has observed occasional OST failures in test_vdsm_recovery in
recent days, which hadn't been seen before. When `systemctl stop vdsmd'
is called there (via Ansible), vdsmd (almost?) never finishes its
shutdown within the 10-second timeout and is then killed with SIGKILL.
If this action is accompanied by a "Job for vdsmd.service canceled."
message, the test fails; otherwise OST continues normally.
The situation is reproducible by running OST basic-suite-master and
making it fail artificially after test_vdsm_recovery. Then `systemctl
stop vdsmd' can be run manually on the given OST host (repeatedly, so
it provides a good opportunity to examine the problem), roughly as in
the sketch below.
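For reference, a minimal sketch of that manual loop (my own, not part
of OST; the restart between attempts and the 30-second settle delay are
guesses, and it assumes root on the OST host):

    #!/usr/bin/env python3
    # Rough reproduction sketch: stop vdsmd repeatedly, time each stop
    # and watch for the "canceled" message from systemctl.
    import subprocess
    import time

    for attempt in range(5):
        subprocess.run(["systemctl", "start", "vdsmd"], check=True)
        time.sleep(30)  # guessed settle time after Vdsm recovery

        start = time.monotonic()
        result = subprocess.run(["systemctl", "stop", "vdsmd"],
                                capture_output=True, text=True)
        elapsed = time.monotonic() - start

        canceled = "canceled" in (result.stdout + result.stderr)
        print(f"attempt {attempt}: stop took {elapsed:.1f}s, "
              f"rc={result.returncode}, canceled={canceled}")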
There are two problems here:
- The "Job for vdsmd.service canceled." message that sometimes appears
after `systemctl stop vdsmd', in which case the test fails. I don't
know what it means, and I can't identify any difference in the journal
between runs where the message occurs and runs where it doesn't (see
the first sketch below this list).
- The fact that Vdsm doesn't stop within the timeout and must be
killed. This doesn't happen in my normal oVirt installation. It
apparently blocks in the self.irs.prepareForShutdown() call from
clientIF.py (see the second sketch below).
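One thing worth checking for the first problem: systemd reports a job
as "canceled" when it is withdrawn or replaced by a conflicting job
before completion, so catching the pending job queue right when the
message appears might show what canceled the stop. A sketch (assuming
root on the host; output formats may vary):

    #!/usr/bin/env python3
    # Sketch: stop vdsmd and, when systemctl reports a canceled job,
    # dump the pending systemd job queue and recent journal lines.
    import subprocess

    result = subprocess.run(["systemctl", "stop", "vdsmd"],
                            capture_output=True, text=True)
    if "canceled" in result.stderr:
        jobs = subprocess.run(["systemctl", "list-jobs"],
                              capture_output=True, text=True)
        journal = subprocess.run(
            ["journalctl", "-u", "vdsmd", "-n", "50", "--no-pager"],
            capture_output=True, text=True)
        print("stop stderr:", result.stderr)
        print("pending jobs:\n" + jobs.stdout)
        print("journal tail:\n" + journal.stdout)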
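For the second problem, a stack dump of the hung process taken during
the stop window could show where prepareForShutdown() blocks. A sketch
using py-spy, assuming it is available on the host (it is not part of
OST, and the 5-second delay is a guess):

    #!/usr/bin/env python3
    # Sketch: start stopping vdsmd, then dump Python stacks of the main
    # process to see where prepareForShutdown() blocks (needs py-spy).
    import subprocess
    import time

    pid = subprocess.run(
        ["systemctl", "show", "vdsmd", "-p", "MainPID", "--value"],
        capture_output=True, text=True).stdout.strip()
    subprocess.Popen(["systemctl", "stop", "vdsmd"])  # do not wait
    time.sleep(5)  # let the shutdown get stuck first
    subprocess.run(["py-spy", "dump", "--pid", pid])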
As for the second problem, the journal says:
systemd[1]: Stopping Virtual Desktop Server Manager...
systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
systemd[1]: vdsmd.service: Failed with result 'timeout'.
systemd[1]: Stopped Virtual Desktop Server Manager.
And vdsm.log (from a different run, sorry):
2021-11-12 07:09:30,274+0000 INFO (MainThread) [vdsm.api] START prepareForShutdown()
from=internal, task_id=21b12bbd-1d61-4217-b92d-641a53d5f7bb (api:48)
2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [vds] VM channels listener thread has
ended. (vmchannels:214)
2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [root] FINISH thread
<Thread(vmchannels, stopped daemon 140163095193344)> (concurrent:261)
2021-11-12 07:09:30,516+0000 DEBUG (mailbox-hsm/4) [root] FINISH thread
<Thread(mailbox-hsm/4, stopped daemon 140162799089408)> (concurrent:261)
2021-11-12 07:09:30,517+0000 INFO (ioprocess/143197) [IOProcess]
(3a729aa1-8e14-4ea0-8794-e3d67fbde542) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,521+0000 INFO (ioprocess/143199) [IOProcess]
(ost-he-basic-suite-master-storage:_exports_nfs_share1) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,535+0000 INFO (ioprocess/143193) [IOProcess]
(ost-he-basic-suite-master-storage:_exports_nfs_share2) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,679+0000 INFO (ioprocess/143195) [IOProcess]
(0187cf2f-2344-48de-a2a0-dd007315399f) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,719+0000 INFO (ioprocess/143192) [IOProcess]
(15fa3d6c-671b-46ef-af9a-00337011fa26) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,756+0000 INFO (ioprocess/143194) [IOProcess]
(ost-he-basic-suite-master-storage:_exports_nfs_exported) Starting ioprocess
(__init__:465)
2021-11-12 07:09:30,768+0000 INFO (ioprocess/143198) [IOProcess]
(ost-he-basic-suite-master-storage:_exports_nfs__he) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,774+0000 INFO (ioprocess/143196) [IOProcess]
(a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c) Starting ioprocess (__init__:465)
2021-11-12 07:09:30,957+0000 DEBUG (mailbox-hsm/2) [root] FINISH thread
<Thread(mailbox-hsm/2, stopped daemon 140162815874816)> (concurrent:261)
2021-11-12 07:09:31,629+0000 INFO (mailbox-hsm) [storage.mailbox.hsmmailmonitor]
HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail
(mailbox:500)
2021-11-12 07:09:31,629+0000 INFO (mailbox-hsm) [storage.mailbox.hsmmailmonitor]
HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd',
'of=/rhev/data-center/f54c6052-437f-11ec-9094-54527d140533/mastersd/dom_md/inbox',
'iflag=fullblock', 'oflag=direct', 'conv=notrunc',
'bs=4096', 'count=1', 'seek=2'] (mailbox:382)
2021-11-12 07:09:32,610+0000 DEBUG (mailbox-hsm/1) [root] FINISH thread
<Thread(mailbox-hsm/1, stopped daemon 140162841052928)> (concurrent:261)
2021-11-12 07:09:32,792+0000 DEBUG (mailbox-hsm/3) [root] FINISH thread
<Thread(mailbox-hsm/3, stopped daemon 140162807482112)> (concurrent:261)
2021-11-12 07:09:32,818+0000 DEBUG (mailbox-hsm/0) [root] FINISH thread
<Thread(mailbox-hsm/0, stopped daemon 140162824267520)> (concurrent:261)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Shutting down domain
monitors (monitor:243)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring
15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring
a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring
3a729aa1-8e14-4ea0-8794-e3d67fbde542 (shutdown=True) (monitor:268)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring
0187cf2f-2344-48de-a2a0-dd007315399f (shutdown=True) (monitor:268)
2021-11-12 07:09:32,820+0000 INFO (MainThread) [storage.monitor] Stop monitoring
186180cb-5cc5-4aa4-868a-9e1ed7965ddf (shutdown=True) (monitor:268)
2021-11-12 07:09:32,831+0000 INFO (monitor/0187cf2) [storage.check] Stop checking
'/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_exported/0187cf2f-2344-48de-a2a0-dd007315399f/dom_md/metadata'
(check:135)
2021-11-12 07:09:32,838+0000 INFO (monitor/a8bab4e) [storage.check] Stop checking
'/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata'
(check:135)
2021-11-12 07:09:32,844+0000 INFO (monitor/15fa3d6) [storage.check] Stop checking
'/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs__he/15fa3d6c-671b-46ef-af9a-00337011fa26/dom_md/metadata'
(check:135)
2021-11-12 07:09:32,864+0000 INFO (monitor/3a729aa) [storage.check] Stop checking
'/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share2/3a729aa1-8e14-4ea0-8794-e3d67fbde542/dom_md/metadata'
(check:135)
2021-11-12 07:09:32,865+0000 INFO (monitor/186180c) [storage.check] Stop checking
'/dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/metadata' (check:135)
2021-11-12 07:09:32,867+0000 INFO (monitor/186180c) [storage.blocksd] Tearing down
domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands]
/usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices
{ preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1
write_cache_state=0 disable_after_error_count=3
filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|",
"r|.*|"] hints="none" obtain_device_list_from_udev=0 } global {
locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1
} backup { retain_min=50 retain_days=0 }' --available n
186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED:
<err> = b' Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n
Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1
open logical volume(s)\n'; <rc> = 5 (commands:186)
2021-11-12 07:09:38,066+0000 WARN (monitor/186180c) [storage.lvm] Command with
specific filter failed or returned no data, retrying with a wider filter: LVM command
failed: 'cmd=[\'/sbin/lvm\', \'vgchange\', \'--config\',
\'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1
write_cache_state=0 disable_after_error_count=3
filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|",
"r|.*|"] hints="none" obtain_device_list_from_udev=0 } global {
locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1
} backup { retain_min=50 retain_days=0 }\', \'--available\', \'n\',
\'186180cb-5cc5-4aa4-868a-9e1ed7965ddf\'] rc=5 out=[] err=[\' Logical volume
186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\', \' Can\\\'t deactivate
volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical
volume(s)\']' (lvm:532)
2021-11-12 07:09:38,067+0000 DEBUG (monitor/186180c) [common.commands]
/usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices
{ preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1
write_cache_state=0 disable_after_error_count=3
filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405d40a7e761def4ed09aac63282$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|^/dev/mapper/36001405ece10ae48e99470295edda1bb$|^/dev/mapper/36001405f269e8e9efb34e5f9170b349b$|^/dev/mapper/36001405f8c5fe6c239e46a0976b842df$|",
"r|.*|"] hints="none" obtain_device_list_from_udev=0 } global {
locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1
} backup { retain_min=50 retain_days=0 }' --available n
186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
What is interesting here is that ioprocess instances are being started
during the shutdown rather than killed, as they are on clean vdsmd
shutdowns.
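Also, the vgchange failure above ("Logical volume .../ids in use")
says something still holds the ids LV open, possibly a sanlock
lockspace that wasn't released before shutdown (just a guess). A quick
check of who keeps the device open (VG name copied from the log above):

    #!/usr/bin/env python3
    # Sketch: list processes holding the master domain's ids LV open
    # and cross-check with sanlock's view of its lockspaces.
    import subprocess

    dev = "/dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids"
    subprocess.run(["fuser", "-v", dev])  # -v: show PID/user/command
    subprocess.run(["sanlock", "client", "status"])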
Any ideas?
Thanks,
Milan