OST: Vdsm: Occasional failures when stopping vdsmd

12 Nov 2021

      Hi,

Michal has observed occasional OST failures in test_vdsm_recovery last
days, which hadn't been seen before.  When `systemctl stop vdsmd' is
called (via Ansible) there, vdsmd (almost?) never finishes its shutdown
within the 10 seconds timeout and gets then killed with SIGKILL.  If
this action is accompanied by "Job for vdsmd.service canceled." message
then the test fails; otherwise OST continues normally.

The situation is reproducible by running OST basic-suite-master and
making it artificially failing after test_vdsm_recovery.  Then running
`systemctl stop vdsmd' manually on the given OST host (can be done
repeatedly, so it provides a good opportunity to examine the problem).

There are two problems there:

- "Job for vdsmd.service canceled." message that sometimes occurs after
  `systemctl stop vdsmd' and then the test fails.  I don't know what it
  means and I can't identify any difference in journal between when the
  message occurs and when it doesn't.

- The fact that Vdsm doesn't stop within the timeout and must be killed.
  This doesn't happen in my normal oVirt installation.  It apparently
  blocks in self.irs.prepareForShutdown() call from clientIF.py.
  Journal says:

    systemd[1]: Stopping Virtual Desktop Server Manager...
    systemd[1]: vdsmd.service: State 'stop-sigterm' timed out. Killing.
    systemd[1]: vdsmd.service: Killing process 132608 (vdsmd) with signal SIGKILL.
    systemd[1]: vdsmd.service: Killing process 133445 (ioprocess) with signal SIGKILL.
    systemd[1]: vdsmd.service: Killing process 133446 (ioprocess) with signal SIGKILL.
    systemd[1]: vdsmd.service: Killing process 133447 (ioprocess) with signal SIGKILL.
    systemd[1]: vdsmd.service: Main process exited, code=killed, status=9/KILL
    systemd[1]: vdsmd.service: Failed with result 'timeout'.
    systemd[1]: Stopped Virtual Desktop Server Manager.

  And vdsm.log (from a different run, sorry):

    2021-11-12 07:09:30,274+0000 INFO  (MainThread) [vdsm.api] START prepareForShutdown() from=internal, task_id=21b12bbd-1d61-4217-b92d-641a53d5f7bb (api:48)
    2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [vds] VM channels listener thread has ended. (vmchannels:214)
    2021-11-12 07:09:30,317+0000 DEBUG (vmchannels) [root] FINISH thread <Thread(vmchannels, stopped daemon 140163095193344)> (concurrent:261)
    2021-11-12 07:09:30,516+0000 DEBUG (mailbox-hsm/4) [root] FINISH thread <Thread(mailbox-hsm/4, stopped daemon 140162799089408)> (concurrent:261)
    2021-11-12 07:09:30,517+0000 INFO  (ioprocess/143197) [IOProcess] (3a729aa1-8e14-4ea0-8794-e3d67fbde542) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,521+0000 INFO  (ioprocess/143199) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs_share1) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,535+0000 INFO  (ioprocess/143193) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs_share2) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,679+0000 INFO  (ioprocess/143195) [IOProcess] (0187cf2f-2344-48de-a2a0-dd007315399f) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,719+0000 INFO  (ioprocess/143192) [IOProcess] (15fa3d6c-671b-46ef-af9a-00337011fa26) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,756+0000 INFO  (ioprocess/143194) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs_exported) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,768+0000 INFO  (ioprocess/143198) [IOProcess] (ost-he-basic-suite-master-storage:_exports_nfs__he) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,774+0000 INFO  (ioprocess/143196) [IOProcess] (a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c) Starting ioprocess (__init__:465)
    2021-11-12 07:09:30,957+0000 DEBUG (mailbox-hsm/2) [root] FINISH thread <Thread(mailbox-hsm/2, stopped daemon 140162815874816)> (concurrent:261)
    2021-11-12 07:09:31,629+0000 INFO  (mailbox-hsm) [storage.mailbox.hsmmailmonitor] HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail (mailbox:500)
    2021-11-12 07:09:31,629+0000 INFO  (mailbox-hsm) [storage.mailbox.hsmmailmonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/f54c6052-437f-11ec-9094-54527d140533/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=2'] (mailbox:382)
    2021-11-12 07:09:32,610+0000 DEBUG (mailbox-hsm/1) [root] FINISH thread <Thread(mailbox-hsm/1, stopped daemon 140162841052928)> (concurrent:261)
    2021-11-12 07:09:32,792+0000 DEBUG (mailbox-hsm/3) [root] FINISH thread <Thread(mailbox-hsm/3, stopped daemon 140162807482112)> (concurrent:261)
    2021-11-12 07:09:32,818+0000 DEBUG (mailbox-hsm/0) [root] FINISH thread <Thread(mailbox-hsm/0, stopped daemon 140162824267520)> (concurrent:261)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Shutting down domain monitors (monitor:243)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 15fa3d6c-671b-46ef-af9a-00337011fa26 (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 3a729aa1-8e14-4ea0-8794-e3d67fbde542 (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 0187cf2f-2344-48de-a2a0-dd007315399f (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,820+0000 INFO  (MainThread) [storage.monitor] Stop monitoring 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (shutdown=True) (monitor:268)
    2021-11-12 07:09:32,831+0000 INFO  (monitor/0187cf2) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_exported/0187cf2f-2344-48de-a2a0-dd007315399f/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,838+0000 INFO  (monitor/a8bab4e) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share1/a8bab4ef-2952-4c42-ba44-dbb3e1b8c87c/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,844+0000 INFO  (monitor/15fa3d6) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs__he/15fa3d6c-671b-46ef-af9a-00337011fa26/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,864+0000 INFO  (monitor/3a729aa) [storage.check] Stop checking '/rhev/data-center/mnt/ost-he-basic-suite-master-storage:_exports_nfs_share2/3a729aa1-8e14-4ea0-8794-e3d67fbde542/dom_md/metadata' (check:135)
    2021-11-12 07:09:32,865+0000 INFO  (monitor/186180c) [storage.check] Stop checking '/dev/186180cb-5cc5-4aa4-868a-9e1ed7965ddf/metadata' (check:135)
    2021-11-12 07:09:32,867+0000 INFO  (monitor/186180c) [storage.blocksd] Tearing down domain 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (blockSD:996)
    2021-11-12 07:09:32,867+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)
    2021-11-12 07:09:38,063+0000 DEBUG (monitor/186180c) [common.commands] FAILED: <err> = b'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\n  Can\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\n'; <rc> = 5 (commands:186)
    2021-11-12 07:09:38,066+0000 WARN  (monitor/186180c) [storage.lvm] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'vgchange\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--available\', \'n\', \'186180cb-5cc5-4aa4-868a-9e1ed7965ddf\'] rc=5 out=[] err=[\'  Logical volume 186180cb-5cc5-4aa4-868a-9e1ed7965ddf/ids in use.\', \'  Can\\\'t deactivate volume group "186180cb-5cc5-4aa4-868a-9e1ed7965ddf" with 1 open logical volume(s)\']' (lvm:532)
    2021-11-12 07:09:38,067+0000 DEBUG (monitor/186180c) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/sudo -n /sbin/lvm vgchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405423a0b45cef54e6e9fd0c7df4$|^/dev/mapper/36001405d40a7e761def4ed09aac63282$|^/dev/mapper/36001405e9302e3f55e74aedbf352b79f$|^/dev/mapper/36001405ece10ae48e99470295edda1bb$|^/dev/mapper/36001405f269e8e9efb34e5f9170b349b$|^/dev/mapper/36001405f8c5fe6c239e46a0976b842df$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --available n 186180cb-5cc5-4aa4-868a-9e1ed7965ddf (cwd None) (commands:154)

  What is interesting here is that ioprocess is started here rather than
  being killed as on clean vdsmd shutdowns.

Any ideas?

Thanks,
Milan

Milan Zamazal

Michal Skrivanek

Nir Soffer

Milan Zamazal

Nir Soffer

Milan Zamazal

tags

participants (3)