On Thu, Dec 28, 2017 at 6:02 PM Dan Kenigsberg <danken@redhat.com> wrote:
On Thu, Dec 28, 2017 at 5:36 PM, Yaniv Kaul <ykaul@redhat.com> wrote:
> Just saw this on[1]:
> 2017-12-28 10:09:24,980-0500 INFO  (MainThread) [vdsm.api] FINISH
> prepareForShutdown return=None from=internal,
> task_id=1a83a4d3-43e3-4833-883b-dfd2fe4d76be (api:52)
> 2017-12-28 10:09:24,980-0500 INFO  (MainThread) [vds] Stopping threads
> (vdsmd:159)
> 2017-12-28 10:09:24,980-0500 INFO  (MainThread) [vds] Exiting (vdsmd:170)
> 2017-12-28 10:09:25,011-0500 INFO  (mailbox-hsm)
> [storage.MailBox.HsmMailMonitor] HSM_MailboxMonitor - Incoming mail
> monitoring thread stopped, clearing outgoing mail (mailbox:511)

This seems to be the issue, the monitor thread is trying to access storage
*after* it was stopped. Relying on being able to access storage to clean up
things during shutdown is never going to work. 
 
> 2017-12-28 10:09:25,011-0500 INFO  (mailbox-hsm)
> [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM -
> ['/usr/bin/dd',
> 'of=/rhev/data-center/9b97451a-0312-4c69-85e2-f86d6d273636/mastersd/dom_md/inbox',
> 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1',
> 'seek=1'] (mailbox:387)
> 2017-12-28 10:09:25,012-0500 ERROR (mailbox-hsm)
> [storage.MailBox.HsmMailMonitor] FINISH thread <Thread(mailbox-hsm, started
> daemon 139968009594624)> failed (concurrent:201)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line
> 194, in run
>     ret = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/mailbox.py", line 514,
> in _run
>     self._sendMail()  # Clear outgoing mailbox
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/mailbox.py", line 394,
> in _sendMail
>     _mboxExecCmd(self._outCmd, data=self._outgoingMail)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/mailbox.py", line 84,
> in _mboxExecCmd
>     return misc.execCmd(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/common/commands.py", line 53,
> in execCmd
>     command = cmdutils.wrap_command(command, with_ioclass=ioclass,
> AttributeError: 'NoneType' object has no attribute 'wrap_command'
> 2017-12-28 10:09:28,965-0500 INFO  (MainThread) [vds] (PID: 23232) I am the
> actual vdsm 4.20.9-84.git615770f.el7.centos lago-basic-suite-master-host-0
> (3.10.0-693.2.2.el7.x86_64) (vdsmd:148)
> 2017-12-28 10:09:28,966-0500 INFO  (MainThread) [vds] VDSM will run with cpu
> affinity: frozenset([1]) (vdsmd:254)
> 2017-12-28 10:09:28,970-0500 INFO  (MainThread) [storage.HSM] START HSM init
> (hsm:366)
>
>
> [1]
> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/3113/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-005_network_by_label.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log


It's quite ugly, but I think that it is harmless.
prepareForShutdown calls self._pool.hsmMailer.stop() but never waits
for it to actually finish. It then exists and Python starts freeing
its modules, such as cmdutils. When hsmMailer wakes up, cmdutils is
already None.

The fix could be as simple as https://gerrit.ovirt.org/85810 but I'm
not sure if it is worth the slowdown in vdsm shutdown.

We can wait for the mailer, but only after we stopped all other threads.

Nir