On Mon, Feb 14, 2022 at 10:51 AM Petr Kyselák <kissi777(a)gmail.com> wrote:
Hi,
I see a lot of errors in vdsm.log:
2022-02-14 08:42:52,086+0100 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox
65 checksum failed, not clearing mailbox, clearing new mail (data=b'\xff\xff\xff\xff\
<lot of data> \x00\x00', checksum=<function checksum at 0x7f2454712b70>,
expected=b'\xbfG\x00\x00') (mailbox:602)
2022-02-14 08:42:52,087+0100 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] mailbox
66 checksum failed, not clearing mailbox, clearing new mail (data=b'\x00\x00\x00\x00\
<lot of data> \xff\xff', checksum=<function checksum at 0x7f2454712b70>,
expected=b'\x04\xf0\x0b\x00') (mailbox:602)
This can be a real checksum error, meaning a random failure on storage,
but it is more likely a race in oVirt itself. We had a lot of these in the past,
and I think we fixed them, but it is possible that more remain because of the
way this code works.
We are running the latest oVirt engine and hosts:
Hosts: ovirt-node-ng-installer-4.4.10-2022020214.el8.iso
engine: ovirt-engine-4.4.10.6-1.el8.noarch
We have 3 hosts and 8 iSCSI domains. I found a similar issue from 2018:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/FJ6KIEOXEEFF...
I am not sure how to determine which mailbox I should try to "clean". Can
anybody help me, please?
You don't need to do anything; the mailbox was already cleaned up.
This message means that the SPM found a bad checksum and dropped the
messages in the mailbox.
Processes that sent mail to the SPM will resend the dropped mail in 2-3 seconds,
so the issue should recover automatically.
I would monitor your logs to check whether this is a common issue or a one-time
incident. If this error keeps repeating, please file a vdsm bug and attach the
complete log since this host was started.
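
One way to check how often this happens is to count the "checksum failed"
errors per hour and per mailbox. The following is only a rough sketch, not part
of vdsm: the function names are made up for illustration, and the pattern
assumes each error is on a single line in the log, in the same format as the
messages quoted above.

```python
import re
from collections import Counter

# Matches lines such as:
# 2022-02-14 08:42:52,086+0100 ERROR (mailbox-spm)
#   [storage.MailBox.SpmMailMonitor] mailbox 65 checksum failed, ...
# (assumed to be a single line in the real log file)
PATTERN = re.compile(
    r"^(?P<hour>\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d+[+-]\d{4} ERROR "
    r".*\[storage\.MailBox\.SpmMailMonitor\] "
    r"mailbox (?P<mailbox>\d+) checksum failed"
)


def count_checksum_failures(lines):
    """Count checksum failures keyed by (date and hour, mailbox id)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            key = (match.group("hour"), int(match.group("mailbox")))
            counts[key] += 1
    return counts


def format_report(counts):
    """Return one summary line per (hour, mailbox), sorted by time."""
    return [
        f"{hour}:00  mailbox {mailbox}: {n}"
        for (hour, mailbox), n in sorted(counts.items())
    ]


def report_for_file(path):
    """Read a log file and return the summary lines for it."""
    with open(path, errors="replace") as log:
        return format_report(count_checksum_failures(log))
```

For example, print("\n".join(report_for_file("/var/log/vdsm/vdsm.log"))) on
each host (adjust the path if your logs are elsewhere, and include rotated logs
if needed). A few isolated hits suggest a one-time incident; the same mailbox
showing up hour after hour would be worth a bug report.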
Nir