ECC RAM everywhere: hosts and storage.
I even ran Memtest86 on both hypervisor hosts just to be sure. No errors. I haven’t had the
opportunity to run it on the storage yet.
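(For the record, a quick way to check whether the hosts have logged any ECC events without another full Memtest pass — just a sketch, assuming the EDAC drivers are loaded and, for the second command, that rasdaemon is installed:)

# Corrected/uncorrected ECC error counters exposed by the kernel EDAC subsystem
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
# Or, with rasdaemon installed:
ras-mc-ctl --error-count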
After I sent that message yesterday, the engine VM crashed again and the filesystem went
offline. There were some discards (again) on the switch, probably due to the “boot storm”
of the other VMs. But this time a simple reboot fixed the filesystem and the hosted engine VM
was back up.
Since very little time had passed, I checked everything again, and only the discard issue
came up: there are ~90k discards on Po2 (which is the LACP interface of the hypervisor).
Since then, I’ve enabled hardware flow control on the switch ports, but discards are still
happening:
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
Po1 0 0 0 0 0 0
Po2 0 0 0 0 0 3650
Po3 0 0 0 0 0 0
Po4 0 0 0 0 0 0
Po5 0 0 0 0 0 0
Po6 0 0 0 0 0 0
Po7 0 0 0 0 0 0
Po20 0 0 0 0 0 13788
I think this may be related… but it’s just a guess.
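In case it helps to correlate, here is a sketch of how the same thing can be cross-checked from the hypervisor side (interface names are illustrative; the actual bond and member NICs may be named differently):

# Drop counters as seen by the bond and its member NICs
ip -s link show bond0
ethtool -S eno1 | grep -Ei 'drop|discard|pause'
# Whether the NIC actually negotiated pause frames (flow control) with the switch
ethtool -a eno1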
Thanks,
On 1 Dec 2020, at 05:06, Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
Could it be faulty ram ?
Do you use ECC ram ?
Best Regards,
Strahil Nikolov
On Tuesday, 1 December 2020, 06:17:10 GMT+2, Vinícius Ferrão via Users <users(a)ovirt.org> wrote:
Hi again,
I had to shut down everything because of a power outage in the office. When trying to get
the infra up again, even the Engine was corrupted:
[ 772.466982] XFS (dm-4): Invalid superblock magic number
mount: /var: wrong fs type, bad option, bad superblock on /dev/mapper/ovirt-var, missing codepage or helper program, or other error.
[ 772.472885] XFS (dm-3): Mounting V5 Filesystem
[ 773.629700] XFS (dm-3): Starting recovery (logdev: internal)
[ 773.731104] XFS (dm-3): Metadata CRC error detected at xfs_agfl_read_verify+0xa1/0xf0 [xfs], xfs_agfl block 0xf00003
[ 773.734352] XFS (dm-3): Unmount and run xfs_repair
[ 773.736216] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
[ 773.738458] 00000000: 23 31 31 35 36 35 35 34 29 00 2d 20 52 65 62 75  #1156554).- Rebu
[ 773.741044] 00000010: 69 6c 74 20 66 6f 72 20 68 74 74 70 73 3a 2f 2f  ilt for https://
[ 773.743636] 00000020: 66 65 64 6f 72 61 70 72 6f 6a 65 63 74 2e 6f 72  fedoraproject.or
[ 773.746191] 00000030: 67 2f 77 69 6b 69 2f 46 65 64 6f 72 61 5f 32 33  g/wiki/Fedora_23
[ 773.748818] 00000040: 5f 4d 61 73 73 5f 52 65 62 75 69 6c 64 00 2d 20  _Mass_Rebuild.-
[ 773.751399] 00000050: 44 72 6f 70 20 6f 62 73 6f 6c 65 74 65 20 64 65  Drop obsolete de
[ 773.753933] 00000060: 66 61 74 74 72 20 73 74 61 6e 7a 61 73 20 28 23  fattr stanzas (#
[ 773.756428] 00000070: 31 30 34 37 30 33 31 29 00 2d 20 49 6e 73 74 61  1047031).- Insta
[ 773.758873] XFS (dm-3): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0xf00003 len 1 error 74
[ 773.763756] XFS (dm-3): xfs_do_force_shutdown(0x8) called from line 446 of file fs/xfs/libxfs/xfs_defer.c. Return address = 00000000962bd5ee
[ 773.769363] XFS (dm-3): Corruption of in-memory data detected. Shutting down filesystem
[ 773.772643] XFS (dm-3): Please unmount the filesystem and rectify the problem(s)
[ 773.776079] XFS (dm-3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[ 773.779113] XFS (dm-3): xlog_recover_clear_agi_bucket: failed to clear agi 3. Continuing.
[ 773.783039] XFS (dm-3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[ 773.785698] XFS (dm-3): xlog_recover_clear_agi_bucket: failed to clear agi 3. Continuing.
[ 773.790023] XFS (dm-3): Ending recovery (logdev: internal)
[ 773.792489] XFS (dm-3): Error -5 recovering leftover CoW allocations.
mount: /var/log: can't read superblock on /dev/mapper/ovirt-log.
mount: /var/log/audit: mount point does not exist.
/var seems to be completely trashed.
The only time I’ve seen something like this was with faulty hardware. But nothing shows
up in the logs, as far as I know.
After forcing repairs with xfs_repair -L, I got other issues:
mount -a
[ 326.170941] XFS (dm-4): Mounting V5 Filesystem
[ 326.404788] XFS (dm-4): Ending clean mount
[ 326.415291] XFS (dm-3): Mounting V5 Filesystem
[ 326.611673] XFS (dm-3): Ending clean mount
[ 326.621705] XFS (dm-2): Mounting V5 Filesystem
[ 326.784067] XFS (dm-2): Starting recovery (logdev: internal)
[ 326.792083] XFS (dm-2): Metadata CRC error detected at xfs_agi_read_verify+0xc7/0xf0 [xfs], xfs_agi block 0x2
[ 326.794445] XFS (dm-2): Unmount and run xfs_repair
[ 326.795557] XFS (dm-2): First 128 bytes of corrupted metadata buffer:
[ 326.797055] 00000000: 4d 33 44 34 39 56 00 00 80 00 00 00 f0 cf 00 00  M3D49V..........
[ 326.799685] 00000010: 00 00 00 00 02 00 00 00 23 10 00 00 3d 08 01 08  ........#...=...
[ 326.802290] 00000020: 21 27 44 34 39 56 00 00 00 d0 00 00 01 00 00 00  !'D49V..........
[ 326.804748] 00000030: 50 00 00 00 00 00 00 00 23 10 00 00 41 01 08 08  P.......#...A...
[ 326.807296] 00000040: 21 27 44 34 39 56 00 00 10 d0 00 00 02 00 00 00  !'D49V..........
[ 326.809883] 00000050: 60 00 00 00 00 00 00 00 23 10 00 00 41 01 08 08  `.......#...A...
[ 326.812345] 00000060: 61 2f 44 34 39 56 00 00 00 00 00 00 00 00 00 00  a/D49V..........
[ 326.814831] 00000070: 50 34 00 00 00 00 00 00 23 10 00 00 82 08 08 04  P4......#.......
[ 326.817237] XFS (dm-2): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x2 len 1 error 74
mount: /var/log/audit: mount(2) system call failed: Structure needs cleaning.
But after a few more xfs_repair -L runs, the engine is up…
Now I need to scavenge other VMs and do the same thing.
That’s it.
Thanks all,
V.
PS: For those interested, there’s a paste of the fixes:
https://pastebin.com/jsMguw6j
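In essence, each damaged filesystem went through something like the following (a generic sketch, not the exact transcript; the real devices and output are in the paste above):

umount /var                           # make sure the damaged filesystem is not mounted
xfs_repair /dev/mapper/ovirt-var      # try a normal repair first
xfs_repair -L /dev/mapper/ovirt-var   # only if the log is unreadable: zero it (last resort, may lose the latest metadata updates)
mount -a                              # remount everything and watch dmesg again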
>
> On 29 Nov 2020, at 17:03, Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
>
>
>
> Damn...
>
> You are using EFI boot. Does this happen only to EFI machines ?
> Did you notice if only EL 8 is affected ?
>
> Best Regards,
> Strahil Nikolov
>
>
>
>
>
>
> On Sunday, 29 November 2020, 19:36:09 GMT+2, Vinícius Ferrão <ferrao(a)versatushpc.com.br> wrote:
>
>
>
>
>
> Yes!
>
> I have a live VM right now that will be dead on a reboot:
>
> [root@kontainerscomk ~]# cat /etc/*release
> NAME="Red Hat Enterprise Linux"
> VERSION="8.3 (Ootpa)"
> ID="rhel"
> ID_LIKE="fedora"
> VERSION_ID="8.3"
> PLATFORM_ID="platform:el8"
> PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
> HOME_URL="https://www.redhat.com/"
> BUG_REPORT_URL="https://bugzilla.redhat.com/"
>
> REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
> REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
> REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="8.3"
> Red Hat Enterprise Linux release 8.3 (Ootpa)
> Red Hat Enterprise Linux release 8.3 (Ootpa)
>
> [root@kontainerscomk ~]# sysctl -a | grep dirty
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 30
> vm.dirty_writeback_centisecs = 500
> vm.dirtytime_expire_seconds = 43200
>
> [root@kontainerscomk ~]# xfs_db -r /dev/dm-0
> xfs_db: /dev/dm-0 is not a valid XFS filesystem (unexpected SB magic number 0xa82a0000)
> Use -F to force a read attempt.
> [root@kontainerscomk ~]# xfs_db -r /dev/dm-0 -F
> xfs_db: /dev/dm-0 is not a valid XFS filesystem (unexpected SB magic number 0xa82a0000)
> xfs_db: size check failed
> xfs_db: V1 inodes unsupported. Please try an older xfsprogs.
>
> [root@kontainerscomk ~]# cat /etc/fstab
> #
> # /etc/fstab
> # Created by anaconda on Thu Nov 19 22:40:39 2020
> #
> # Accessible filesystems, by reference, are maintained under '/dev/disk/'.
> # See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
> #
> # After editing this file, run 'systemctl daemon-reload' to update systemd
> # units generated from this file.
> #
> /dev/mapper/rhel-root / xfs defaults 0 0
> UUID=ad84d1ea-c9cc-4b22-8338-d1a6b2c7d27e /boot xfs defaults 0 0
> UUID=4642-2FF6 /boot/efi vfat umask=0077,shortname=winnt 0 2
> /dev/mapper/rhel-swap none swap defaults 0 0
>
> Thanks,
>
>
> -----Original Message-----
> From: Strahil Nikolov <hunter86_bg(a)yahoo.com>
> Sent: Sunday, November 29, 2020 2:33 PM
> To: Vinícius Ferrão <ferrao(a)versatushpc.com.br>
> Cc: users <users(a)ovirt.org>
> Subject: Re: [ovirt-users] Re: Constantly XFS in memory corruption inside VMs
>
> Can you check the output on the VM that was affected:
> # cat /etc/*release
> # sysctl -a | grep dirty
>
>
> Best Regards,
> Strahil Nikolov
>
>
>
>
>
> On Sunday, 29 November 2020, 19:07:48 GMT+2, Vinícius Ferrão via Users <users(a)ovirt.org> wrote:
>
>
>
>
>
> Hi Strahil.
>
> I’m not using barrier options on mount. These are the default settings from the CentOS install.
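> (A quick way to confirm that from inside the VM — just a sketch; barriers are on unless "nobarrier" shows up here:)
> # findmnt -no OPTIONS /
> # grep xfs /proc/mounts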
>
> I have some additional findings: there’s a large number of discarded packets on the switch on the hypervisor interfaces.
>
> Discards should be OK as far as I know; I hope TCP handles this and does the proper retransmissions, but I wonder whether this may be related or not. Our storage is over NFS. My general expertise is with iSCSI, and I’ve never seen this kind of issue with iSCSI, not that I’m aware of.
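> (To see whether the NFS client is actually retransmitting, something like this on the hosts should show it — a sketch, assuming the nfs-utils tools are available; the path is the storage domain mount:)
> # nfsstat -rc
> # mountstats /rhev/data-center/mnt/192.168.10.14:_mnt_pool0_ovirt_vm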
>
> In other clusters, I’ve seen a high number of discards with iSCSI on XenServer 7.2, but there’s no corruption on the VMs there...
>
> Thanks,
>
> Sent from my iPhone
>
>
>> On 29 Nov 2020, at 04:00, Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
>>
>> Are you using "nobarrier" mount options in the VM ?
>>
>> If yes, can you try to remove the "nobarrier" option.
>>
>>
>> Best Regards,
>> Strahil Nikolov
>>
>>
>>
>>
>>
>>
>> On Saturday, 28 November 2020, 19:25:48 GMT+2, Vinícius Ferrão <ferrao(a)versatushpc.com.br> wrote:
>>
>>
>>
>>
>>
>> Hi Strahil,
>>
>> I moved a running VM to another host, rebooted it, and no corruption was found. If there’s any corruption, it may be silent corruption... I’ve had cases where the VM was new, just installed; I ran dnf -y update to get the updated packages, rebooted, and boom, XFS corruption. So perhaps the migration process isn’t the one to blame.
>>
>> But, in fact, I remember one case when moving a VM where it went down during the process, and when I rebooted it, it was corrupted. But this may not be related; it was perhaps already in an inconsistent state.
>>
>> Anyway, here's the mount options:
>>
>> Host1:
>> 192.168.10.14:/mnt/pool0/ovirt/vm on /rhev/data-center/mnt/192.168.10.14:_mnt_pool0_ovirt_vm type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,nosharecache,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.10.1,local_lock=none,addr=192.168.10.14)
>>
>> Host2:
>> 192.168.10.14:/mnt/pool0/ovirt/vm on /rhev/data-center/mnt/192.168.10.14:_mnt_pool0_ovirt_vm type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,nosharecache,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.10.1,local_lock=none,addr=192.168.10.14)
>>
>> The options are the default ones. I didn’t change anything when configuring this cluster.
>>
>> Thanks.
>>
>>
>>
>> -----Original Message-----
>> From: Strahil Nikolov <hunter86_bg(a)yahoo.com>
>> Sent: Saturday, November 28, 2020 1:54 PM
>> To: users <users(a)ovirt.org>; Vinícius Ferrão
>> <ferrao(a)versatushpc.com.br>
>> Subject: Re: [ovirt-users] Constantly XFS in memory corruption inside
>> VMs
>>
>> Can you check with a test VM whether this happens after a Virtual Machine migration?
>>
>> What are your mount options for the storage domain ?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>>
>>
>>
>>
>>
>> On Saturday, 28 November 2020, 18:25:15 GMT+2, Vinícius Ferrão via Users <users(a)ovirt.org> wrote:
>>
>>
>>
>>
>>
>>
>>
>>
>> Hello,
>>
>>
>>
>> I’m trying to discover why an oVirt 4.4.3 cluster with two hosts and NFS shared storage on TrueNAS 12.0 is constantly getting XFS corruption inside the VMs.
>>
>>
>>
>> For random reasons, VMs get corrupted, sometimes halting or just being silently corrupted, and after a reboot the system is unable to boot due to “corruption of in-memory data detected”. Sometimes the corrupted data is “all zeroes”, sometimes there’s data there. In extreme cases, XFS superblock 0 gets corrupted and the system cannot even detect an XFS partition anymore, since the XFS magic number is corrupted on the first blocks of the virtual disk.
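>> (For reference, this is roughly how the superblock can be inspected from inside a VM — a sketch, the device name is illustrative; a healthy filesystem prints magicnum = 0x58465342, i.e. “XFSB”:)
>> # xfs_db -r -c 'sb 0' -c 'p magicnum' /dev/vda2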
>>
>>
>>
>> This has been happening for a month now. We had to roll back to some backups, and I don’t trust the state of the VMs anymore.
>>
>>
>>
>> Using xfs_db I can see that some VMs have corrupted superblocks while the VM is up. One in particular had sb0 corrupted, so I knew that when a reboot kicked in the machine would be gone, and that’s exactly what happened.
>>
>>
>>
>> Another day I was installing a new CentOS 8 VM for unrelated reasons, and after running dnf -y update and a reboot the VM was corrupted and needed an XFS repair. That was an extreme case.
>>
>>
>>
>> So, I’ve looked at the TrueNAS logs, and there’s apparently nothing wrong with the system. No errors logged in dmesg, nothing in /var/log/messages, and no errors on the “zpools”, not even after scrub operations. On the switch, a Catalyst 2960X, we’ve been monitoring all of its interfaces. There are no “up and down” events and zero errors on all interfaces (we have a 4-port LACP on the TrueNAS side and a 2-port LACP on each host); everything seems to be fine. The only metric that I was unable to get is “dropped packets”, but I don’t know whether this can be an issue or not.
>>
>>
>>
>> Finally, on oVirt, I can’t find anything either. I looked at /var/log/messages and /var/log/sanlock.log, but I found nothing suspicious.
>>
>>
>>
>> Is anyone out there experiencing this? Our VMs are mainly CentOS 7/8 with XFS; there are 3 Windows VMs that do not seem to be affected, but everything else is affected.
>>
>>
>>
>> Thanks all.
>>
>>
>>
>
_______________________________________________
Users mailing list -- users(a)ovirt.org
To unsubscribe send an email to users-leave(a)ovirt.org
Privacy Statement:
https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct:
https://www.ovirt.org/community/about/community-guidelines/
List Archives:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/CFOSOGER2VD...