Hi All,
Over the past few days, hosts have been becoming unmanageable because multipathd falsely
marks paths as failed and VDSM then crashes.
We were running version 4.4.1, and upgrading to 4.4.5 (engine) and 4.4.6 (nodes) seems to
have resolved the VDSM crashes.
However, the multipath events continue.
From what we have observed, the events start around the time the host reports low swap
space. The low swap space seems to be related to a Commvault backup operation running.
Running top on the host during a backup operation, I can see swap being consumed up to
100% even though plenty of RAM is available.
Once the multipath events start, the only way to stop them is to reboot the host.
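For anyone who wants to capture the same observation without watching top, here is a minimal Python sketch that reads /proc/meminfo and reports swap usage next to available RAM. The function names (parse_meminfo, swap_summary) are only illustrative, and the script assumes a Linux kernel recent enough to expose the MemAvailable field.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value_in_kB} dict."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_summary(info):
    """Summarise swap usage and available RAM (names are illustrative)."""
    total = info.get("SwapTotal", 0)
    free = info.get("SwapFree", 0)
    used_pct = 100.0 * (total - free) / total if total else 0.0
    return {
        "swap_free_mb": free // 1024,
        "swap_used_pct": used_pct,
        "mem_available_mb": info.get("MemAvailable", 0) // 1024,
    }


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            summary = swap_summary(parse_meminfo(f.read()))
        # vm.swappiness influences how eagerly the kernel swaps out pages.
        with open("/proc/sys/vm/swappiness") as f:
            summary["swappiness"] = int(f.read())
        print(summary)
    except OSError as exc:
        print("could not read /proc:", exc)
```

Running this periodically during a backup window would show whether swap fills up while mem_available_mb stays high, which is the pattern described above.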
This is the warning in the oVirt UI:
Apr 24, 2021, 9:14:07 PM - Available swap memory of host Ovirt-Node2 [953 MB] is under defined threshold [1024 MB].
This is the first appearance of the multipath event in /var/log/messages:
Apr 24 21:14:58 ovirt-node2 kernel: sd 21:0:0:9: [sdgi] tag#77 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=15s
Apr 24 21:14:58 ovirt-node2 kernel: sd 21:0:0:9: [sdgi] tag#77 Sense Key : Aborted Command [current]
Apr 24 21:14:58 ovirt-node2 kernel: sd 21:0:0:9: [sdgi] tag#77 <<vendor>>ASC=0xc1 ASCQ=0x1
Apr 24 21:14:58 ovirt-node2 kernel: sd 21:0:0:9: [sdgi] tag#77 CDB: Read(16) 88 00 00 00 00 00 79 ac 75 a0 00 00 06 00 00 00
Apr 24 21:14:58 ovirt-node2 kernel: blk_update_request: I/O error, dev sdgi, sector 2041345440 op 0x0:(READ) flags 0x4200 phys_seg 192 prio class 0
Apr 24 21:14:58 ovirt-node2 kernel: device-mapper: multipath: 253:20: Failing path 131:224.
Apr 24 21:14:58 ovirt-node2 multipathd[3044]: sdgi: mark as failed
Apr 24 21:14:58 ovirt-node2 multipathd[3044]: 3600000e00d2c0000002cb4a8000b0000: remaining active paths: 11
Apr 24 21:15:03 ovirt-node2 multipathd[3044]: 3600000e00d2c0000002cb4a8000b0000: sdgi - tur checker reports path is up
Apr 24 21:15:03 ovirt-node2 multipathd[3044]: 131:224: reinstated
Apr 24 21:15:03 ovirt-node2 multipathd[3044]: 3600000e00d2c0000002cb4a8000b0000: remaining active paths: 12
Apr 24 21:15:03 ovirt-node2 kernel: device-mapper: multipath: 253:20: Reinstating path 131:224.
Apr 24 21:15:03 ovirt-node2 kernel: sd 21:0:0:9: alua: port group 8091 state A preferred supports toluSNA
Apr 24 21:15:03 ovirt-node2 kernel: sd 21:0:0:9: alua: port group 8091 state A preferred supports toluSNA
Apr 24 21:15:13 ovirt-node2 kernel: sd 13:0:0:9: [sdau] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=15s
Apr 24 21:15:13 ovirt-node2 kernel: sd 13:0:0:9: [sdau] tag#25 Sense Key : Aborted Command [current]
Apr 24 21:15:13 ovirt-node2 kernel: sd 13:0:0:9: [sdau] tag#25 <<vendor>>ASC=0xc1 ASCQ=0x1
Apr 24 21:15:13 ovirt-node2 kernel: sd 13:0:0:9: [sdau] tag#25 CDB: Read(16) 88 00 00 00 00 00 79 ac 75 a0 00 00 06 00 00 00
Apr 24 21:15:13 ovirt-node2 kernel: blk_update_request: I/O error, dev sdau, sector 2041345440 op 0x0:(READ) flags 0x4200 phys_seg 192 prio class 0
Apr 24 21:15:13 ovirt-node2 kernel: device-mapper: multipath: 253:20: Failing path 66:224.
Apr 24 21:15:13 ovirt-node2 multipathd[3044]: sdau: mark as failed
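To make a longer stretch of /var/log/messages easier to read, the excerpt above could be summarised with a rough Python sketch like the one below. It is only an illustration: the regular expressions are based on the kernel messages quoted in this post, the function name summarize is made up, and it expects each log entry on a single line as in the real log file.

```python
import re
import sys
from collections import Counter

# Patterns modelled on the kernel messages quoted in this post.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")
FAILING = re.compile(r"Failing path (\d+:\d+)")
REINSTATING = re.compile(r"Reinstating path (\d+:\d+)")


def summarize(lines):
    """Count failed/reinstated paths and group I/O errors by sector."""
    failed, reinstated = Counter(), Counter()
    sectors = {}  # sector number -> set of sd devices that reported it
    for line in lines:
        m = IO_ERROR.search(line)
        if m:
            sectors.setdefault(int(m.group(2)), set()).add(m.group(1))
        m = FAILING.search(line)
        if m:
            failed[m.group(1)] += 1
        m = REINSTATING.search(line)
        if m:
            reinstated[m.group(1)] += 1
    return failed, reinstated, sectors


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 this_script.py /var/log/messages
    with open(sys.argv[1], errors="replace") as f:
        failed, reinstated, sectors = summarize(f)
    print("failed paths:    ", dict(failed))
    print("reinstated paths:", dict(reinstated))
    for sector, devs in sorted(sectors.items()):
        print("sector", sector, "reported by", sorted(devs))
```

In the excerpt above, both I/O errors (on sdgi and sdau) are for the same sector, 2041345440, which matches the LBA bytes 79 ac 75 a0 in both Read(16) CDBs. Grouping by sector makes it visible when the same read is being aborted on one path after another.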
The underlying storage is a Fujitsu DX200 S5 with all-SSD drives.
Each host has two 10 Gbit network adapters dedicated to iSCSI.
Any help with this would be highly appreciated.
Thanks,
Gal Villaret