On Mon, Nov 4, 2019 at 9:18 PM Albl, Oliver <Oliver.Albl(a)fabasoft.com> wrote:
Hi all,
I run an oVirt 4.3.6.7-1.el7 installation (50+ hosts, 40+ FC storage domains on two
all-flash arrays) and experienced a problem accessing single storage domains.
What was the last change in the system? upgrade? network change? storage change?
As a result, hosts were set to “Non Operational” because they could
not see all storage domains, and the SPM started to move around the hosts.
This is expected if some domain is not accessible on all hosts.
oVirt messages start with:
2019-11-04 15:10:22.739+01 | VDSM HOST082 command SpmStatusVDS failed: (-202,
'Sanlock resource read failure', 'IO timeout')
This means sanlock timed out renewing the lockspace
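To confirm this on an affected host, you can ask sanlock directly and check its log;
a quick sketch (default el7 log path assumed, message format may vary by version):

    # show lockspaces and resources sanlock currently holds
    sanlock client status

    # look for renewal errors around the failure time
    grep 'renewal error' /var/log/sanlock.log | tail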
2019-11-04 15:13:58.836+01 | Host HOST017 cannot access the Storage
Domain(s) HOST_LUN_204 attached to the Data Center <name>. Setting Host state to
Non-Operational.
If a host cannot access all storage domains in the DC, the system sets
it to Non-Operational and will probably try to reconnect it later.
2019-11-04 15:15:14.145+01 | Storage domain HOST_LUN_221 experienced
a high latency of 9.60953 seconds from host HOST038. This may cause performance and
functional issues. Please consult your Storage Administrator.
This means reading 4k from the start of the metadata lv took 9.6 seconds.
Something in the path to storage is bad (kernel, network, storage).
The problem mainly affected two storage domains (on the same array),
but I also saw single messages for other storage domains (on the other array as well).
Storage domains stayed available to the hosts, all VMs continued to run.
We have a 20-second grace period (4 retries, 5 seconds per retry) in multipath
when there are no active paths before I/O fails, pausing the VM. We also resume
paused VMs when storage monitoring works again, so the VMs may have been
paused and resumed.
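These numbers come from the multipath configuration vdsm ships; a minimal sketch
of the relevant defaults (the values here just restate the 4 retries x 5 seconds
above, check the actual settings in your /etc/multipath.conf):

    defaults {
        # the path checker runs every 5 seconds
        polling_interval    5
        # queue I/O for 4 checker intervals (about 20s) when no path
        # is up, then fail it, which pauses the VM
        no_path_retry       4
    }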
However, for storage monitoring we have a strict 10-second timeout. If
reading from the metadata lv times out or fails, and monitoring does not
return to normal within 5 minutes, the domain will become inactive.
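You can roughly reproduce what the monitor does from the shell (the domain
UUID path is a placeholder; the 10s matches the monitoring timeout):

    # read 4k from the metadata lv with the same 10-second budget
    timeout 10 dd if=/dev/<domain-uuid>/metadata iflag=direct bs=4096 count=1 of=/dev/null
    # exit code 124 means the read timed out
    echo $?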
When constantly reading from the storage domains (/bin/dd
iflag=direct if=<metadata> bs=4096 count=1 of=/dev/null) we got the expected 20+
MB/s on all but a few storage domains. One of them showed “transfer rates” around 200
Bytes/s, but went up to normal performance from time to time. The transfer rate to
this domain differed between the hosts.
This can explain the read timeouts.
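To see how the latency varies over time on one host, a simple sampling loop
around the same dd can help (placeholder path again):

    # sample the metadata read once per second; slow reads stand out
    while true; do
        dd if=/dev/<domain-uuid>/metadata iflag=direct bs=4096 count=1 of=/dev/null 2>&1 | grep copied
        sleep 1
    done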
/var/log/messages contains qla2xxx abort messages on almost all hosts.
There are no errors on the SAN switches or the storage array (but the vendor is still investigating).
I did not see high load on the storage array.
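To correlate the aborts with the latency spikes, it may help to pull them out
with their timestamps (default el7 log path assumed):

    # list recent qla2xxx abort events with timestamps
    grep -i 'qla2xxx.*abort' /var/log/messages | tail -20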
The system seemed to stabilize when I stopped all VMs on the affected storage domain and
this storage domain became “inactive”.
This looks like the right way to troubleshoot this.
Currently, this storage domain is still inactive, and we can neither place
it in maintenance mode (“Failed to deactivate Storage Domain”) nor activate it.
We need vdsm logs to understand this failure.
OVF Metadata seems to be corrupt as well (failed to update OVF disks
<id>, OVF data isn't updated on those OVF stores).
This does not mean the OVF is corrupted, only that we could not store new
data. The older data on the other OVFSTORE disk is probably fine.
Hopefully the system will not try to write to the other OVFSTORE disk,
overwriting the last good version.
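If you want to inspect the last good copy: as far as I know the OVFSTORE
volume content is a tar archive, so a sketch like this can list it (the lv id
is a placeholder; activate the lv first if needed, and only read from it):

    # activate the OVFSTORE lv, copy its contents, and list the tar
    lvchange -ay <domain-uuid>/<ovfstore-lv-id>
    dd if=/dev/<domain-uuid>/<ovfstore-lv-id> of=/tmp/ovfstore.bin bs=1M
    tar tvf /tmp/ovfstore.bin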
The first six 512 byte blocks of /dev/<id>/metadata seem to
contain only zeros.
This is normal; the first 2048 bytes are always zeroes. This area was
used for domain metadata in older versions.
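To verify, dump the start of the metadata lv; hexdump collapses runs of
zeros into a single '*' line (placeholder path again):

    # the first 2048 bytes (4 x 512) should show as all zeros
    dd if=/dev/<domain-uuid>/metadata bs=512 count=4 2>/dev/null | hexdump -C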
Any advice on how to proceed here?
Is there a way to recover this storage domain?
Please share more details:
- output of "lsblk"
- output of "multipath -ll"
- output of "/usr/libexec/vdsm/fc-scan -v"
- output of "vgs -o +tags problem-domain-id"
- output of "lvs -o +tags problem-domain-id"
- contents of /etc/multipath.conf
- contents of /etc/multipath.conf.d/*.conf
- /var/log/messages since the issue started
- /var/log/vdsm/vdsm.log* since the issue started on one of the hosts
A bug is probably the best place to keep these logs and make them easy to track.
Thanks,
Nir