Hi,
I'm running into a strange issue that I've been trying to solve
unsuccessfully for a few days now, and I was hoping someone could offer
some insight.
A few days ago I needed to reboot the server that hosts the management
engine and is also a node in the system. Following proper procedure I
selected a new host to be the SPM, migrated VMs off the Host, and put it
into maintenance mode. After the host came back up (and the management
engine was back online), I noticed that one of my VMs had halted on a
storage error. To rule out the SPM being the issue, I asked oVirt to select
a new SPM, but it got stuck in a contending loop where each host tries to
contend for SPM status but ultimately fails (every other VM had also halted by
this point). The error was "BlockSD master file system FSCK error". After
researching the error, I found a post on this list describing the same
error, whose author said that a simple FSCK on the offending file system
fixed his issue. I had to force shutdown every VM from the halted state and
put all but one host into maintenance mode. On that host I ran an FSCK on
the offending volume; it found a lot of short read errors, which it fixed.
Afterwards the contending loop was broken and hosts could successfully
become the SPM.
Now every VM halts on start or resume, even ones that were offline at the
time of the earlier incident, with a Storage Error "abnormal vm stop device
virtio-disk0 error eio". I can't even create new disks because it fails
with an error. I've attached what I think is the relevant portion of the
VDSM log for a VM trying to resume; if more is needed, please just let me know.
I'm worried FSCK and I mangled the file system, and have no idea how to
repair it.
Any insight is greatly appreciated.
Thank you,
Dave
oVirt Engine Version: 3.6.2.6-1.el7.centos
Storage Tech: iSCSI
4 Servers each with:
OS: RHEL - 7 - 2.1511.el7.centos.2.10
Kernel: 3.10.0 - 327.4.5.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.7.1
libVirt Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.17.28-0.el7.centos