On Mon, Jun 27, 2016 at 6:32 PM, David Gerner <davegerner1975(a)gmail.com> wrote:
Hi,
I'm running into a strange issue that I've been trying to solve
unsuccessfully for a few days now, and I was hoping someone could offer some
insight.
A few days ago I needed to reboot the server that hosts the management
engine and is also a node in the system. Following proper procedure, I
selected a new host to be the SPM, migrated the VMs off the host, and put it
into maintenance mode. After the host came back up (and the management
engine was back online) I noticed that one of my VMs had halted on a storage
error. To rule out the SPM being the issue, I asked oVirt to select a new SPM,
but it got stuck in a contending loop where each host tries to contend for
SPM status and ultimately fails (every other VM had also halted by this point).
The error was "BlockSD master file system FSCK error". After researching the
error I found a post on this list with the same error, and the author
said that a simple fsck on the offending file system fixed his issue. I had
to force shutdown every VM from the halted state and put all but one host
into maintenance mode. On that host I ran fsck on the offending volume,
which found a lot of short read errors that it fixed; afterwards the
contending loop was broken and hosts could successfully become the SPM.
This means that the master file system is probably OK on the current master
domain, which may be a different domain - do you have more than one storage
domain?
Now every VM halts on start or resume, even ones that were offline at the
time of the earlier incident, with a storage error: "abnormal vm stop device
virtio-disk0 error eio". I can't even create new disks because it fails with
an error. I've attached what I think is the relevant VDSM log portion of a
VM trying to resume; if more is needed please just let me know.
We need full vdsm logs and engine logs to understand such errors.
I think the best thing would be to open a bug and attach these logs.
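If it helps, a minimal sketch of collecting them by hand (these are the usual
default log locations; adjust if your installation differs):

# on each host
tar czf vdsm-logs-$(hostname).tar.gz /var/log/vdsm/vdsm.log*

# on the engine machine
tar czf engine-logs.tar.gz /var/log/ovirt-engine/engine.log*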
I'm worried that fsck and I mangled the file system, and I have no idea how to
repair it.
It can be useful to investigate the master lv that failed and that you fixed using fsck.
You can copy this lv's contents like this:
1. Make sure the domain is not the master domain, and that the master
lv is not mounted.
If the lv is mounted, you can force another domain to be the master
domain by deactivating this domain. Since your vms are not running,
this should not be a problem.
If you deactivated this domain, you may need to connect to it again;
you can do this using iscsiadm (see the sketch below).
If you connect to the target manually using iscsiadm, make sure to delete the node
when you finish, so it will not be connected automatically on reboot.
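A rough sketch of the check and the iscsiadm commands above (the portal
address and target name are placeholders I made up - substitute the values
for your storage):

# verify the master lv is not mounted (no output means it is not mounted)
findmnt /dev/vgname/master

# discover the targets on the portal and log in again
iscsiadm -m discovery -t sendtargets -p 10.0.0.1:3260
iscsiadm -m node -T iqn.2016-06.example:target -p 10.0.0.1:3260 --login

# when done, log out and delete the node record so it will not be
# reconnected automatically on reboot
iscsiadm -m node -T iqn.2016-06.example:target -p 10.0.0.1:3260 --logout
iscsiadm -m node -T iqn.2016-06.example:target -p 10.0.0.1:3260 -o delete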
2. Activate the master lv if needed:
lvchange -ay vgname/master
Warning: do not mount this lv
3. Copy the lv using dd to other storage:
dd if=/dev/vgname/master of=master.backup bs=1M iflag=direct oflag=direct
4. Deactivate the master lv
lvchange -an vgname/master
5. Compress the backup and share the image.
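For example, a minimal sketch assuming xz is installed (gzip works just as well):

# compresses in place, producing master.backup.xz
xz -9 master.backup

Then attach the resulting file to the bug, or share a link to it.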
Nir