On Wed, Feb 22, 2017 at 9:32 AM, Nir Soffer <nsoffer@redhat.com> wrote:
On Wed, Feb 22, 2017 at 10:31 AM, Nir Soffer <nsoffer@redhat.com> wrote:

>
> This means that sanlock could not initialize a lease in the new volume created
> for the snapshot.
>
> Can you attach sanlock.log?

Found it in your next message


OK.
Just to recap what happened from a physical point of view:

- apparently I had a disk array with no spare disks left, and the LUN making up the storage domain was on this array.
So I was in the process of moving the disks off the impacted storage domain and then removing the storage domain itself, so that we could remove the logical array on the storage system.
This is a test storage system without support, so at the moment I had no more spare disks for it

- in the meantime there was another disk failure in the array, causing data loss because no spare was available at that time

- No evidence of errors at the VM OS level or at the storage domain level

- But probably the 2 operations:
1) moving the disk
2) creating a snapshot of the VM containing the disk
could not complete due to this low-level problem

It would be nice to find evidence of this. The storage domain didn't go offline, BTW

- I got confirmation of the data loss this way:
The original disk of the VM was, inside the VM, a PV of a VG.
I added a disk (on another storage domain) to the VM, made it a PV and added it to the original VG.
I tried a pvmove from the source disk to the new disk, but it reached about 47% and then stopped/failed, pausing the VM.
I could start the VM again, but as soon as the pvmove resumed, the VM went back to the paused state.
So I powered off the VM and was able to detach/delete the corrupted disk and then remove the storage domain (see the other thread I opened yesterday)

I then managed to recover the now-corrupted VG and restore the data of the original filesystem from backup.
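
For reference, the sequence I ran inside the guest was roughly the following (just a sketch; vol_group, /dev/vdb and /dev/vdc are placeholders for my actual VG and device names):

  pvcreate /dev/vdc              # initialize the newly added disk as a PV
  vgextend vol_group /dev/vdc    # extend the original VG onto it
  pvmove /dev/vdb /dev/vdc       # move extents off the corrupted PV
                                 # (this is the step that hung at ~47% and paused the VM)
  vgreduce vol_group /dev/vdb    # would have dropped the old PV on success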

So the original problem was a low-level storage error.
If it can help to narrow down oVirt's behavior in this scenario, I can provide further logs from the VM OS or from the hosts/engine.
Let me know.

Some questions:
- how is the reaction of putting the VM into paused mode on I/O errors managed in a case like this? Can I somehow keep the VM running and let it see the errors, as a real physical server would, or not?
- Why didn't I get any message at the storage domain level, but only at the VM disk level?
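
(For the first question, if I understand correctly the pause comes from the error policy that libvirt/QEMU applies to the disk; on the host I can see it in the running domain XML with something like the following, where "myvm" is a placeholder for the VM name:

  virsh -r dumpxml myvm | grep error_policy
  # e.g. <driver name='qemu' type='qcow2' error_policy='stop' .../>
  # 'stop' pauses the guest on I/O errors, while 'report' would pass the
  # error through to the guest

Please correct me if the behavior is controlled elsewhere.)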

Thanks for the help so far
Gianluca