I am seeing the errors even without any VM activity:

Apr 10 23:24:24 bufferoverflow vdsm TaskManager.Task ERROR Task=`04370dcb-a823-4485-8d7e-b4cfc75905a0`::Unexpected error
Apr 10 23:24:24 bufferoverflow vdsm Storage.Dispatcher.Protect ERROR {'status': {'message': "Unknown pool id, pool not connected: ('5849b030-626e-47cb-ad90-3ce782d831b3',)", 'code': 309}}
Apr 10 23:24:24 bufferoverflow vdsm TaskManager.Task ERROR Task=`354009d2-e7c1-4558-b947-0e3a19ab5490`::Unexpected error
Apr 10 23:24:24 bufferoverflow vdsm Storage.Dispatcher.Protect ERROR {'status': {'message': "Unknown pool id, pool not connected: ('5849b030-626e-47cb-ad90-3ce782d831b3',)", 'code': 309}}
Apr 10 23:24:25 bufferoverflow kernel: [ 9136.829062] ata1: hard resetting link
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.135381] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.146797] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120711/psargs-359)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.146805] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff880407c74d70), AE_NOT_FOUND (20120711/psparse-536)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.156747] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120711/psargs-359)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.156755] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff880407c74d70), AE_NOT_FOUND (20120711/psparse-536)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.157350] ata1.00: configured for UDMA/133
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.157355] ata1: EH complete
Apr 10 23:24:29 bufferoverflow kernel: [ 9140.856010] device-mapper: table: 253:0: multipath: error getting device
Apr 10 23:24:29 bufferoverflow kernel: [ 9140.856013] device-mapper: ioctl: error adding target to table
Apr 10 23:24:29 bufferoverflow kernel: [ 9140.856534] device-mapper: table: 253:0: multipath: error getting device
Apr 10 23:24:29 bufferoverflow kernel: [ 9140.856536] device-mapper: ioctl: error adding target to table
Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)
Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)
Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)
Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)
Apr 10 23:24:29 bufferoverflow vdsm Storage.LVM WARNING lvm vgs failed: 5 [] ['  Volume group "1083422e-a5db-41b6-b667-b9ef1ef244f0" not found']
Apr 10 23:24:29 bufferoverflow vdsm TaskManager.Task ERROR Task=`0248943a-6acd-496b-932e-b236920932f0`::Unexpected error
Apr 10 23:24:29 bufferoverflow vdsm Storage.Dispatcher.Protect ERROR {'status': {'message': "Cannot find master domain: 'spUUID=5849b030-626e-47cb-ad90-3ce782d831b3, msdUUID=1083422e-a5db-41b6-b667-b9ef1ef244f0'", 'code': 304}}
Apr 10 23:24:29 bufferoverflow vdsm Storage.HSM WARNING disconnect sp: 5849b030-626e-47cb-ad90-3ce782d831b3 failed. Known pools {}
Apr 10 23:24:29 bufferoverflow vdsm Storage.LVM WARNING lvm vgs failed: 5 [] ['  Volume group "a8286508-db45-40d7-8645-e573f6bacdc7" not found']
Apr 10 23:24:29 bufferoverflow vdsm TaskManager.Task ERROR Task=`c1ea06bf-8046-4cb4-88c7-00337051d713`::Unexpected error
Apr 10 23:24:29 bufferoverflow vdsm Storage.Dispatcher.Protect ERROR {'status': {'message': "Storage domain does not exist: ('a8286508-db45-40d7-8645-e573f6bacdc7',)", 'code': 358}}
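
(For reference, the device-mapper and LVM state behind the messages above can
be cross-checked with the standard tools. This is just a sketch, assuming root
and the stock multipath-tools/lvm2 packages:)

    # List active multipath maps and their component paths:
    multipath -ll
    # Show the existing device-mapper tables (the failing target above is 253:0):
    dmsetup ls
    dmsetup table
    # Ask LVM directly for the volume group vdsm could not find:
    vgs 1083422e-a5db-41b6-b667-b9ef1ef244f0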



I have attached the full logs.

I suspect this is some Fedora <--> SSD bug that may be unrelated to oVirt.
I'm going to work around it for now by moving my VM's storage to a magnetic disk on another server.
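
Before blaming the SSD entirely, its SMART state can be checked. A sketch,
assuming smartmontools is installed and /dev/sda is the SSD behind ata1:

    # Overall health verdict:
    smartctl -H /dev/sda
    # Full attribute dump; look for reallocated/pending sectors and CRC errors:
    smartctl -a /dev/sda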


Yuval Meir




On Tue, Apr 9, 2013 at 11:31 AM, Federico Simoncelli <fsimonce@redhat.com> wrote:
----- Original Message -----
> From: "Yuval M" <yuvalme@gmail.com>
> To: "Dan Kenigsberg" <danken@redhat.com>
> Cc: users@ovirt.org, "Nezer Zaidenberg" <nzaidenberg@mac.com>
> Sent: Friday, March 29, 2013 2:19:43 PM
> Subject: Re: [Users] VM crashes and doesn't recover
>
> Any ideas on what can cause that storage crash?
> could it be related to using a SSD?

What the logs say is that the I/O on the storage domain is failing (both
the oop timeouts and the sanlock log), and this triggers the VDSM restart.
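
A rough way to pull those entries out for a closer look (the patterns are
approximate; the exact message text varies between versions):

    # I/O timeouts reported by vdsm and failures reported by sanlock:
    grep -i 'timeout' /var/log/vdsm/vdsm.log
    grep -iE 'error|fail' /var/log/sanlock.log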

> On Sun, Mar 24, 2013 at 09:50:02PM +0200, Yuval M wrote:
> > I am running vdsm from packages as my interest is in developing for the
> > I noticed that when the storage domain crashes I can't even do "df -h"
> > (hangs)

This is also consistent with the unreachable domain.
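
A hanging df usually means it is blocking in stat() on the dead NFS mount.
Probing each mount with a timeout avoids the hang; a sketch (vdsm normally
mounts NFS domains under /rhev/data-center/mnt):

    # Probe every NFS mount with a timeout instead of letting df block:
    for m in $(awk '$3 == "nfs" || $3 == "nfs4" {print $2}' /proc/mounts); do
        timeout 5 stat -f "$m" > /dev/null 2>&1 || echo "unresponsive: $m"
    done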

The dmesg log that you attached doesn't contain timestamps, so it's hard to
correlate it with the rest.
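
For the next run: recent util-linux dmesg can print human-readable timestamps,
and kernel log timestamping can be enabled at runtime (availability depends on
your versions):

    # Human-readable timestamps (util-linux 2.21 and later):
    dmesg -T
    # Or enable kernel log timestamps on the fly:
    echo 1 > /sys/module/printk/parameters/time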

If you want, you can try to reproduce the issue and resubmit the logs:

/var/log/vdsm/vdsm.log
/var/log/sanlock.log
/var/log/messages

(It would also help if you stated the exact time when the issue begins to appear.)
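
Something like this would bundle them with the capture time in the name (just
a sketch):

    tar czf "logs-$(date +%Y%m%d-%H%M%S).tar.gz" \
        /var/log/vdsm/vdsm.log /var/log/sanlock.log /var/log/messages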

In the logs I noticed that you're using only one NFS domain, and I think that
the SSD (on the storage side) shouldn't be a problem. When you experience such
a failure, are you able to read from and write to the SSD on the machine that
is serving the share? (If it's the same machine, check using the "real" path
where it's mounted, not the NFS share.)
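
For instance, a direct-I/O probe against the export path (a sketch;
/srv/nfs/export is a placeholder for your actual export directory, and
O_DIRECT keeps the page cache from masking a failing disk):

    # Write and read back 64 MiB directly on the export path, not the mount:
    dd if=/dev/zero of=/srv/nfs/export/.iotest bs=1M count=64 oflag=direct
    dd if=/srv/nfs/export/.iotest of=/dev/null bs=1M iflag=direct
    rm -f /srv/nfs/export/.iotest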

--
Federico