[Users] VM crashes and doesn't recover

Yuval M yuvalme at gmail.com
Wed Apr 10 20:27:11 UTC 2013


I am seeing the errors appear even without any VM activity:

Apr 10 23:24:24 bufferoverflow vdsm TaskManager.Task ERROR
Task=`04370dcb-a823-4485-8d7e-b4cfc75905a0`::Unexpected error
Apr 10 23:24:24 bufferoverflow vdsm Storage.Dispatcher.Protect ERROR
{'status': {'message': "Unknown pool id, pool not connected:
('5849b030-626e-47cb-ad90-3ce782d831b3',)", 'code': 309}}
Apr 10 23:24:24 bufferoverflow vdsm TaskManager.Task ERROR
Task=`354009d2-e7c1-4558-b947-0e3a19ab5490`::Unexpected error
Apr 10 23:24:24 bufferoverflow vdsm Storage.Dispatcher.Protect ERROR
{'status': {'message': "Unknown pool id, pool not connected:
('5849b030-626e-47cb-ad90-3ce782d831b3',)", 'code': 309}}
Apr 10 23:24:25 bufferoverflow kernel: [ 9136.829062] ata1: hard resetting
link
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.135381] ata1: SATA link up
6.0 Gbps (SStatus 133 SControl 300)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.146797] ACPI Error: [DSSP]
Namespace lookup failure, AE_NOT_FOUND (20120711/psargs-359)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.146805] ACPI Error: Method
parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff880407c74d70),
AE_NOT_FOUND (20120711/psparse-536)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.156747] ACPI Error: [DSSP]
Namespace lookup failure, AE_NOT_FOUND (20120711/psargs-359)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.156755] ACPI Error: Method
parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff880407c74d70),
AE_NOT_FOUND (20120711/psparse-536)
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.157350] ata1.00: configured
for UDMA/133
Apr 10 23:24:25 bufferoverflow kernel: [ 9137.157355] ata1: EH complete
Apr 10 23:24:29 bufferoverflow kernel: [ 9140.856010] device-mapper: table:
253:0: multipath: error getting device
Apr 10 23:24:29 bufferoverflow kernel: [ 9140.856013] device-mapper: ioctl:
error adding target to table
Apr 10 23:24:29 bufferoverflow kernel: [ 9140.856534] device-mapper: table:
253:0: multipath: error getting device
Apr 10 23:24:29 bufferoverflow kernel: [ 9140.856536] device-mapper: ioctl:
error adding target to table
Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)
Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)
Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)
Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)
Apr 10 23:24:29 bufferoverflow vdsm Storage.LVM WARNING lvm vgs failed: 5
[] ['  Volume group "1083422e-a5db-41b6-b667-b9ef1ef244f0" not found']
Apr 10 23:24:29 bufferoverflow vdsm TaskManager.Task ERROR
Task=`0248943a-6acd-496b-932e-b236920932f0`::Unexpected error
Apr 10 23:24:29 bufferoverflow vdsm Storage.Dispatcher.Protect ERROR
{'status': {'message': "Cannot find master domain:
'spUUID=5849b030-626e-47cb-ad90-3ce782d831b3,
msdUUID=1083422e-a5db-41b6-b667-b9ef1ef244f0'", 'code': 304}}
Apr 10 23:24:29 bufferoverflow vdsm Storage.HSM WARNING disconnect sp:
5849b030-626e-47cb-ad90-3ce782d831b3 failed. Known pools {}
Apr 10 23:24:29 bufferoverflow vdsm Storage.LVM WARNING lvm vgs failed: 5
[] ['  Volume group "a8286508-db45-40d7-8645-e573f6bacdc7" not found']
Apr 10 23:24:29 bufferoverflow vdsm TaskManager.Task ERROR
Task=`c1ea06bf-8046-4cb4-88c7-00337051d713`::Unexpected error
Apr 10 23:24:29 bufferoverflow vdsm Storage.Dispatcher.Protect ERROR
{'status': {'message': "Storage domain does not exist:
('a8286508-db45-40d7-8645-e573f6bacdc7',)", 'code': 358}}
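
To make an excerpt like the one above easier to line up against other logs, the records can be tallied per second and per source. This is only a rough sketch: the regular expression and the `tally_by_second` helper are illustrative names of my own, not part of vdsm or any other tool, and it assumes the classic syslog prefix (month, day, time, host, source) seen in these lines.

```python
import re
from collections import Counter

# Matches the start of a syslog record, e.g.
# "Apr 10 23:24:25 bufferoverflow kernel: [ 9136.829062] ata1: hard resetting"
# (assumed format: "Mon DD HH:MM:SS host source ...").
RECORD = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) (?P<source>[^\s:]+)"
)


def tally_by_second(lines):
    """Count records per (timestamp, source); wrapped continuation lines are skipped."""
    counts = Counter()
    for line in lines:
        match = RECORD.match(line)
        if match:
            counts[(match["stamp"], match["source"])] += 1
    return counts


# A few lines copied from the excerpt, including two wrapped continuations.
sample = [
    "Apr 10 23:24:24 bufferoverflow vdsm TaskManager.Task ERROR",
    "Task=`04370dcb-a823-4485-8d7e-b4cfc75905a0`::Unexpected error",
    "Apr 10 23:24:25 bufferoverflow kernel: [ 9136.829062] ata1: hard resetting",
    "link",
    "Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)",
    "Apr 10 23:24:29 bufferoverflow multipathd: dm-0: remove map (uevent)",
]

for (stamp, source), n in sorted(tally_by_second(sample).items()):
    print(stamp, source, n)
```

The tally only shows which sources logged in the same second; it does not by itself say which event caused which.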



I have attached the full logs.

I suspect this is a Fedora <--> SSD bug that may be unrelated to oVirt.
For now I'm going to work around it by moving my VM's storage to a
magnetic disk on another server.


Yuval Meir




On Tue, Apr 9, 2013 at 11:31 AM, Federico Simoncelli <fsimonce at redhat.com> wrote:

> ----- Original Message -----
> > From: "Yuval M" <yuvalme at gmail.com>
> > To: "Dan Kenigsberg" <danken at redhat.com>
> > Cc: users at ovirt.org, "Nezer Zaidenberg" <nzaidenberg at mac.com>
> > Sent: Friday, March 29, 2013 2:19:43 PM
> > Subject: Re: [Users] VM crashes and doesn't recover
> >
> > Any ideas on what can cause that storage crash?
> > could it be related to using a SSD?
>
> What the logs say is that IO on the storage domain is failing (both
> the oop timeouts and the sanlock log) and this triggers the VDSM restart.
>
> > On Sun, Mar 24, 2013 at 09:50:02PM +0200, Yuval M wrote:
> > > I am running vdsm from packages as my interest is in developing for the
> > > I noticed that when the storage domain crashes I can't even do "df -h"
> > > (hangs)
>
> This is also consistent with the unreachable domain.
>
> The dmesg log that you attached doesn't contain timestamps so it's hard to
> correlate with the rest.
>
> If you want you can try to reproduce the issue and resubmit the logs:
>
> /var/log/vdsm/vdsm.log
> /var/log/sanlock.log
> /var/log/messages
>
> (Maybe also state the exact time when the issue begins to appear.)
>
> In the logs I noticed that you're using only one NFS domain, and I think
> that the SSD (on the storage side) shouldn't be a problem. When you
> experience such a failure, are you able to read/write from/to the SSD on
> the machine that is serving the share? (If it's the same machine, check
> that using the "real" path where it's mounted, not the NFS share.)
>
> --
> Federico
>
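
The read/write check suggested in the quoted message could be scripted roughly as follows. This is a minimal sketch, not anything shipped with vdsm: `probe_read_write` is a hypothetical helper, and the directory passed to it is an assumption (it should be the real local path backing the export, not the NFS mount). The probe runs in a daemon thread so that a hung device produces a "timeout" result instead of blocking the shell the way "df -h" does.

```python
import os
import tempfile
import threading


def probe_read_write(directory, timeout=10.0, payload=b"io-probe"):
    """Write, fsync and read back a small file in `directory`.

    Returns "ok", "mismatch", "timeout" or an "error: ..." string.
    """
    result = {}

    def worker():
        try:
            fd, path = tempfile.mkstemp(dir=directory)
            try:
                os.write(fd, payload)
                os.fsync(fd)  # push the data to the device, not just the page cache
                os.lseek(fd, 0, os.SEEK_SET)
                # Note: this read may be served from the page cache.
                data = os.read(fd, len(payload))
            finally:
                os.close(fd)
                os.unlink(path)
            result["status"] = "ok" if data == payload else "mismatch"
        except OSError as exc:
            result["status"] = f"error: {exc}"

    # A daemon thread does not keep the interpreter alive if the IO never returns.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    return result.get("status", "timeout")


# Hypothetical usage: replace the temp directory with the real path on the SSD.
print(probe_read_write(tempfile.gettempdir()))
```

Running it once against the local path and once against the NFS mount point, and noting the time of each run, would give results that can be matched against vdsm.log, sanlock.log and /var/log/messages.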
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ata-reset-log.zip
Type: application/zip
Size: 979546 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20130410/44110cec/attachment-0001.zip>