[Users] VM crashes and doesn't recover

Yuval M yuvalme at gmail.com
Sun Mar 24 09:40:19 UTC 2013


sanlock is at the latest version (upgrading it solved another problem we had a
few days ago):

$ rpm -q sanlock
sanlock-2.6-7.fc18.x86_64

The storage is on the same machine as the engine and VDSM.
iptables is running, but there is a rule that allows all localhost traffic.
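For reference, a quick way to confirm that from the shell (2049 and 111 are the
standard NFS and rpcbind ports; the exact rules in our setup may differ) is
something like:

# the ACCEPT rule for the loopback interface (-i lo) should come before any REJECT rule
$ iptables -L INPUT -n -v --line-numbers

# if the mount uses a non-loopback address, NFS (2049) and rpcbind (111) must be allowed
$ iptables -L INPUT -n | grep -E '2049|111'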


On Sun, Mar 24, 2013 at 11:34 AM, Maor Lipchuk <mlipchuk at redhat.com> wrote:

> From the VDSM log, it seems that the master storage domain was not
> responding.
>
> Thread-23::DEBUG::2013-03-22 18:50:20,263::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) Domain 1083422e-a5db-41b6-b667-b9ef1ef244f0 changed its status to Invalid
> ....
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
>     self.domain.selftest()
>   File "/usr/share/vdsm/storage/nfsSD.py", line 108, in selftest
>     fileSD.FileStorageDomain.selftest(self)
>   File "/usr/share/vdsm/storage/fileSD.py", line 480, in selftest
>     self.oop.os.statvfs(self.domaindir)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 280, in callCrabRPCFunction
>     *args, **kwargs)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
>     rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
>   File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
>     raise Timeout()
> Timeout
> .....
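> To rule this out on the host itself, you could run the same kind of check the
> monitor does directly against the domain directory, e.g. (the exact path
> depends on where the export is mounted; on NFS domains it is usually under
> /rhev/data-center/mnt/, so this is just a sketch):
>
> # stat -f issues a filesystem stat on the directory; if this hangs or takes
> # seconds, the storage is not responding
> $ time stat -f /rhev/data-center/mnt/<server:_export>/1083422e-a5db-41b6-b667-b9ef1ef244f0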
>
> I also see a sanlock issue, but I think that is because the storage could
> not be reached:
> ReleaseHostIdFailure: Cannot release host id:
> ('1083422e-a5db-41b6-b667-b9ef1ef244f0', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
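> If you want to see what sanlock itself is still holding, something like this
> should show it, though I would expect it to clear up once the storage
> responds again:
>
> # lists the lockspaces and resources the sanlock daemon currently holds
> $ sanlock client status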
>
> Can you check whether iptables is running on your host, and if so, whether
> it is blocking access to the storage server by any chance?
> Can you try to manually mount this NFS export and see if it works?
> Is it possible the storage server is having connectivity issues?
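> For the manual mount, something along these lines should be enough (the
> server address and export path below are placeholders, use whatever the
> storage domain is configured with):
>
> # see what the server exports, then try mounting it read-only in a temporary location
> $ showmount -e <storage-server>
> $ mkdir -p /tmp/nfs-test
> $ mount -t nfs -o ro <storage-server>:/<export-path> /tmp/nfs-test
> $ ls /tmp/nfs-test && umount /tmp/nfs-test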
>
>
> Regards,
> Maor
>
> On 03/22/2013 08:24 PM, Limor Gavish wrote:
> > Hello,
> >
> > I am using oVirt 3.2 on Fedora 18:
> > [wil at bufferoverflow ~]$ rpm -q vdsm
> > vdsm-4.10.3-7.fc18.x86_64
> >
> > (the engine is built from sources).
> >
> > I seem to have hit this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=922515
> >
> > in the following configuration:
> > Single host (no migrations).
> > Created a VM and installed an OS inside it (Fedora 18).
> > Stopped the VM.
> > Created a template from it.
> > Created an additional VM from the template using thin provisioning.
> > Started the second VM.
> >
> > In addition to the errors in the logs, the storage domains (both data and
> > ISO) crashed, i.e. went to the "unknown" and "inactive" states respectively
> > (see the attached engine.log).
> >
> > I attached the VDSM and engine logs.
> >
> > Is there a way to work around this problem?
> > It happens repeatedly.
> >
> > Yuval Meir
> >
> >
> >
> > _______________________________________________
> > Users mailing list
> > Users at ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/users
> >
>
>
>

