[ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock: error -243

Andrew Lau andrew at andrewklau.com
Fri Jun 6 03:35:18 EDT 2014


Is this related to the NFS server which gluster provides, or is it
because of the way gluster does replication?

There are a few posts, e.g.
http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/, which
report success with gluster + hosted engine. It'd be good to know, so
we could possibly try a workaround.

Cheers.

On Fri, Jun 6, 2014 at 4:19 PM, Jiri Moskovcak <jmoskovc at redhat.com> wrote:
> I've seen that problem in other threads; the common denominator was "nfs on
> top of gluster". So if you have this setup, then it's a known problem.
> Otherwise, you should double check that your hosts have different ids,
> or they will be trying to acquire the same lock.
>
> --Jirka
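
For anyone who wants to rule out the duplicate-id case Jirka mentions
above: each HA host records its id in
/etc/ovirt-hosted-engine/hosted-engine.conf (path assumed from a default
hosted-engine setup), so a quick check on every host is something like:

    # the host_id values must differ from host to host
    grep ^host_id /etc/ovirt-hosted-engine/hosted-engine.conf
    # what sanlock itself reports for this host's lockspaces
    sanlock client status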
>
>
> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>
>> Hi Ivan,
>>
>> Thanks for the in depth reply.
>>
>> I've only seen this happen twice, and only after I added a third host
>> to the HA cluster. I wonder if that's the root cause.
>>
>> Have you seen this happen on all your installs, or only after your
>> manual migration? It's a little frustrating, as I was hoping to get
>> this into a production environment. It was all working except for
>> that log message :(
>>
>> Thanks,
>> Andrew
>>
>>
>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combuster at archlinux.us> wrote:
>>>
>>> Hi Andrew,
>>>
>>> this is something that I saw in my logs too, first on one node and
>>> then on the other three. When that happened on all four of them, the
>>> engine was corrupted beyond repair.
>>>
>>> First of all, I think that message is saying that sanlock can't get a
>>> lock on the shared storage that you defined for the hosted engine
>>> during installation. I got this error when I tried to manually migrate
>>> the hosted engine. There is an unresolved bug there, and I think it's
>>> related to this one:
>>>
>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to
>>> zero]
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>
>>> This is a blocker bug (or should be) for the self-hosted engine and,
>>> from my own experience with it, the feature shouldn't be used in a
>>> production environment until it's fixed.
>>>
>>> Nothing I did could fix the fact that the score for the target node
>>> was zero: I tried reinstalling the node, rebooting it, restarting
>>> several services, tailing tons of logs, etc., but to no avail. When
>>> only one node was left (the one actually running the hosted engine), I
>>> brought the engine's VM down gracefully (hosted-engine --vm-shutdown,
>>> I believe) and after that, when I tried to start the VM, it wouldn't
>>> load. VNC showed that the filesystem inside the VM was corrupted, and
>>> even after I ran fsck and finally got it started, it was too badly
>>> damaged. I managed to start the engine itself (after repairing the
>>> postgresql service, which wouldn't start), but the database was
>>> damaged enough that it acted pretty weird (it showed storage domains
>>> as down while the VMs were running fine, etc.). Luckily, I had already
>>> exported all of the VMs at the first sign of trouble, then installed
>>> ovirt-engine on a dedicated server and attached the export domain.
>>>
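For reference, the hosted-engine commands involved in that kind of manual
shutdown and recovery, assuming the standard hosted-engine CLI (flag names
may differ slightly between versions):

    # check the HA agent state and score reported by each host
    hosted-engine --vm-status
    # stop the HA agents from restarting the VM while you work on it
    hosted-engine --set-maintenance --mode=global
    # graceful shutdown / start of the engine VM
    hosted-engine --vm-shutdown
    hosted-engine --vm-start
    # leave maintenance once the engine is healthy again
    hosted-engine --set-maintenance --mode=none
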
>>> So while it's a really useful feature, and it works for the most part
>>> (i.e. automatic migration works), manually migrating the hosted-engine
>>> VM will lead to trouble.
>>>
>>> I hope my experience with it will be of use to you. It happened to me
>>> two weeks ago; ovirt-engine was current (3.4.1) and there was no fix
>>> available.
>>>
>>> Regards,
>>>
>>> Ivan
>>>
>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>
>>> Hi,
>>>
>>> I'm seeing this weird message in my engine log
>>>
>>> 2014-06-06 03:06:09,380 INFO
>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>> 2014-06-06 03:06:12,494 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>> 2014-06-06 03:06:12,561 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>> 62a9d4c1
>>> 2014-06-06 03:06:12,652 INFO
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>> message: internal error Failed to acquire lock: error -243.
>>>
>>> It also appears to occur on the other hosts in the cluster, except the
>>> host which is running the hosted engine. So with 3 servers right now,
>>> it shows up twice in the engine UI.
>>>
>>> The engine VM itself continues to run without any issues on the host
>>> that doesn't show that error.
>>>
>>> Any ideas?
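
If it helps narrow it down: a way to see which host actually holds the
lease when that -243 message shows up, assuming sanlock and vdsm are in
their default locations on each host:

    # list the lockspaces and leases this host has acquired
    sanlock client status
    # sanlock's log usually shows the failed acquire and the current owner
    grep -i acquire /var/log/sanlock.log | tail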
>

