[ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

Mon Jun 9 11:56:29 UTC 2014

Interesting. My storage network is L2-only and doesn't run over
ovirtmgmt (which is the only network HostedEngine sees), but I've only
seen this issue when running ctdb in front of my NFS server.
Previously I used localhost as the NFS address, since every host ran
the NFS server itself (on gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <alukiano at redhat.com> wrote:
> I blocked the connection to storage only for testing, and as a result I got this error: "Failed to acquire lock error -243", so I added it to the steps to reproduce.
> If you know other steps that reproduce this error without blocking the connection to storage, it would be wonderful if you could provide them.
> Thanks
>
> ----- Original Message -----
> From: "Andrew Lau" <andrew at andrewklau.com>
> To: "combuster" <combuster at archlinux.us>
> Cc: "users" <users at ovirt.org>
> Sent: Monday, June 9, 2014 3:47:00 AM
> Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243
>
> I just ran a few extra tests. I had a two-host hosted-engine setup running
> for a day; both hosts had a score of 2400. I migrated the VM through the
> UI multiple times and it all worked fine. I then added the third host, and
> that's when it all fell to pieces.
> The other two hosts have a score of 0 now.
>
> I'm also curious, in the BZ there's a note about:
>
> where engine-vm block connection to storage domain(via iptables -I
> INPUT -s sd_ip -j DROP)
>
> What's the purpose for that?
>
> On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <andrew at andrewklau.com> wrote:
>> Ignore that, the issue came back after 10 minutes.
>>
>> I've even tried a gluster mount + nfs server on top of that, and the
>> same issue has come back.
>>
>> On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <andrew at andrewklau.com> wrote:
>>> Interesting, I put it all into global maintenance and shut it all down
>>> for ~10 minutes. It has regained its sanlock control and the issue no
>>> longer seems to come up in the log.
>>>
>>> On Fri, Jun 6, 2014 at 4:21 PM, combuster <combuster at archlinux.us> wrote:
>>>> It was pure NFS on a NAS device. They all had different ids (there were no
>>>> redeployments of nodes before the problem occurred).
>>>>
>>>> Thanks Jirka.
>>>>
>>>>
>>>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>>>
>>>>> I've seen that problem in other threads; the common denominator was "nfs
>>>>> on top of gluster". So if you have this setup, then it's a known problem.
>>>>> Otherwise, double-check that your hosts have different ids; if they don't,
>>>>> they would be trying to acquire the same lock.
>>>>>
>>>>> --Jirka
>>>>>
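The host id check suggested above can be done mechanically. Here is a minimal sketch, assuming each host keeps its id as a `host_id=<n>` line in its hosted-engine configuration (to my understanding /etc/ovirt-hosted-engine/hosted-engine.conf, but treat both the path and the key name as assumptions and verify them on your install). The function names and sample hostnames are hypothetical, and the script only inspects config text you have already collected from each host.

```python
import re
from collections import defaultdict

# Assumed config format: a line such as "host_id=2".
HOST_ID_RE = re.compile(r"^\s*host_id\s*=\s*(\d+)\s*$", re.MULTILINE)


def parse_host_id(conf_text):
    """Return the host_id found in the config text, or None if absent."""
    match = HOST_ID_RE.search(conf_text)
    return int(match.group(1)) if match else None


def find_duplicate_host_ids(confs):
    """Map each host_id used by more than one host to the sorted hostnames.

    `confs` maps a hostname to the text of that host's config file.
    """
    by_id = defaultdict(list)
    for hostname, text in confs.items():
        host_id = parse_host_id(text)
        if host_id is not None:
            by_id[host_id].append(hostname)
    return {hid: sorted(names) for hid, names in by_id.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical sample: host-a and host-c were deployed with the same id.
    sample = {
        "host-a": "host_id=1\n",
        "host-b": "host_id=2\n",
        "host-c": "host_id=1\n",
    }
    for host_id, hosts in find_duplicate_host_ids(sample).items():
        print(f"host_id {host_id} is shared by: {', '.join(hosts)}")
```

An empty result only rules out the duplicate-id cause; it says nothing about the "nfs on top of gluster" case.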
>>>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>>>
>>>>>> Hi Ivan,
>>>>>>
>>>>>> Thanks for the in depth reply.
>>>>>>
>>>>>> I've only seen this happen twice, and only after I added a third host
>>>>>> to the HA cluster. I wonder if that's the root problem.
>>>>>>
>>>>>> Have you seen this happen on all your installs or only just after your
>>>>>> manual migration? It's a little frustrating this is happening as I was
>>>>>> hoping to get this into a production environment. It was all working
>>>>>> except that log message :(
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combuster at archlinux.us> wrote:
>>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> this is something that I saw in my logs too, first on one node and then
>>>>>>> on the other three. When that happened on all four of them, the engine
>>>>>>> was corrupted beyond repair.
>>>>>>>
>>>>>>> First of all, I think that message is saying that sanlock can't get a
>>>>>>> lock on the shared storage that you defined for the hosted engine during
>>>>>>> installation. I got this error when I tried to manually migrate the
>>>>>>> hosted engine. There is an unresolved bug there and I think it's related
>>>>>>> to this one:
>>>>>>>
>>>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to zero]
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>>>
>>>>>>> This is a blocker bug (or should be) for the self-hosted engine and, from
>>>>>>> my own experience with it, the feature shouldn't be used in a production
>>>>>>> environment (not until it's fixed).
>>>>>>>
>>>>>>> Nothing that I did could fix the fact that the score for the target
>>>>>>> node was zero. I tried reinstalling the node, rebooting the node,
>>>>>>> restarting several services, tailing tons of logs etc., but to no avail.
>>>>>>> When only one node was left (the one actually running the hosted engine),
>>>>>>> I brought the engine's VM down gracefully (hosted-engine --vm-shutdown, I
>>>>>>> believe) and after that, when I tried to start the VM, it wouldn't load.
>>>>>>> VNC showed that the filesystem inside the VM was corrupted, and when I ran
>>>>>>> fsck and finally got it started, it was too badly damaged. I succeeded in
>>>>>>> starting the engine itself (after repairing the postgresql service, which
>>>>>>> wouldn't start), but the database was damaged enough that it acted pretty
>>>>>>> weird (it showed that storage domains were down while the VMs were running
>>>>>>> fine, etc.). Luckily, I had already exported all of the VMs at the first
>>>>>>> sign of trouble, so I then installed ovirt-engine on a dedicated server
>>>>>>> and attached the export domain.
>>>>>>>
>>>>>>> So while it's a really useful feature, and it's working (for the most
>>>>>>> part, i.e. automatic migration works), manually migrating the VM with the
>>>>>>> hosted engine will lead to trouble.
>>>>>>>
>>>>>>> I hope that my experience with it will be of use to you. It happened to
>>>>>>> me two weeks ago; ovirt-engine was current (3.4.1) and there was no fix
>>>>>>> available.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Ivan
>>>>>>>
>>>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm seeing this weird message in my engine log
>>>>>>>
>>>>>>> 2014-06-06 03:06:09,380 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>>>> 2014-06-06 03:06:12,494 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>>>> 2014-06-06 03:06:12,561 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>>>>> 62a9d4c1
>>>>>>> 2014-06-06 03:06:12,652 INFO
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>>>
>>>>>>> It also appears to occur on the other hosts in the cluster, except the
>>>>>>> host which is running the hosted-engine. So right now, with 3 servers, it
>>>>>>> shows up twice in the engine UI.
>>>>>>>
>>>>>>> The engine VM continues to run peacefully, without any issues on the
>>>>>>> host which doesn't have that error.
>>>>>>>
>>>>>>> Any ideas?
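For anyone trying to track how often this shows up, the audit message in the excerpt above can be pulled out of an engine log with a short script. This is a rough sketch, assuming the message keeps the exact wording shown in the excerpt ("VM <name> is down. Exit message: internal error Failed to acquire lock: error <code>"). The function name is hypothetical and the log text is passed in as a string, so no log path is assumed.

```python
import re

# Pattern taken from the wording of the audit message quoted in this thread.
LOCK_ERROR_RE = re.compile(
    r"VM (\S+) is down\. Exit message: "
    r"internal error Failed to acquire lock: error (-?\d+)"
)


def find_lock_failures(log_text):
    """Return (vm_name, error_code) pairs for each lock failure message.

    Whitespace is collapsed first so that messages wrapped across
    several lines, as in the mail above, are still matched.
    """
    flat = " ".join(log_text.split())
    return [(m.group(1), int(m.group(2))) for m in LOCK_ERROR_RE.finditer(flat)]


if __name__ == "__main__":
    sample = (
        "Custom Event ID: -1, Message: VM HostedEngine is down. Exit\n"
        "message: internal error Failed to acquire lock: error -243.\n"
    )
    for vm_name, code in find_lock_failures(sample):
        print(f"{vm_name}: lock acquisition failed with error {code}")
```

Counting the matches per host over time would show whether the errors started when the third host was added.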
>>>>>>> _______________________________________________
>>>>>>> Users mailing list
>>>>>>> Users at ovirt.org
>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>
>>>>>>>