[ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock error -243

Brad House brad at monetra.com
Tue Jun 10 12:18:50 UTC 2014


Ok, I thought I was doing something wrong yesterday, so I just
tore down my 3-node cluster with the hosted engine and started
rebuilding.  I was seeing essentially the same thing: a score of
0 on the hosts not running the engine, and it wouldn't allow
migration of the hosted engine.  I played with everything related
to setting maintenance and rebooting hosts, but nothing brought
them up to a point where I could migrate the hosted engine.
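
(For reference, I was checking the scores with the standard status
command, roughly:

    hosted-engine --vm-status

which lists each host's id, engine state and score; the host running
the engine showed 2400 and the others were stuck at 0.)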

I thought it was related to oVirt messing up when deploying the
other hosts (I told it not to modify the firewall, which I had
disabled, but the deploy process forcibly re-enabled the firewall,
which gluster really didn't like).  Now, after reading this, it
appears my assumption may be false.

Previously, a 2-node cluster had worked fine for me, but I wanted
to go to 3 nodes so I could enable quorum on gluster and not risk
split-brain issues.
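
(The quorum settings I had in mind are the usual gluster volume options,
something like the following, with "engine-vol" standing in for whatever
the backing volume is actually called:

    gluster volume set engine-vol cluster.quorum-type auto
    gluster volume set engine-vol cluster.server-quorum-type server

i.e. client-side plus server-side quorum, which only really makes sense
with three nodes.)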

-Brad


On 6/10/14 1:19 AM, Andrew Lau wrote:
> I'm really having a hard time finding out why it's happening.
>
> If I set the cluster to global maintenance for a minute or two, the scores
> will reset back to 2400. Set maintenance mode back to none, and all will be
> fine until a migration occurs. It seems it tries to migrate, fails, and sets
> the score to 0 permanently, rather than for the ~10 minutes mentioned in
> one of the oVirt slides.
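>
> (The reset I'm describing is just the stock hosted-engine maintenance
> commands, roughly:
>
>     hosted-engine --set-maintenance --mode=global   # scores come back to 2400
>     hosted-engine --set-maintenance --mode=none     # back to normal HA operation
>     hosted-engine --vm-status                       # confirm the scores
>
> nothing more exotic than that.)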
>
> With two hosts, the score only drops to 0 when a migration occurs
> (and only on the host which doesn't have the engine up). That only
> happens when it tries to migrate after I set the host to local
> maintenance. Migrating the VM from the UI has worked quite a few
> times, but it's recently started to fail.
>
> When I have three hosts, after ~5 minutes of them all being up, the
> score will hit 0 on the hosts not running the VM. It doesn't even have
> to attempt a migration before the score goes to 0. Stopping the HA agent
> on one host and "resetting" it with the global maintenance method
> brings it back to the two-host scenario above.
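>
> (By "stopping the ha agent" I just mean the standard services, something
> along the lines of:
>
>     service ovirt-ha-agent stop    # HA agent on the affected host
>     service ovirt-ha-broker stop   # and its broker
>
> followed by the global maintenance cycle above and starting them again.)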
>
> I may move on and just go back to a standalone engine, as I'm not
> having much luck with this.
>
> On Tue, Jun 10, 2014 at 3:11 PM, combuster <combuster at archlinux.us> wrote:
>> Nah, I explicitly allowed the hosted-engine VM to access the NAS
>> device, i.e. the NFS share itself, before the deploy procedure even started.
>> But I'm puzzled at how you can reproduce the bug; all was well on my setup
>> before I started a manual migration of the engine's VM. Even auto migration
>> worked before that (I tested it). Does it just happen without any procedure on
>> the engine itself? Is the score 0 for just one node, or for two of the three?
>>
>> On 06/10/2014 01:02 AM, Andrew Lau wrote:
>>>
>>> nvm, just as I hit send the error has returned.
>>> Ignore this..
>>>
>>> On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau <andrew at andrewklau.com> wrote:
>>>>
>>>> So after adding L3 capabilities to my storage network, I'm no
>>>> longer seeing this issue. So the engine needs to be able to
>>>> access the storage domain it sits on? But that doesn't show up in the
>>>> UI?
>>>>
>>>> Ivan, was this also the case with your setup? Engine couldn't access
>>>> storage domain?
>>>>
>>>> On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau <andrew at andrewklau.com> wrote:
>>>>>
>>>>> Interesting, my storage network is L2 only and doesn't run on
>>>>> ovirtmgmt (which is the only thing HostedEngine sees), but I've only
>>>>> seen this issue when running ctdb in front of my NFS server. I was
>>>>> previously using localhost, as all my hosts had the NFS server on
>>>>> them (gluster).
>>>>>
>>>>> On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <alukiano at redhat.com>
>>>>> wrote:
>>>>>>
>>>>>> I just blocked the connection to storage for testing, and as a result I got
>>>>>> this error: "Failed to acquire lock error -243", so I added it to the
>>>>>> reproduce steps.
>>>>>> If you know other steps to reproduce this error without blocking the
>>>>>> connection to storage, it would be wonderful if you could provide them.
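>>>>>>
>>>>>> The blocking itself was nothing special, just iptables on the host,
>>>>>> along these lines (the address is only a placeholder for the storage
>>>>>> domain / NFS server IP):
>>>>>>
>>>>>>     SD_IP=10.0.0.50                        # placeholder storage IP
>>>>>>     iptables -I INPUT -s "$SD_IP" -j DROP  # block traffic from the storage
>>>>>>     iptables -D INPUT -s "$SD_IP" -j DROP  # remove the rule to restore access
>>>>>>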
>>>>>> Thanks
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Andrew Lau" <andrew at andrewklau.com>
>>>>>> To: "combuster" <combuster at archlinux.us>
>>>>>> Cc: "users" <users at ovirt.org>
>>>>>> Sent: Monday, June 9, 2014 3:47:00 AM
>>>>>> Subject: Re: [ovirt-users] VM HostedEngine is down. Exit message:
>>>>>> internal error Failed to acquire lock error -243
>>>>>>
>>>>>> I just ran a few extra tests. I had a 2-host hosted-engine setup running
>>>>>> for a day; both hosts had a score of 2400. I migrated the VM through the
>>>>>> UI multiple times and it all worked fine. I then added the third host, and
>>>>>> that's when it all fell to pieces.
>>>>>> The other two hosts have a score of 0 now.
>>>>>>
>>>>>> I'm also curious, in the BZ there's a note about:
>>>>>>
>>>>>> where engine-vm block connection to storage domain(via iptables -I
>>>>>> INPUT -s sd_ip -j DROP)
>>>>>>
>>>>>> What's the purpose for that?
>>>>>>
>>>>>> On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <andrew at andrewklau.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Ignore that, the issue came back after 10 minutes.
>>>>>>>
>>>>>>> I've even tried a gluster mount + nfs server on top of that, and the
>>>>>>> same issue has come back.
>>>>>>>
>>>>>>> On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <andrew at andrewklau.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Interesting, I put it all into global maintenance and shut it all down
>>>>>>>> for ~10 minutes, and it regained its sanlock control and doesn't seem to
>>>>>>>> have that issue coming up in the log.
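>>>>>>>>
>>>>>>>> (By "regained its sanlock control" I'm going by the output of
>>>>>>>> something like:
>>>>>>>>
>>>>>>>>     sanlock client status   # lists the lockspaces/resources held on this host
>>>>>>>>
>>>>>>>> plus the absence of the -243 error in the logs.)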
>>>>>>>>
>>>>>>>> On Fri, Jun 6, 2014 at 4:21 PM, combuster <combuster at archlinux.us>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> It was pure NFS on a NAS device. They all had different ids (there were
>>>>>>>>> no redeployments of nodes before the problem occurred).
>>>>>>>>>
>>>>>>>>> Thanks Jirka.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>>>>>>>>
>>>>>>>>>> I've seen that problem in other threads; the common denominator was "nfs
>>>>>>>>>> on top of gluster". So if you have this setup, then it's a known problem.
>>>>>>>>>> Otherwise, you should double-check that your hosts have different ids,
>>>>>>>>>> because if they don't they will be trying to acquire the same lock.
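>>>>>>>>>>
>>>>>>>>>> (A quick way to check that, assuming the default hosted-engine
>>>>>>>>>> config location, is to run on each host:
>>>>>>>>>>
>>>>>>>>>>     grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf
>>>>>>>>>>
>>>>>>>>>> and make sure the host_id values are all different.)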
>>>>>>>>>>
>>>>>>>>>> --Jirka
>>>>>>>>>>
>>>>>>>>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Ivan,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the in-depth reply.
>>>>>>>>>>>
>>>>>>>>>>> I've only seen this happen twice, and only after I added a third host
>>>>>>>>>>> to the HA cluster. I wonder if that's the root problem.
>>>>>>>>>>>
>>>>>>>>>>> Have you seen this happen on all your installs, or only after your
>>>>>>>>>>> manual migration? It's a little frustrating that this is happening, as I
>>>>>>>>>>> was hoping to get this into a production environment. It was all working
>>>>>>>>>>> except for that log message :(
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combuster at archlinux.us>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> this is something that I saw in my logs too, first on one node and
>>>>>>>>>>>> then on the other three. When that happened on all four of them, the
>>>>>>>>>>>> engine was corrupted beyond repair.
>>>>>>>>>>>>
>>>>>>>>>>>> First of all, I think that message is saying that sanlock can't get a
>>>>>>>>>>>> lock on the shared storage that you defined for the hosted engine during
>>>>>>>>>>>> installation. I got this error when I tried to manually migrate the
>>>>>>>>>>>> hosted engine. There is an unresolved bug there, and I think it's related
>>>>>>>>>>>> to this one:
>>>>>>>>>>>>
>>>>>>>>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to
>>>>>>>>>>>> zero]
>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>>>>>>>>
>>>>>>>>>>>> This is a blocker bug (or should be) for the self-hosted engine and, from
>>>>>>>>>>>> my own experience with it, it shouldn't be used in a production
>>>>>>>>>>>> environment (not until it's fixed).
>>>>>>>>>>>>
>>>>>>>>>>>> Nothing that I did could fix the fact that the score for the target
>>>>>>>>>>>> node was zero: I tried reinstalling the node, rebooting the node,
>>>>>>>>>>>> restarting several services, tailing tons of logs, etc, but to no
>>>>>>>>>>>> avail. When only one node was left (the one actually running the
>>>>>>>>>>>> hosted engine), I brought the engine's vm down gracefully
>>>>>>>>>>>> (hosted-engine --vm-shutdown I believe) and after that, when I tried
>>>>>>>>>>>> to start the vm, it wouldn't load. VNC showed that the filesystem
>>>>>>>>>>>> inside the vm was corrupted, and when I ran fsck and finally started
>>>>>>>>>>>> it up, it was too badly damaged. I succeeded in starting the engine
>>>>>>>>>>>> itself (after repairing the postgresql service that wouldn't start),
>>>>>>>>>>>> but the database was damaged enough that it acted pretty weird (it
>>>>>>>>>>>> showed that storage domains were down while the vm's were running
>>>>>>>>>>>> fine, etc). Lucky me, I had already exported all of the VM's at the
>>>>>>>>>>>> first sign of trouble, and I then installed ovirt-engine on a
>>>>>>>>>>>> dedicated server and attached the export domain.
>>>>>>>>>>>>
>>>>>>>>>>>> So while it's a really useful feature, and it works for the most part
>>>>>>>>>>>> (i.e. automatic migration works), manually migrating the VM running the
>>>>>>>>>>>> hosted engine will lead to trouble.
>>>>>>>>>>>>
>>>>>>>>>>>> I hope that my experience with it will be of use to you. It happened to
>>>>>>>>>>>> me two weeks ago; ovirt-engine was current (3.4.1) and there was no fix
>>>>>>>>>>>> available.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Ivan
>>>>>>>>>>>>
>>>>>>>>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm seeing this weird message in my engine log
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-06-06 03:06:09,380 INFO
>>>>>>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>>>>>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>>>>>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>>>>>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>>>>>>>>> 2014-06-06 03:06:12,494 INFO
>>>>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>>>>> (DefaultQuartzScheduler_Worker-89) START,
>>>>>>>>>>>> DestroyVDSCommand(HostName =
>>>>>>>>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>>>>>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>>>>>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>>>>>>>>> 2014-06-06 03:06:12,561 INFO
>>>>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>>>>>>>>>> 62a9d4c1
>>>>>>>>>>>> 2014-06-06 03:06:12,652 INFO
>>>>>>>>>>>>
>>>>>>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>>>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>>>>>>>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>>>>>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>>>>>>>>
>>>>>>>>>>>> It also appears to occur on the other hosts in the cluster, except the
>>>>>>>>>>>> host which is running the hosted engine. So right now, with 3 servers,
>>>>>>>>>>>> it shows up twice in the engine UI.
>>>>>>>>>>>>
>>>>>>>>>>>> The engine VM continues to run peacefully, without any issues, on the
>>>>>>>>>>>> host which doesn't have that error.
>>>>>>>>>>>>
>>>>>>>>>>>> Any ideas?


