[ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock error -243
Andrew Lau
andrew at andrewklau.com
Mon Jun 9 19:02:02 EDT 2014
nvm, just as I hit send the error has returned.
Ignore this..
On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau <andrew at andrewklau.com> wrote:
> So after adding L3 capabilities to my storage network, I'm no longer
> seeing this issue. So the engine needs to be able to access the
> storage domain it sits on? But that doesn't show up in the UI?
>
> Ivan, was this also the case with your setup? Engine couldn't access
> storage domain?
>
> On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau <andrew at andrewklau.com> wrote:
>> Interesting, my storage network is L2 only and doesn't run on
>> ovirtmgmt (which is the only thing HostedEngine sees), but I've only
>> seen this issue when running ctdb in front of my NFS server. I was
>> previously using localhost, as all my hosts had the NFS server on
>> them (gluster).
>>
>> On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <alukiano at redhat.com> wrote:
>>> I just blocked the connection to storage for testing, and as a result I got this error: "Failed to acquire lock error -243", so I added it to the reproduction steps.
>>> If you know other steps that reproduce this error without blocking the connection to storage, it would be wonderful if you could provide them.
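>>>
>>> For reference, the block/unblock looks roughly like this (sd_ip standing
>>> in for the storage domain's IP, as in the BZ note; treat it as a sketch):
>>>
>>>   # on the engine VM: drop incoming traffic from the storage domain
>>>   iptables -I INPUT -s sd_ip -j DROP
>>>
>>>   # later, delete the same rule to restore connectivity
>>>   iptables -D INPUT -s sd_ip -j DROP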
>>> Thanks
>>>
>>> ----- Original Message -----
>>> From: "Andrew Lau" <andrew at andrewklau.com>
>>> To: "combuster" <combuster at archlinux.us>
>>> Cc: "users" <users at ovirt.org>
>>> Sent: Monday, June 9, 2014 3:47:00 AM
>>> Subject: Re: [ovirt-users] VM HostedEngine is down. Exit message: internal error Failed to acquire lock error -243
>>>
>>> I just ran a few extra tests. I had a 2-host hosted-engine setup
>>> running for a day; both hosts had a score of 2400. I migrated the VM
>>> through the UI multiple times and it all worked fine. I then added the
>>> third host, and that's when it all fell to pieces.
>>> The other two hosts have a score of 0 now.
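>>>
>>> For anyone wanting to check the same thing, the per-host score is
>>> visible via the hosted-engine CLI (a minimal check, assuming the
>>> standard tool is installed on each host):
>>>
>>>   # reports the ha-agent state for every host, including its score
>>>   hosted-engine --vm-status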
>>>
>>> I'm also curious, in the BZ there's a note about:
>>>
>>> where engine-vm blocks connection to storage domain (via iptables -I
>>> INPUT -s sd_ip -j DROP)
>>>
>>> What's the purpose for that?
>>>
>>> On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <andrew at andrewklau.com> wrote:
>>>> Ignore that, the issue came back after 10 minutes.
>>>>
>>>> I've even tried a gluster mount + nfs server on top of that, and the
>>>> same issue has come back.
>>>>
>>>> On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <andrew at andrewklau.com> wrote:
>>>>> Interesting, I put it all into global maintenance, shut it all down
>>>>> for ~10 minutes, and it's regained its sanlock control and doesn't
>>>>> seem to have that issue coming up in the log.
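>>>>>
>>>>> Roughly the sequence, in case it's useful (standard hosted-engine CLI,
>>>>> from memory, so treat the exact flags as a sketch):
>>>>>
>>>>>   # stop the HA agents from interfering while things settle
>>>>>   hosted-engine --set-maintenance --mode=global
>>>>>
>>>>>   # take the engine VM down cleanly, wait ~10 minutes, bring it back
>>>>>   hosted-engine --vm-shutdown
>>>>>   hosted-engine --vm-start
>>>>>
>>>>>   # hand control back to the HA agents
>>>>>   hosted-engine --set-maintenance --mode=none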
>>>>>
>>>>> On Fri, Jun 6, 2014 at 4:21 PM, combuster <combuster at archlinux.us> wrote:
>>>>>> It was pure NFS on a NAS device. They all had different ids (there
>>>>>> were no redeployments of nodes before the problem occurred).
>>>>>>
>>>>>> Thanks Jirka.
>>>>>>
>>>>>>
>>>>>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>>>>>
>>>>>>> I've seen that problem in other threads; the common denominator was "nfs
>>>>>>> on top of gluster". So if you have this setup, then it's a known problem.
>>>>>>> Otherwise you should double-check that your hosts have different ids,
>>>>>>> because if not they would be trying to acquire the same lock.
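>>>>>>>
>>>>>>> A quick way to double-check, assuming the default hosted-engine
>>>>>>> config paths on each host:
>>>>>>>
>>>>>>>   # each host should report a unique host_id
>>>>>>>   grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf
>>>>>>>
>>>>>>>   # and sanlock should show that host id in its lockspace
>>>>>>>   sanlock client status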
>>>>>>>
>>>>>>> --Jirka
>>>>>>>
>>>>>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>>>>>
>>>>>>>> Hi Ivan,
>>>>>>>>
>>>>>>>> Thanks for the in depth reply.
>>>>>>>>
>>>>>>>> I've only seen this happen twice, and only after I added a third host
>>>>>>>> to the HA cluster. I wonder if that's the root problem.
>>>>>>>>
>>>>>>>> Have you seen this happen on all your installs, or only after your
>>>>>>>> manual migration? It's a little frustrating this is happening, as I was
>>>>>>>> hoping to get this into a production environment. It was all working
>>>>>>>> except for that log message :(
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combuster at archlinux.us> wrote:
>>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> this is something that I saw in my logs too, first on one node and then
>>>>>>>>> on the other three. When that happened on all four of them, the engine
>>>>>>>>> was corrupted beyond repair.
>>>>>>>>>
>>>>>>>>> First of all, I think that message is saying that sanlock can't get a
>>>>>>>>> lock on the shared storage that you defined for the hosted engine
>>>>>>>>> during installation. I got this error when I tried to manually migrate
>>>>>>>>> the hosted engine. There is an unresolved bug there, and I think it's
>>>>>>>>> related to this one:
>>>>>>>>>
>>>>>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to
>>>>>>>>> zero]
>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>>>>>
>>>>>>>>> This is (or should be) a blocker bug for the self-hosted engine and,
>>>>>>>>> from my own experience with it, it shouldn't be used in a production
>>>>>>>>> environment (not until it's fixed).
>>>>>>>>>
>>>>>>>>> Nothing that I did could fix the fact that the score for the target
>>>>>>>>> node was zero: I tried reinstalling the node, rebooting it, restarting
>>>>>>>>> several services, tailing tons of logs, etc., but to no avail. When
>>>>>>>>> only one node was left (the one actually running the hosted engine), I
>>>>>>>>> brought the engine's VM down gracefully (hosted-engine --vm-shutdown,
>>>>>>>>> I believe), and after that, when I tried to start the VM, it wouldn't
>>>>>>>>> boot. VNC showed that the filesystem inside the VM was corrupted; I
>>>>>>>>> ran fsck and finally got it started, but it was too badly damaged. I
>>>>>>>>> succeeded in starting the engine itself (after repairing the
>>>>>>>>> postgresql service, which wouldn't start), but the database was
>>>>>>>>> damaged enough that it acted pretty weird (it showed that storage
>>>>>>>>> domains were down while the VMs were running fine, etc.). Luckily, I
>>>>>>>>> had already exported all of the VMs at the first sign of trouble, then
>>>>>>>>> installed ovirt-engine on a dedicated server and attached the export
>>>>>>>>> domain.
>>>>>>>>>
>>>>>>>>> So while it's a really useful feature and it works for the most part
>>>>>>>>> (i.e. automatic migration works), manually migrating the hosted-engine
>>>>>>>>> VM will lead to trouble.
>>>>>>>>>
>>>>>>>>> I hope that my experience with it will be of use to you. It happened
>>>>>>>>> to me two weeks ago; ovirt-engine was current (3.4.1) and there was no
>>>>>>>>> fix available.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Ivan
>>>>>>>>>
>>>>>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm seeing this weird message in my engine log
>>>>>>>>>
>>>>>>>>> 2014-06-06 03:06:09,380 INFO
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>>>>>> 2014-06-06 03:06:12,494 INFO
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>>>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>>>>>> 2014-06-06 03:06:12,561 INFO
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>>>>>>> 62a9d4c1
>>>>>>>>> 2014-06-06 03:06:12,652 INFO
>>>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>>>>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>>>>>
>>>>>>>>> It also appears to occur on the other hosts in the cluster, except the
>>>>>>>>> host which is running the hosted-engine. So right now, with 3 servers,
>>>>>>>>> it shows up twice in the engine UI.
>>>>>>>>>
>>>>>>>>> The engine VM continues to run peacefully, without any issues on the
>>>>>>>>> host which doesn't have that error.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>> _______________________________________________
>>>>>>>>> Users mailing list
>>>>>>>>> Users at ovirt.org
>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>