Interesting, my storage network is L2 only and doesn't run on the
ovirtmgmt network (which is the only thing the HostedEngine sees), but
I've only seen this issue when running ctdb in front of my NFS server.
Previously I was using localhost, as all my hosts ran the NFS server
locally (gluster).
On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <alukiano(a)redhat.com> wrote:
I just blocked the connection to storage for testing, and as a result I got
this error: "Failed to acquire lock error -243", so I added it to the
reproduction steps.
If you know other steps that reproduce this error without blocking the
connection to storage, it would also be wonderful if you could provide them.
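For reference, the block itself is just a plain iptables rule along these
lines (sd_ip stands for the storage domain's IP, so adjust for your setup;
the -D form simply removes the rule again afterwards):

  iptables -I INPUT -s sd_ip -j DROP    # drop all traffic from the storage domain
  iptables -D INPUT -s sd_ip -j DROP    # remove the rule to restore connectivity
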
Thanks
----- Original Message -----
From: "Andrew Lau" <andrew(a)andrewklau.com>
To: "combuster" <combuster(a)archlinux.us>
Cc: "users" <users(a)ovirt.org>
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed
to acquire lock error -243
I just ran a few extra tests. I had a 2-host hosted-engine setup running
for a day, and both hosts had a score of 2400. I migrated the VM through
the UI multiple times and it all worked fine. I then added the third host,
and that's when it all fell to pieces.
The other two hosts have a score of 0 now.
I'm also curious, in the BZ there's a note about:
"where engine-vm block connection to storage domain (via iptables -I
INPUT -s sd_ip -j DROP)"
What's the purpose of that?
On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
> Ignore that, the issue came back after 10 minutes.
>
> I've even tried a gluster mount + nfs server on top of that, and the
> same issue has come back.
>
> On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
>> Interesting, I put it all into global maintenance, shut it all down
>> for ~10 minutes, and it regained its sanlock control and doesn't
>> seem to have that issue coming up in the log any more.
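>>
>> For the record, the maintenance toggling is done with the hosted-engine
>> CLI, roughly like this (a sketch; check the options on your version):
>>
>>   hosted-engine --set-maintenance --mode=global   # HA agents stop managing the engine VM
>>   hosted-engine --set-maintenance --mode=none     # hand control back to the HA agents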
>>
>> On Fri, Jun 6, 2014 at 4:21 PM, combuster <combuster(a)archlinux.us> wrote:
>>> It was pure NFS on a NAS device. They all had different ids (there were
>>> no redeployments of nodes before the problem occurred).
>>>
>>> Thanks Jirka.
>>>
>>>
>>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>>
>>>> I've seen that problem in other threads, the common denominator was "nfs
>>>> on top of gluster". So if you have this setup, then it's a known problem.
>>>> If not, you should double check that your hosts have different ids,
>>>> otherwise they would be trying to acquire the same lock.
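>>>>
>>>> (A quick way to compare the ids, assuming the default config location, is
>>>> to check host_id on each host:
>>>>
>>>>   grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf
>>>>
>>>> Every host should report a different value.)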
>>>>
>>>> --Jirka
>>>>
>>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>>
>>>>> Hi Ivan,
>>>>>
>>>>> Thanks for the in-depth reply.
>>>>>
>>>>> I've only seen this happen twice, and only after I added a third host
>>>>> to the HA cluster. I wonder if that's the root problem.
>>>>>
>>>>> Have you seen this happen on all your installs, or only after your
>>>>> manual migration? It's a little frustrating that this is happening, as
>>>>> I was hoping to get this into a production environment. It was all
>>>>> working except for that log message :(
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combuster(a)archlinux.us> wrote:
>>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> this is something that I saw in my logs too, first on one node and then
>>>>>> on the other three. When that happened on all four of them, the engine
>>>>>> was corrupted beyond repair.
>>>>>>
>>>>>> First of all, I think that message is saying that sanlock can't get a
>>>>>> lock on the shared storage that you defined for the hosted engine during
>>>>>> installation. I got this error when I tried to manually migrate the
>>>>>> hosted engine. There is an unresolved bug there and I think it's related
>>>>>> to this one:
>>>>>>
>>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to
>>>>>> zero]
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>>
>>>>>> This is (or should be) a blocker bug for the self-hosted engine and, from
>>>>>> my own experience with it, it shouldn't be used in a production
>>>>>> environment (not until it's fixed).
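>>>>>>
>>>>>> (If you want to see what sanlock itself thinks is going on, and assuming
>>>>>> the sanlock client tool is installed on the hosts, this lists the
>>>>>> lockspaces joined and resources held on the host where you run it:
>>>>>>
>>>>>>   sanlock client status
>>>>>>
>>>>>> Running it on each host shows who actually holds the lock the others
>>>>>> fail to acquire.)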
>>>>>>
>>>>>> Nothing I did could fix the fact that the score for the target node was
>>>>>> zero: I tried reinstalling the node, rebooting it, restarting several
>>>>>> services, tailing tons of logs, etc., but to no avail. When only one node
>>>>>> was left (the one actually running the hosted engine), I brought the
>>>>>> engine's VM down gracefully (hosted-engine --vm-shutdown, I believe) and
>>>>>> after that, when I tried to start the VM, it wouldn't load. VNC showed
>>>>>> that the filesystem inside the VM was corrupted, and even after I ran
>>>>>> fsck and finally got it started, it was too badly damaged. I managed to
>>>>>> start the engine itself (after repairing the postgresql service, which
>>>>>> wouldn't start), but the database was damaged enough that it acted pretty
>>>>>> weird (it showed that storage domains were down while the VMs were
>>>>>> running fine, etc.). Luckily, I had already exported all of the VMs at
>>>>>> the first sign of trouble, then installed ovirt-engine on a dedicated
>>>>>> server and attached the export domain.
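>>>>>>
>>>>>> In case it helps, the shutdown/startup part was done with the
>>>>>> hosted-engine CLI, roughly these commands (from memory, so double-check
>>>>>> against hosted-engine --help on your version):
>>>>>>
>>>>>>   hosted-engine --vm-status     # check engine VM state and host scores
>>>>>>   hosted-engine --vm-shutdown   # gracefully shut down the engine VM
>>>>>>   hosted-engine --vm-start      # try to start it again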
>>>>>>
>>>>>> So while it's really a useful feature, and it works for the most part
>>>>>> (i.e. automatic migration works), manually migrating the VM with the
>>>>>> hosted engine will lead to trouble.
>>>>>>
>>>>>> I hope that my experience with it will be of use to you. It happened to
>>>>>> me two weeks ago, ovirt-engine was current (3.4.1) and there was no fix
>>>>>> available.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm seeing this weird message in my engine log
>>>>>>
>>>>>> 2014-06-06 03:06:09,380 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>>> 2014-06-06 03:06:12,494 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>>> 2014-06-06 03:06:12,561 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>>>> 62a9d4c1
>>>>>> 2014-06-06 03:06:12,652 INFO
>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>>
>>>>>> It also appears to occur on the other hosts in the cluster, except the
>>>>>> host which is running the hosted-engine. So right now, with 3 servers,
>>>>>> it shows up twice in the engine UI.
>>>>>>
>>>>>> The engine VM continues to run peacefully, without any issues, on the
>>>>>> host which doesn't have that error.
>>>>>> Any ideas?
_______________________________________________
Users mailing list
Users(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/users