Nah, I explicitly allowed the hosted-engine VM to access the NAS device
as well as the NFS share itself, before the deploy procedure even
started. But I'm puzzled at how you can reproduce the bug; all was well
on my setup before I started the manual migration of the engine's VM. Even
auto migration worked before that (I tested it). Does it just happen
without any procedure on the engine itself? Is the score 0 for just one
node, or for two of the three?
On 06/10/2014 01:02 AM, Andrew Lau wrote:
nvm, just as I hit send the error has returned.
Ignore this..
On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau <andrew(a)andrewklau.com> wrote:
> So after adding the L3 capabilities to my storage network, I'm no
> longer seeing this issue. So the engine needs to be able to
> access the storage domain it sits on? But that doesn't show up in the
> UI?
>
> Ivan, was this also the case with your setup? Engine couldn't access
> storage domain?
>
> On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
>> Interesting, my storage network is L2 only and doesn't run on the
>> ovirtmgmt (which is the only thing HostedEngine sees) but I've only
>> seen this issue when running ctdb in front of my NFS server. I
>> previously was using localhost as all my hosts had the nfs server on
>> it (gluster).
>>
>> On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <alukiano(a)redhat.com> wrote:
>>> I just blocked the connection to storage for testing, and as a result I got
>>> this error: "Failed to acquire lock error -243", so I added it to the
>>> reproduction steps.
>>> If you know other steps to reproduce this error without blocking the
>>> connection to storage, it would be wonderful if you could provide them.
>>> Thanks
>>>
>>> ----- Original Message -----
>>> From: "Andrew Lau" <andrew(a)andrewklau.com>
>>> To: "combuster" <combuster(a)archlinux.us>
>>> Cc: "users" <users(a)ovirt.org>
>>> Sent: Monday, June 9, 2014 3:47:00 AM
>>> Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243
>>>
>>> I just ran a few extra tests, I had a 2 host, hosted-engine running
>>> for a day. They both had a score of 2400. Migrated the VM through the
>>> UI multiple times, all worked fine. I then added the third host, and
>>> that's when it all fell to pieces.
>>> Other two hosts have a score of 0 now.
>>>
>>> I'm also curious, in the BZ there's a note about:
>>>
>>> where engine-vm block connection to storage domain(via iptables -I
>>> INPUT -s sd_ip -j DROP)
>>>
>>> What's the purpose for that?
>>>
>>> On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
>>>> Ignore that, the issue came back after 10 minutes.
>>>>
>>>> I've even tried a gluster mount + nfs server on top of that, and the
>>>> same issue has come back.
>>>>
>>>> On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
>>>>> Interesting, I put it all into global maintenance. Shut it all down
>>>>> for ~10 minutes, and it's regained its sanlock control and doesn't
>>>>> seem to have that issue coming up in the log.
>>>>>
>>>>> On Fri, Jun 6, 2014 at 4:21 PM, combuster <combuster(a)archlinux.us> wrote:
>>>>>> It was pure NFS on a NAS device. They all had different ids (I had no
>>>>>> redeployments of nodes before the problem occurred).
>>>>>>
>>>>>> Thanks Jirka.
>>>>>>
>>>>>>
>>>>>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>>>>> I've seen that problem in other threads; the common denominator was
>>>>>>> "nfs on top of gluster". So if you have this setup, then it's a known
>>>>>>> problem. Otherwise, you should double-check that your hosts have
>>>>>>> different ids; if they don't, they would be trying to acquire the
>>>>>>> same lock.
>>>>>>>
>>>>>>> --Jirka
>>>>>>>
>>>>>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>>>>> Hi Ivan,
>>>>>>>>
>>>>>>>> Thanks for the in depth reply.
>>>>>>>>
>>>>>>>> I've only seen this happen twice, and only after I added a third host
>>>>>>>> to the HA cluster. I wonder if that's the root problem.
>>>>>>>>
>>>>>>>> Have you seen this happen on all your installs or only just after your
>>>>>>>> manual migration? It's a little frustrating this is happening as I was
>>>>>>>> hoping to get this into a production environment. It was all working
>>>>>>>> except that log message :(
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combuster(a)archlinux.us> wrote:
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> this is something that I saw in my logs too, first on one node and
>>>>>>>>> then on the other three. When that happened on all four of them,
>>>>>>>>> the engine was corrupted beyond repair.
>>>>>>>>>
>>>>>>>>> First of all, I think that message is saying that sanlock can't get
>>>>>>>>> a lock on the shared storage that you defined for the hosted engine
>>>>>>>>> during installation. I got this error when I tried to manually
>>>>>>>>> migrate the hosted engine. There is an unresolved bug there and I
>>>>>>>>> think it's related to this one:
>>>>>>>>>
>>>>>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score
>>>>>>>>> to zero]
>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>>>>>
>>>>>>>>> This is a blocker bug (or should be) for the self-hosted engine and,
>>>>>>>>> from my own experience with it, the feature shouldn't be used in a
>>>>>>>>> production environment (not until it's fixed).
>>>>>>>>>
>>>>>>>>> Nothing I did could fix the fact that the score for the target
>>>>>>>>> node was zero. I tried reinstalling the node, rebooting the node,
>>>>>>>>> restarting several services, tailing tons of logs etc., but to no
>>>>>>>>> avail. When only one node was left (the one actually running the
>>>>>>>>> hosted engine), I brought the engine's VM down gracefully
>>>>>>>>> (hosted-engine --vm-shutdown, I believe) and after that, when I
>>>>>>>>> tried to start the VM, it wouldn't load. Running VNC showed that
>>>>>>>>> the filesystem inside the VM was corrupted, and when I ran fsck and
>>>>>>>>> finally started it up, it was too badly damaged. I managed to
>>>>>>>>> start the engine itself (after repairing the postgresql service,
>>>>>>>>> which wouldn't start), but the database was damaged enough that it
>>>>>>>>> acted pretty weird (it showed that storage domains were down while
>>>>>>>>> the VMs were running fine, etc.). Lucky me, I had already exported
>>>>>>>>> all of the VMs at the first sign of trouble, and then installed
>>>>>>>>> ovirt-engine on a dedicated server and attached the export
>>>>>>>>> domain.
>>>>>>>>>
>>>>>>>>> So while it's a really useful feature, and it's working for the most
>>>>>>>>> part (i.e., automatic migration works), manually migrating the
>>>>>>>>> hosted-engine VM will lead to trouble.
>>>>>>>>>
>>>>>>>>> I hope that my experience with it will be of use to you. It happened
>>>>>>>>> to me two weeks ago; ovirt-engine was current (3.4.1) and there was
>>>>>>>>> no fix available.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Ivan
>>>>>>>>>
>>>>>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm seeing this weird message in my engine log
>>>>>>>>>
>>>>>>>>> 2014-06-06 03:06:09,380 INFO
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>>>>>> 2014-06-06 03:06:12,494 INFO
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>>>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>>>>>> 2014-06-06 03:06:12,561 INFO
>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>>>>>>> 62a9d4c1
>>>>>>>>> 2014-06-06 03:06:12,652 INFO
>>>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>>>>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>>>>>
>>>>>>>>> It also appears to occur on the other hosts in the cluster, except the
>>>>>>>>> host which is running the hosted-engine. So right now, with 3 servers,
>>>>>>>>> it shows up twice in the engine UI.
>>>>>>>>>
>>>>>>>>> The engine VM continues to run peacefully, without any issues, on the
>>>>>>>>> host which doesn't have that error.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>> _______________________________________________
>>>>>>>>> Users mailing list
>>>>>>>>> Users(a)ovirt.org
>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>
>>>>>>>>>