Ok, I thought I was doing something wrong yesterday, so I just
tore down my 3-node cluster with the hosted engine and started
rebuilding. I was seeing essentially the same thing: a score of
0 on the hosts not running the engine VM, and it wouldn't allow
migration of the hosted engine. I played with all things related
to setting maintenance and rebooting hosts; nothing brought them
up to a point where I could migrate the hosted engine.
I thought it was related to ovirt messing up when deploying the
other hosts (I told it not to modify the firewall, which I had
disabled, but the deploy process forcibly re-enabled the firewall,
which gluster really didn't like). Now, after reading this, it
appears my assumption may be false.
Previously a 2-node cluster had worked fine for me, but I wanted
to go to 3 nodes so I could enable quorum on gluster and not risk
split-brain issues.
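For reference, the quorum settings I have in mind are roughly the
following (just a sketch; "engine-vol" is a placeholder for the actual
volume name, and the options should be checked against your gluster
version):

  # client-side quorum: the replica set only accepts writes while a
  # majority of its bricks are reachable
  gluster volume set engine-vol cluster.quorum-type auto

  # server-side quorum: glusterd stops local bricks if the trusted
  # pool itself loses quorum
  gluster volume set engine-vol cluster.server-quorum-type server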
-Brad
On 6/10/14 1:19 AM, Andrew Lau wrote:
I'm really having a hard time finding out why it's
happening.
If I set the cluster to global maintenance for a minute or two, the
scores will reset back to 2400. Set maintenance mode to none, and all
will be fine until a migration occurs. It seems it tries to migrate,
fails, and sets the score to 0 permanently rather than for the ~10
minutes mentioned in one of the oVirt slides.
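For clarity, the "set to global" step above is just the usual
commands (a rough sketch, run from any of the HA hosts):

  # put the whole HA cluster into global maintenance for a minute or two
  hosted-engine --set-maintenance --mode=global

  # watch the agent scores come back to 2400
  hosted-engine --vm-status

  # then take it back out of maintenance
  hosted-engine --set-maintenance --mode=none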
When I have two hosts, the score goes to 0 only when a migration
occurs (and just on the host which doesn't have the engine up). The
score only hits 0 once it has tried to migrate because I set the host
to local maintenance. Migrating the VM from the UI has worked quite a
few times, but it has recently started to fail.
When I have three hosts, after ~5 minutes of them all being up the
score will hit 0 on the hosts not running the engine VM. It doesn't
even have to attempt a migration before the score goes to 0. Stopping
the ha agent on one host, and "resetting" it with the global
maintenance method, brings it back to the 2-host scenario above.
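(By "resetting" I mean something like the sketch below, assuming the
standard ovirt-ha-agent service name; use systemctl instead of service
on EL7:)

  hosted-engine --set-maintenance --mode=global
  service ovirt-ha-agent stop    # or: systemctl stop ovirt-ha-agent
  service ovirt-ha-agent start   # or: systemctl start ovirt-ha-agent
  hosted-engine --set-maintenance --mode=none
  hosted-engine --vm-status      # the score should read 2400 again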
I may move on and just go back to a standalone engine, as I'm not
having much luck with this.
On Tue, Jun 10, 2014 at 3:11 PM, combuster <combuster(a)archlinux.us> wrote:
> Nah, I've explicitly allowed the hosted-engine vm to access the NAS
> device, i.e. the NFS share itself, before the deploy procedure even
> started. But I'm puzzled at how you can reproduce the bug; all was well
> on my setup before I started a manual migration of the engine's vm. Even
> auto migration worked before that (tested it). Does it just happen
> without any procedure on the engine itself? Is the score 0 for just one
> node, or for two of the three of them?
>
> On 06/10/2014 01:02 AM, Andrew Lau wrote:
>>
>> nvm, just as I hit send the error has returned.
>> Ignore this.
>>
>> On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau <andrew(a)andrewklau.com> wrote:
>>>
>>> So after adding L3 capabilities to my storage network, I'm no
>>> longer seeing this issue. So the engine needs to be able to access
>>> the storage domain it sits on? But that doesn't show up in the UI?
>>>
>>> Ivan, was this also the case with your setup? Engine couldn't access
>>> storage domain?
>>>
>>> On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
>>>>
>>>> Interesting, my storage network is L2 only and doesn't run on the
>>>> ovirtmgmt network (which is the only thing HostedEngine sees), but
>>>> I've only seen this issue when running ctdb in front of my NFS
>>>> server. I was previously using localhost, as all my hosts had the
>>>> nfs server on them (gluster).
>>>>
>>>> On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov <alukiano(a)redhat.com> wrote:
>>>>>
>>>>> I just blocked the connection to storage for testing, and as a
>>>>> result I got this error: "Failed to acquire lock error -243", so I
>>>>> added it to the reproduce steps.
>>>>> If you know other steps to reproduce this error without blocking the
>>>>> connection to storage, it would be wonderful if you could provide
>>>>> them.
>>>>> Thanks
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Andrew Lau" <andrew(a)andrewklau.com>
>>>>> To: "combuster" <combuster(a)archlinux.us>
>>>>> Cc: "users" <users(a)ovirt.org>
>>>>> Sent: Monday, June 9, 2014 3:47:00 AM
>>>>> Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message:
>>>>> internal error Failed to acquire lock error -243
>>>>>
>>>>> I just ran a few extra tests. I had a 2-host hosted-engine setup
>>>>> running for a day; both hosts had a score of 2400. I migrated the VM
>>>>> through the UI multiple times and all worked fine. I then added the
>>>>> third host, and that's when it all fell to pieces.
>>>>> The other two hosts have a score of 0 now.
>>>>>
>>>>> I'm also curious, in the BZ there's a note about:
>>>>>
>>>>> where engine-vm block connection to storage domain (via iptables -I
>>>>> INPUT -s sd_ip -j DROP)
>>>>>
>>>>> What's the purpose for that?
>>>>>
>>>>> On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
>>>>>>
>>>>>> Ignore that, the issue came back after 10 minutes.
>>>>>>
>>>>>> I've even tried a gluster mount + nfs server on top of that, and
>>>>>> the same issue has come back.
>>>>>>
>>>>>> On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
>>>>>>>
>>>>>>> Interesting, I put it all into global maintenance, shut it all
>>>>>>> down for ~10 minutes, and it has regained its sanlock control and
>>>>>>> doesn't seem to have that issue coming up in the log.
>>>>>>>
>>>>>>> On Fri, Jun 6, 2014 at 4:21 PM, combuster <combuster(a)archlinux.us> wrote:
>>>>>>>>
>>>>>>>> It was pure NFS on a NAS device. They all had different ids
>>>>>>>> (there were no redeployments of nodes before the problem occurred).
>>>>>>>>
>>>>>>>> Thanks Jirka.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>>>>>>>
>>>>>>>>> I've seen that problem in other threads; the common denominator
>>>>>>>>> was "nfs on top of gluster". So if you have this setup, then it's
>>>>>>>>> a known problem. Otherwise, you should double-check that your
>>>>>>>>> hosts have different ids; if not, they would be trying to acquire
>>>>>>>>> the same lock.
>>>>>>>>>
>>>>>>>>> --Jirka
>>>>>>>>>
>>>>>>>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Ivan,
>>>>>>>>>>
>>>>>>>>>> Thanks for the in depth reply.
>>>>>>>>>>
>>>>>>>>>> I've only seen this happen twice, and only after I added a
>>>>>>>>>> third host to the HA cluster. I wonder if that's the root
>>>>>>>>>> problem.
>>>>>>>>>>
>>>>>>>>>> Have you seen this happen on all your installs or only just
>>>>>>>>>> after your manual migration? It's a little frustrating this is
>>>>>>>>>> happening, as I was hoping to get this into a production
>>>>>>>>>> environment. It was all working except that log message :(
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combuster(a)archlinux.us> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> this is something that I saw in my logs too, first on one node
>>>>>>>>>>> and then on the other three. When that happened on all four of
>>>>>>>>>>> them, the engine was corrupted beyond repair.
>>>>>>>>>>>
>>>>>>>>>>> First of all, I think that message is saying that sanlock can't
>>>>>>>>>>> get a lock on the shared storage that you defined for the hosted
>>>>>>>>>>> engine during installation. I got this error when I tried to
>>>>>>>>>>> manually migrate the hosted engine. There is an unresolved bug
>>>>>>>>>>> there and I think it's related to this one:
>>>>>>>>>>>
>>>>>>>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host
>>>>>>>>>>> score to zero]
>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>>>>>>>
>>>>>>>>>>> This is a blocker bug (or should be) for the self-hosted engine
>>>>>>>>>>> and, from my own experience with it, it shouldn't be used in a
>>>>>>>>>>> production environment (not until it's fixed).
>>>>>>>>>>>
>>>>>>>>>>> Nothing that I did could fix the fact that the score for the
>>>>>>>>>>> target node was zero; I tried to reinstall the node, reboot the
>>>>>>>>>>> node, restarted several services, tailed tons of logs etc., but
>>>>>>>>>>> to no avail. When only one node was left (the one actually
>>>>>>>>>>> running the hosted engine), I brought the engine's vm down
>>>>>>>>>>> gracefully (hosted-engine --vm-shutdown, I believe) and after
>>>>>>>>>>> that, when I tried to start the vm, it wouldn't load. VNC showed
>>>>>>>>>>> that the filesystem inside the vm was corrupted, and when I ran
>>>>>>>>>>> fsck and finally got it started, it was too badly damaged. I
>>>>>>>>>>> succeeded in starting the engine itself (after repairing the
>>>>>>>>>>> postgresql service that wouldn't start), but the database was
>>>>>>>>>>> damaged enough that it acted pretty weird (it showed that
>>>>>>>>>>> storage domains were down while the vm's were running fine,
>>>>>>>>>>> etc.). Lucky for me, I had already exported all of the VM's at
>>>>>>>>>>> the first sign of trouble, so I then installed ovirt-engine on a
>>>>>>>>>>> dedicated server and attached the export domain.
>>>>>>>>>>>
>>>>>>>>>>> So while it's a really useful feature, and it's working (for
>>>>>>>>>>> the most part, i.e. automatic migration works), manually
>>>>>>>>>>> migrating the hosted-engine VM will lead to trouble.
>>>>>>>>>>>
>>>>>>>>>>> I hope that my experience with it will be of use to you. It
>>>>>>>>>>> happened to me two weeks ago; ovirt-engine was current (3.4.1)
>>>>>>>>>>> and there was no fix available.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Ivan
>>>>>>>>>>>
>>>>>>>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'm seeing this weird message in my engine log
>>>>>>>>>>>
>>>>>>>>>>> 2014-06-06 03:06:09,380 INFO
>>>>>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>>>>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>>>>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>>>>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>>>>>>>> 2014-06-06 03:06:12,494 INFO
>>>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>>>>>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>>>>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>>>>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>>>>>>>> 2014-06-06 03:06:12,561 INFO
>>>>>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>>>>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>>>>>>>>> 62a9d4c1
>>>>>>>>>>> 2014-06-06 03:06:12,652 INFO
>>>>>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>>>>>>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>>>>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>>>>>>>
>>>>>>>>>>> It also appears to occur on the other hosts in the cluster,
>>>>>>>>>>> except the host which is running the hosted engine. So right
>>>>>>>>>>> now, with 3 servers, it shows up twice in the engine UI.
>>>>>>>>>>>
>>>>>>>>>>> The engine VM continues to run peacefully, without any issues,
>>>>>>>>>>> on the host which doesn't have that error.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas?
>>>>>>>>>>>