I just ran a few extra tests. I had a 2-host hosted-engine setup running
for a day, and both hosts had a score of 2400. I migrated the VM through
the UI multiple times and it all worked fine. I then added the third host,
and that's when it all fell to pieces.
The other two hosts now have a score of 0.
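For reference, the scores can be checked from any HA host with the
hosted-engine CLI; a minimal check (output format may vary between
versions) is just:

  # show each host's HA score and the engine VM state as the agents see it
  hosted-engine --vm-status
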
I'm also curious: in the BZ there's a note about
"where engine-vm block connection to storage domain (via iptables -I
INPUT -s sd_ip -j DROP)"
What's the purpose of that?
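If I'm reading it right, that rule just makes the host drop traffic coming
from the storage domain's IP, presumably to simulate losing the storage
connection while reproducing the bug; something like this (with sd_ip being
the storage server's address, and only as a test, not something to leave in
place):

  # reproduction step from the BZ: block traffic from the storage domain
  iptables -I INPUT -s sd_ip -j DROP
  # remove the rule again once the test is done
  iptables -D INPUT -s sd_ip -j DROP
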
On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
Ignore that; the issue came back after 10 minutes.
I've even tried a gluster mount + nfs server on top of that, and the
same issue has come back.
On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau <andrew(a)andrewklau.com> wrote:
> Interesting. I put it all into global maintenance and shut it all down
> for ~10 minutes, and it's regained its sanlock control and doesn't
> seem to have that issue coming up in the log.
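>
> For reference, the maintenance switch I mean is the standard
> hosted-engine one, roughly:
>
>   # put the whole HA cluster into global maintenance
>   hosted-engine --set-maintenance --mode=global
>   # and take it out of maintenance again once everything is back up
>   hosted-engine --set-maintenance --mode=none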
>
> On Fri, Jun 6, 2014 at 4:21 PM, combuster <combuster(a)archlinux.us> wrote:
>> It was pure NFS on a NAS device. They all had different ids (there were
>> no redeployments of nodes before the problem occurred).
>>
>> Thanks Jirka.
>>
>>
>> On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:
>>>
>>> I've seen that problem in other threads; the common denominator was "nfs
>>> on top of gluster". So if you have this setup, then it's a known problem.
>>> Otherwise, you should double-check that your hosts have different ids,
>>> or they would be trying to acquire the same lock.
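>>>
>>> A quick way to check that, assuming the default config location, is to
>>> compare the host_id value on each host:
>>>
>>>   # each HA host must have a unique host_id in the hosted-engine config
>>>   grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf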
>>>
>>> --Jirka
>>>
>>> On 06/06/2014 08:03 AM, Andrew Lau wrote:
>>>>
>>>> Hi Ivan,
>>>>
>>>> Thanks for the in depth reply.
>>>>
>>>> I've only seen this happen twice, and only after I added a third host
>>>> to the HA cluster. I wonder if that's the root problem.
>>>>
>>>> Have you seen this happen on all your installs, or only just after your
>>>> manual migration? It's a little frustrating that this is happening, as I
>>>> was hoping to get this into a production environment. It was all working
>>>> except for that log message :(
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>
>>>> On Fri, Jun 6, 2014 at 3:20 PM, combuster <combuster(a)archlinux.us> wrote:
>>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> this is something that I saw in my logs too, first on one node and then
>>>>> on the other three. When that happened on all four of them, the engine
>>>>> was corrupted beyond repair.
>>>>>
>>>>> First of all, I think that message is saying that sanlock can't get a
>>>>> lock on the shared storage that you defined for the hosted engine during
>>>>> installation. I got this error when I tried to manually migrate the
>>>>> hosted engine. There is an unresolved bug there, and I think it's related
>>>>> to this one:
>>>>>
>>>>> [Bug 1093366 - Migration of hosted-engine vm put target host score to
>>>>> zero]
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1093366
>>>>>
>>>>> This is a blocker bug (or should be) for the self-hosted engine and,
>>>>> from my own experience with it, it shouldn't be used in a production
>>>>> environment (not until it's fixed).
>>>>>
>>>>> Nothing I did could fix the fact that the score for the target node was
>>>>> zero; I tried to reinstall the node, reboot the node, restarted several
>>>>> services, tailed tons of logs etc., but to no avail. When only one node
>>>>> was left (the one actually running the hosted engine), I brought the
>>>>> engine's VM down gracefully (hosted-engine --vm-shutdown, I believe) and
>>>>> after that, when I tried to start the VM, it wouldn't load. Running VNC
>>>>> showed that the filesystem inside the VM was corrupted, and even after I
>>>>> ran fsck and finally got it started, it was too badly damaged. I
>>>>> succeeded in starting the engine itself (after repairing the postgresql
>>>>> service that wouldn't start), but the database was damaged enough that it
>>>>> acted pretty weird (it showed that storage domains were down while the
>>>>> VMs were running fine, etc.). Lucky for me, I had already exported all of
>>>>> the VMs at the first sign of trouble, so I installed ovirt-engine on a
>>>>> dedicated server and attached the export domain.
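>>>>>
>>>>> For completeness, the commands involved are the standard hosted-engine
>>>>> ones, roughly:
>>>>>
>>>>>   hosted-engine --vm-shutdown   # gracefully shut down the engine VM
>>>>>   hosted-engine --vm-status     # check VM state and host scores
>>>>>   hosted-engine --vm-start      # try to start the engine VM again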
>>>>>
>>>>> So while it's a really useful feature, and it does work (for the most
>>>>> part, i.e. automatic migration works), manually migrating the VM with
>>>>> the hosted-engine tool will lead to trouble.
>>>>>
>>>>> I hope that my experience with it will be of use to you. It happened to
>>>>> me two weeks ago; ovirt-engine was current (3.4.1) and there was no fix
>>>>> available.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Ivan
>>>>>
>>>>> On 06/06/2014 05:12 AM, Andrew Lau wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm seeing this weird message in my engine log
>>>>>
>>>>> 2014-06-06 03:06:09,380 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
>>>>> (DefaultQuartzScheduler_Worker-79) RefreshVmList vm id
>>>>> 85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5 status = WaitForLaunch on vds
>>>>> ov-hv2-2a-08-23 ignoring it in the refresh until migration is done
>>>>> 2014-06-06 03:06:12,494 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-89) START, DestroyVDSCommand(HostName =
>>>>> ov-hv2-2a-08-23, HostId = c04c62be-5d34-4e73-bd26-26f805b2dc60,
>>>>> vmId=85d4cfb9-f063-4c7c-a9f8-2b74f5f7afa5, force=false,
>>>>> secondsToWait=0, gracefully=false), log id: 62a9d4c1
>>>>> 2014-06-06 03:06:12,561 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-89) FINISH, DestroyVDSCommand, log id:
>>>>> 62a9d4c1
>>>>> 2014-06-06 03:06:12,652 INFO
>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>> (DefaultQuartzScheduler_Worker-89) Correlation ID: null, Call Stack:
>>>>> null, Custom Event ID: -1, Message: VM HostedEngine is down. Exit
>>>>> message: internal error Failed to acquire lock: error -243.
>>>>>
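>>>>> In case it's relevant, the sanlock state can also be inspected directly
>>>>> on each host; something like:
>>>>>
>>>>>   # show the lockspaces/resources sanlock currently holds on this host
>>>>>   sanlock client status
>>>>>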
>>>>> It also appears to occur on the other hosts in the cluster, except for
>>>>> the host which is running the hosted-engine. So right now, with 3
>>>>> servers, it shows up twice in the engine UI.
>>>>>
>>>>> The engine VM continues to run peacefully, without any issues on the
>>>>> host which doesn't have that error.
>>>>>
>>>>> Any ideas?