Actually, the broker is the only thing acting "right" between the broker and
the agent. The broker is up when I bring the system up, but the agent is
restarting all the time. Have a look.
The 11th is when I restarted this node after doing 'reinstall' in the web UI:
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2016-06-11 13:09:51 EDT; 3 days ago
 Main PID: 1285 (ovirt-ha-broker)
   CGroup: /system.slice/ovirt-ha-broker.service
           └─1285 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
Jun 15 09:23:56 njsevcnp01 ovirt-ha-broker[1285]: INFO:mgmt_bridge.MgmtBridge:Found bridge ovirtmgmt with ports
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:mem_free.MemFree:memFree: 26408
Uptime of the process:
# ps -Aef | grep -i broker
vdsm 1285 1 2 Jun11 ? 02:27:50 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
But the agent... is restarting all the time
# ps -Aef | grep -i ovirt-ha-agent
vdsm 76116 1 0 09:19 ? 00:00:01 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
9:19 AM ET is the last restart. Even the logs say it:
[root@njsevcnp01 ovirt-hosted-engine-ha]# grep -i 'restarting agent' agent.log | wc -l
232719
And the restarts come roughly every 36 seconds:
[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i 'restarting agent'
MainThread::WARNING::2016-06-15 09:23:53,029::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '6'
MainThread::WARNING::2016-06-15 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::WARNING::2016-06-15 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::WARNING::2016-06-15 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::WARNING::2016-06-15 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '0'
MainThread::WARNING::2016-06-15 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '1'
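Pulling the timestamps out of those warnings shows the cadence directly; a quick sketch, with the timestamps copied from the excerpt above:

```python
from datetime import datetime

# "Restarting agent" timestamps copied from the agent.log excerpt above.
restart_times = [
    "2016-06-15 09:23:53,029",
    "2016-06-15 09:24:28,953",
    "2016-06-15 09:25:04,879",
    "2016-06-15 09:25:40,790",
    "2016-06-15 09:26:17,136",
    "2016-06-15 09:26:53,063",
]

def restart_intervals(stamps):
    """Return the number of seconds between consecutive restarts."""
    parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]

print(restart_intervals(restart_times))  # each gap is ~36 seconds
```

Every gap is about 36 seconds, which matches the log below: roughly 31 seconds for the monitor start to fail with the timeout, plus the short pause before the next restart attempt.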
The full log around each restart looks like this, saying "connection timed
out", but it's not saying *what* is timing out, so I have nothing else to
really go on here:
[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i restart
MainThread::ERROR::2016-06-15 09:24:23,948::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::ERROR::2016-06-15 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::ERROR::2016-06-15 09:25:35,785::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::ERROR::2016-06-15 09:26:12,131::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '0'
MainThread::ERROR::2016-06-15 09:26:48,058::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '1'
MainThread::ERROR::2016-06-15 09:27:23,969::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:27:28,973::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '2'
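For what it's worth, the agent reaches the broker over a local Unix socket, so one thing worth checking is whether that socket even accepts a connection. A minimal sketch; the socket path below is my assumption for the 3.6 series, so verify it on the host first (e.g. with `ss -xlp | grep broker`):

```python
import socket

# Path where ovirt-ha-broker is assumed to listen -- verify on your host,
# e.g. with `ss -xlp | grep broker`; treat this path as an assumption.
BROKER_SOCKET = "/var/run/ovirt-hosted-engine-ha/broker.socket"

def can_reach_broker(path, timeout=5.0):
    """Try to connect to the broker's Unix socket; True if it accepts."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return True
    except OSError:
        return False
    finally:
        s.close()

print(can_reach_broker(BROKER_SOCKET))
```

If this returns False while the broker service shows "active (running)", that would point at the broker accepting TCP-level state but not servicing its listener, which fits the connect/close churn in the broker journal above.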
Storage is also completely fine. There are no logs indicating anything
"going away" or having issues. The engine has a dedicated NFS NAS device,
while VM storage is on a completely separate storage cluster. Storage runs
on a 100% dedicated backend network with no changes being made.
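If a number is more convincing than an absence of log lines, a small synced write on the engine storage mount gives a quick latency check. The mount path below is a placeholder/assumption; check `mount | grep rhev` (or your NFS mount list) for the real one:

```python
import os
import tempfile
import time

# Hypothetical mount point of the hosted-engine NFS storage domain --
# a placeholder, not a real path; find yours with `mount | grep rhev`.
MOUNT = "/rhev/data-center/mnt/<your-nfs-server>:_engine"

def write_latency(directory, payload=b"x" * 4096):
    """Time a small synced write-and-delete in `directory`, in seconds."""
    start = time.time()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the write through to the backing storage
    finally:
        os.close(fd)
        os.unlink(path)
    return time.time() - start

# Sanity-check locally; on the host, pass MOUNT instead of the temp dir.
print(write_latency(tempfile.gettempdir()))
```

Sub-second latencies there would support the "storage is fine" reading; multi-second ones would not.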
On Wed, Jun 15, 2016 at 7:42 AM, Martin Sivak <msivak(a)redhat.com> wrote:
> Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed:
> Connection timed out
> Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]:
> ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed
> to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}:
> Connection timed out' - trying to restart agent
The broker is broken or down. Check the status of the ovirt-ha-broker service.
> The other interesting thing is this log from node01. The odd thing is that
> it seems there is some split brain somewhere in oVirt because this log is
> from node02 but it is asking the engine and its getting back "vm not running
> on this host' rather than 'stale data'. But I dont know engine internals
This is another piece that points to broker or storage issues. The agent
collects local data and then publishes it to the other nodes through the
broker. So it is possible for the agent to know the status of the VM
locally but not be able to publish it.
The hosted-engine command line tool then reads the synchronization
whiteboard too, but it does not see anything that was not published, and
so it ends up reporting stale data.
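That publish/read split can be pictured with a toy model of the whiteboard, purely illustrative and not oVirt's actual code:

```python
# Toy model of the synchronization whiteboard: each host publishes a
# (timestamp, state) entry; readers call the data "stale" once the
# timestamp stops moving. Illustrative only -- not oVirt's implementation.

STALE_AFTER = 60  # seconds without an update before a reader reports stale

class Whiteboard:
    def __init__(self):
        self.entries = {}  # host -> (host_ts, state)

    def publish(self, host, host_ts, state):
        # In oVirt this write goes through the broker; if the broker
        # connection times out, the entry simply never gets updated.
        self.entries[host] = (host_ts, state)

    def read(self, host, now):
        ts, state = self.entries.get(host, (None, None))
        if ts is None or now - ts > STALE_AFTER:
            return "stale data"
        return state

wb = Whiteboard()
wb.publish("njsevcnp02", host_ts=1000, state="engine up")
print(wb.read("njsevcnp02", now=1030))  # -> engine up
print(wb.read("njsevcnp02", now=2000))  # -> stale data
```

So a host can know its own VM state perfectly well while every other host, and the CLI, sees only stale data, exactly because the broker hop failed.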
>> What is the status of the hosted engine services? systemctl status
>> ovirt-ha-agent ovirt-ha-broker
Please check the services.
Best regards
Martin
On Tue, Jun 14, 2016 at 2:16 PM, Charles Kozler <charles(a)fixflyer.com>
wrote:
> Martin -
>
> One thing I noticed on all of the nodes is this:
>
> Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed:
> Connection timed out
> Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]:
> ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed
> to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}:
> Connection timed out' - trying to restart agent
>
> Then the agent is restarted
>
> [root@njsevcnp01 ~]# ps -Aef | grep -i ovirt-ha-agent | grep -iv grep
> vdsm 15713 1 0 08:09 ? 00:00:01 /usr/bin/python
> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
>
> I dont know why the connection would time out because as you can see that
> log is from node01 and I cant figure out why its timing out on the
> connection
>
> The other interesting thing is this log from node01. The odd thing is that
> it seems there is some split brain somewhere in oVirt because this log is
> from node02 but it is asking the engine and its getting back "vm not running
> on this host' rather than 'stale data'. But I dont know engine internals
>
> MainThread::INFO::2016-06-14
> 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host njsevcnp02 (id 2): {hostname: njsevcnp02, host-id: 2, engine-status:
> {reason: vm not running on this host, health: bad, vm: down, detail:
> unknown}, score: 0, stopped: True, maintenance: False, crc32: 25da07df,
> host-ts: 3030}
> MainThread::INFO::2016-06-14
> 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status:
> {reason: vm not running on this host, health: bad, vm: down, detail:
> unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb,
> host-ts: 10877406}
>
>
> And that same log on node02 where the engine is running
>
>
> MainThread::INFO::2016-06-14
> 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host njsevcnp01 (id 1): {hostname: njsevcnp01, host-id: 1, engine-status:
> {reason: vm not running on this host, health: bad, vm: down, detail:
> unknown}, score: 0, stopped: True, maintenance: False, crc32: 260dbf06,
> host-ts: 327}
> MainThread::INFO::2016-06-14
> 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status:
> {reason: vm not running on this host, health: bad, vm: down, detail:
> unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb,
> host-ts: 10877406}
> MainThread::INFO::2016-06-14
> 08:15:44,451::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Local (id 2): {engine-health: {health: good, vm: up, detail: up}, bridge:
> True, mem-free: 20702.0, maintenance: False, cpu-load: None, gateway: True}
> MainThread::INFO::2016-06-14
> 08:15:44,452::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Trying: notify time=1465906544.45 type=state_transition
> detail=StartState-ReinitializeFSM hostname=njsevcnp02
>
> On Tue, Jun 14, 2016 at 7:59 AM, Martin Sivak <msivak(a)redhat.com> wrote:
>>
>> Hi,
>>
>> is there anything interesting in the hosted engine log files?
>> /var/log/ovirt-hosted-engine-ha/agent.log
>>
>> There should be something appearing there every 10 seconds or faster.
>>
>> What is the status of the hosted engine services? systemctl status
>> ovirt-ha-agent ovirt-ha-broker
>>
>>
>> Best regards
>>
>> --
>> Martin Sivak
>> SLA / oVirt
>>
>> On Sat, Jun 11, 2016 at 8:53 PM, Charles Kozler <charles(a)fixflyer.com>
>> wrote:
>> > See linked images please. As you can see all three nodes are reporting
>> > stale
>> > data. The results of this are:
>> >
>> > 1. Not all VM's migrate seamlessly in the cluster. Sometimes I have to
>> > shut
>> > them down to get them to be able to migrate again
>> >
>> > 2. Hosted engine refuses to move due to constraints (image). This part
>> > doesnt make sense to me because I can forcefully shut it down and then
>> > go directly on a hosted engine node and bring it back up. Also, the Web UI
>> > shows all nodes under the cluster except then it thinks its not apart of
>> > the cluster
>> >
>> > 3. Time is in sync (image)
>> >
>> > 4. Storage is 100% fine. Gluster back end reports mirroring and status
>> > 'started'. No split brain has occurred and ovirt nodes have never lost
>> > connectivity to storage
>> >
>> > 5. I reinstalled all three nodes. For some reason only node 3 still
>> > shows as having updates available. (image). For clarity, I did not click
>> > "upgrade" I simply did 'reinstall' from the Web UI. Having looked at the
>> > output and yum.log from /var/log it almost looks like it did do an
>> > update. All package versions across all three nodes are the same
>> > (respective to ovirt/vdsm) (image). For some reason though
>> > ovirt-engine-appliance-3.6-20160126.1.el7.centos.noarch exists on node 1
>> > but not on node 2 or 3. Could this be relative? I dont recall installing
>> > that specifically on node 1 but I may have
>> >
>> > Been slamming my head on this so I am hoping you can provide some
>> > assistance
>> >
>> > http://imgur.com/a/6xkaS
>> >
>> > Thanks!
>> >
>> > --
>> >
>> > Charles Kozler
>> > Vice President, IT Operations
>> >
>> > FIX Flyer, LLC
>> > 225 Broadway | Suite 1600 | New York, NY 10007
>> > 1-888-349-3593
>> > http://www.fixflyer.com
>> >
>> > NOTICE TO RECIPIENT: THIS E-MAIL IS MEANT ONLY FOR THE INTENDED
>> > RECIPIENT(S)
>> > OF THE TRANSMISSION, AND CONTAINS CONFIDENTIAL INFORMATION WHICH IS
>> > PROPRIETARY TO FIX FLYER LLC. ANY UNAUTHORIZED USE, COPYING,
>> > DISTRIBUTION,
>> > OR DISSEMINATION IS STRICTLY PROHIBITED. ALL RIGHTS TO THIS INFORMATION
>> > IS RESERVED BY FIX FLYER LLC. IF YOU ARE NOT THE INTENDED RECIPIENT,
>> > PLEASE CONTACT THE SENDER BY REPLY E-MAIL AND PLEASE DELETE THIS E-MAIL
>> > FROM YOUR SYSTEM AND DESTROY ANY COPIES.
>> >
>> > _______________________________________________
>> > Users mailing list
>> > Users(a)ovirt.org
>> > http://lists.ovirt.org/mailman/listinfo/users
>> >
>
--
*Charles Kozler*
*Vice President, IT Operations*
FIX Flyer, LLC
225 Broadway | Suite 1600 | New York, NY 10007
1-888-349-3593
http://www.fixflyer.com