Charles, please check the broker log too. It is possible, for example,
that the broker process is running but is not accepting connections.
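
To test that directly, here is a minimal sketch that only tries to open
a connection to the broker's listening socket. It assumes the broker
listens on a Unix stream socket; `accepts_connections` is an illustrative
helper name, not part of ovirt-hosted-engine-ha.

```python
import socket


def accepts_connections(path, timeout=5.0):
    """Return True if a Unix stream socket at `path` accepts a connection.

    Only the connect step is checked; nothing is sent to the listener.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return True
    except socket.error:
        # Covers a missing socket file, a refused connection and a timeout.
        return False
    finally:
        s.close()
```

Usage would be something like
`accepts_connections("/var/run/ovirt-hosted-engine-ha/broker.socket")`,
where the path is an assumption on my side, so please adjust it to your
installation. A False result while systemctl still shows the broker as
running would match the case described above.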
Martin
On Wed, Jun 15, 2016 at 3:32 PM, Charles Kozler <charles@fixflyer.com> wrote:
> Actually, of the broker and the agent, the broker is the only one acting
> "right". The broker stays up once I bring the system up, but the agent is
> restarting all the time. Have a look:
>
> The 11th is when I restarted this node after doing 'reinstall' in the web UI
>
> ● ovirt-ha-broker.service - oVirt Hosted Engine High Availability
> Communications Broker
> Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled;
> vendor preset: disabled)
> Active: active (running) since Sat 2016-06-11 13:09:51 EDT; 3 days ago
> Main PID: 1285 (ovirt-ha-broker)
> CGroup: /system.slice/ovirt-ha-broker.service
> └─1285 /usr/bin/python
> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
>
> Jun 15 09:23:56 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:mgmt_bridge.MgmtBridge:Found bridge ovirtmgmt with ports
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> established
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> closed
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> established
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> closed
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> established
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> closed
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> established
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection
> closed
> Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]:
> INFO:mem_free.MemFree:memFree: 26408
>
> Uptime of the process:
>
> # ps -Aef | grep -i broker
> vdsm 1285 1 2 Jun11 ? 02:27:50 /usr/bin/python
> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
>
> But the agent... is restarting all the time
>
> # ps -Aef | grep -i ovirt-ha-agent
> vdsm 76116 1 0 09:19 ? 00:00:01 /usr/bin/python
> /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
>
> 9:19 AM ET is the last restart. Even the logs say so:
>
> [root@njsevcnp01 ovirt-hosted-engine-ha]# grep -i 'restarting agent'
> agent.log | wc -l
> 232719
>
> And it restarts roughly every 36 seconds:
>
> [root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i
> 'restarting agent'
> MainThread::WARNING::2016-06-15
> 09:23:53,029::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '6'
> MainThread::WARNING::2016-06-15
> 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '7'
> MainThread::WARNING::2016-06-15
> 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '8'
> MainThread::WARNING::2016-06-15
> 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '9'
> MainThread::WARNING::2016-06-15
> 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '0'
> MainThread::WARNING::2016-06-15
> 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '1'
>
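
The spacing between those restart attempts can be measured rather than
eyeballed. A minimal sketch, assuming each entry is on a single line in
agent.log (the mail client wrapped them above); `restart_intervals` is a
hypothetical helper name:

```python
import re
from datetime import datetime

# Matches timestamps such as "2016-06-15 09:23:53,029".
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def restart_intervals(lines):
    """Return the seconds elapsed between consecutive 'Restarting agent' entries."""
    stamps = []
    for line in lines:
        if "Restarting agent" not in line:
            continue
        match = STAMP.search(line)
        if match is None:
            continue
        moment = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        # Add the millisecond part that follows the comma.
        stamps.append(moment.replace(microsecond=int(match.group(2)) * 1000))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Fed with the entries quoted above, this gives gaps of about 36 seconds
each, so the agent is failing on a regular cycle rather than at random.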
> The full log around each restart looks like this. It says "Connection timed
> out" but not *what* is timing out, so I have nothing else to really go on here:
>
> [root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i
> restart
> MainThread::ERROR::2016-06-15
> 09:24:23,948::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Error: 'Failed to start monitor <type 'type'>, options {'hostname':
> 'njsevcnp01'}: Connection timed out' - trying to restart agent
> MainThread::WARNING::2016-06-15
> 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '7'
> MainThread::ERROR::2016-06-15
> 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Error: 'Failed to start monitor <type 'type'>, options {'hostname':
> 'njsevcnp01'}: Connection timed out' - trying to restart agent
> MainThread::WARNING::2016-06-15
> 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '8'
> MainThread::ERROR::2016-06-15
> 09:25:35,785::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Error: 'Failed to start monitor <type 'type'>, options {'hostname':
> 'njsevcnp01'}: Connection timed out' - trying to restart agent
> MainThread::WARNING::2016-06-15
> 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '9'
> MainThread::ERROR::2016-06-15
> 09:26:12,131::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Error: 'Failed to start monitor <type 'type'>, options {'hostname':
> 'njsevcnp01'}: Connection timed out' - trying to restart agent
> MainThread::WARNING::2016-06-15
> 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '0'
> MainThread::ERROR::2016-06-15
> 09:26:48,058::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Error: 'Failed to start monitor <type 'type'>, options {'hostname':
> 'njsevcnp01'}: Connection timed out' - trying to restart agent
> MainThread::WARNING::2016-06-15
> 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '1'
> MainThread::ERROR::2016-06-15
> 09:27:23,969::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Error: 'Failed to start monitor <type 'type'>, options {'hostname':
> 'njsevcnp01'}: Connection timed out' - trying to restart agent
> MainThread::WARNING::2016-06-15
> 09:27:28,973::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Restarting agent, attempt '2'
>
>
> Storage is also completely fine. No logs mention anything "going away" or
> having issues. The engine has a dedicated NFS NAS device, while VM storage is
> a completely separate storage cluster. Storage has a 100% dedicated backend
> network with no changes being made.
>
>
>
> On Wed, Jun 15, 2016 at 7:42 AM, Martin Sivak <msivak@redhat.com> wrote:
>>
>> > Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent
>> > ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection
>> > closed:
>> > Connection timed out
>> > Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]:
>> > ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error:
>> > 'Failed
>> > to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}:
>> > Connection timed out' - trying to restart agent
>>
>> Broker is broken or down. Check the status of ovirt-ha-broker service.
>>
>> > The other interesting thing is this log from node01. The odd thing is that
>> > it seems there is some split brain somewhere in oVirt, because this log is
>> > from node02, but it is asking the engine and it's getting back 'vm not
>> > running on this host' rather than 'stale data'. But I don't know the engine
>> > internals.
>>
>> This is another piece that points to broker or storage issues. Agent
>> collects local data and then publishes them to other nodes through
>> broker. So it is possible for the agent to know the status of the VM
>> locally, but not be able to publish it.
>>
>> hosted-engine command line tool then reads the synchronization
>> whiteboard too, but it does not see anything that was not published
>> and ends up reporting stale data.
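
A toy model of that flow, only to illustrate why a healthy local view can
still show up as stale elsewhere. Every name here (`Whiteboard`,
`agent_cycle`, the 60 second `max_age`) is hypothetical and does not
mirror the real ovirt-hosted-engine-ha code:

```python
class Whiteboard(object):
    """Toy stand-in for the shared synchronization whiteboard."""

    def __init__(self):
        self.entries = {}  # host -> (published_at, status)

    def publish(self, host, status, now):
        self.entries[host] = (now, status)

    def read(self, host, now, max_age=60.0):
        # A reader only sees what was published, and only trusts it for a while.
        published_at, status = self.entries.get(host, (None, None))
        if published_at is None or now - published_at > max_age:
            return "stale data"
        return status


def agent_cycle(board, host, local_status, broker_ok, now):
    """The agent always knows its local status but publishes only via the broker."""
    if broker_ok:
        board.publish(host, local_status, now)
    return local_status
```

With `broker_ok=False` the agent still returns its correct local status,
but nothing new reaches the whiteboard, so a later `read()` from another
host or from the command line tool returns "stale data".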
>>
>> >> What is the status of the hosted engine services? systemctl status
>> >> ovirt-ha-agent ovirt-ha-broker
>>
>> Please check the services.
>>
>> Best regards
>>
>> Martin
>>
>> On Tue, Jun 14, 2016 at 2:16 PM, Charles Kozler <charles@fixflyer.com>
>> wrote:
>> > Martin -
>> >
>> > One thing I noticed on all of the nodes is this:
>> >
>> > Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent
>> > ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection
>> > closed:
>> > Connection timed out
>> > Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]:
>> > ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error:
>> > 'Failed
>> > to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}:
>> > Connection timed out' - trying to restart agent
>> >
>> > Then the agent is restarted
>> >
>> > [root@njsevcnp01 ~]# ps -Aef | grep -i ovirt-ha-agent | grep -iv grep
>> > vdsm 15713 1 0 08:09 ? 00:00:01 /usr/bin/python
>> > /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
>> >
>> > I don't know why the connection would time out because, as you can see,
>> > that log is from node01, and I can't figure out why it's timing out on the
>> > connection.
>> >
>> > The other interesting thing is this log from node01. The odd thing is that
>> > it seems there is some split brain somewhere in oVirt, because this log is
>> > from node02, but it is asking the engine and it's getting back 'vm not
>> > running on this host' rather than 'stale data'. But I don't know the engine
>> > internals.
>> >
>> > MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> > Host njsevcnp02 (id 2): {hostname: njsevcnp02, host-id: 2,
>> > engine-status:
>> > {reason: vm not running on this host, health: bad, vm: down, detail:
>> > unknown}, score: 0, stopped: True, maintenance: False, crc32: 25da07df,
>> > host-ts: 3030}
>> > MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> > Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3,
>> > engine-status:
>> > {reason: vm not running on this host, health: bad, vm: down, detail:
>> > unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb,
>> > host-ts: 10877406}
>> >
>> >
>> > And that same log on node02 where the engine is running
>> >
>> >
>> > MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> > Host njsevcnp01 (id 1): {hostname: njsevcnp01, host-id: 1,
>> > engine-status:
>> > {reason: vm not running on this host, health: bad, vm: down, detail:
>> > unknown}, score: 0, stopped: True, maintenance: False, crc32: 260dbf06,
>> > host-ts: 327}
>> > MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> > Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3,
>> > engine-status:
>> > {reason: vm not running on this host, health: bad, vm: down, detail:
>> > unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb,
>> > host-ts: 10877406}
>> > MainThread::INFO::2016-06-14 08:15:44,451::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> > Local (id 2): {engine-health: {health: good, vm: up, detail: up},
>> > bridge:
>> > True, mem-free: 20702.0, maintenance: False, cpu-load: None, gateway:
>> > True}
>> > MainThread::INFO::2016-06-14 08:15:44,452::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>> > Trying: notify time=1465906544.45 type=state_transition
>> > detail=StartState-ReinitializeFSM hostname=njsevcnp02
>> >
>> > On Tue, Jun 14, 2016 at 7:59 AM, Martin Sivak <msivak@redhat.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> is there anything interesting in the hosted engine log files?
>> >> /var/log/ovirt-hosted-engine-ha/agent.log
>> >>
>> >> There should be something appearing there every 10 seconds or faster.
>> >>
>> >> What is the status of the hosted engine services? systemctl status
>> >> ovirt-ha-agent ovirt-ha-broker
>> >>
>> >>
>> >> Best regards
>> >>
>> >> --
>> >> Martin Sivak
>> >> SLA / oVirt
>> >>
>> >> On Sat, Jun 11, 2016 at 8:53 PM, Charles Kozler <charles@fixflyer.com>
>> >> wrote:
>> >> > See linked images please. As you can see, all three nodes are reporting
>> >> > stale data. The results of this are:
>> >> >
>> >> > 1. Not all VMs migrate seamlessly in the cluster. Sometimes I have to
>> >> > shut them down to get them to be able to migrate again.
>> >> >
>> >> > 2. Hosted engine refuses to move due to constraints (image). This part
>> >> > doesn't make sense to me because I can forcefully shut it down and then
>> >> > go directly on a hosted engine node and bring it back up. Also, the Web
>> >> > UI shows all nodes under the cluster, except then it thinks it's not a
>> >> > part of the cluster.
>> >> >
>> >> > 3. Time is in sync (image).
>> >> >
>> >> > 4. Storage is 100% fine. Gluster back end reports mirroring and status
>> >> > 'started'. No split brain has occurred and ovirt nodes have never lost
>> >> > connectivity to storage.
>> >> >
>> >> > 5. I reinstalled all three nodes. For some reason only node 3 still
>> >> > shows as having updates available (image). For clarity, I did not click
>> >> > "upgrade"; I simply did 'reinstall' from the Web UI. Having looked at the
>> >> > output and yum.log from /var/log, it almost looks like it did do an
>> >> > update. All package versions across all three nodes are the same
>> >> > (respective to ovirt/vdsm) (image). For some reason, though,
>> >> > ovirt-engine-appliance-3.6-20160126.1.el7.centos.noarch exists on node 1
>> >> > but not on node 2 or 3. Could this be related? I don't recall installing
>> >> > that specifically on node 1, but I may have.
>> >> >
>> >> > Been slamming my head on this, so I am hoping you can provide some
>> >> > assistance.
>> >> >
>> >> > http://imgur.com/a/6xkaS
>> >> >
>> >> > Thanks!
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > Users mailing list
>> >> > Users@ovirt.org
>> >> > http://lists.ovirt.org/mailman/listinfo/users
>> >> >
>> >
>> >
>> >
>> >
>
>
>
>
> --
>
> Charles Kozler
> Vice President, IT Operations
>
> FIX Flyer, LLC
> 225 Broadway | Suite 1600 | New York, NY 10007
> 1-888-349-3593
> http://www.fixflyer.com
>
> NOTICE TO RECIPIENT: THIS E-MAIL IS MEANT ONLY FOR THE INTENDED RECIPIENT(S)
> OF THE TRANSMISSION, AND CONTAINS CONFIDENTIAL INFORMATION WHICH IS
> PROPRIETARY TO FIX FLYER LLC. ANY UNAUTHORIZED USE, COPYING, DISTRIBUTION,
> OR DISSEMINATION IS STRICTLY PROHIBITED. ALL RIGHTS TO THIS INFORMATION IS
> RESERVED BY FIX FLYER LLC. IF YOU ARE NOT THE INTENDED RECIPIENT, PLEASE
> CONTACT THE SENDER BY REPLY E-MAIL AND PLEASE DELETE THIS E-MAIL FROM YOUR
> SYSTEM AND DESTROY ANY COPIES.