Hmmm, virsh tells me the HE is running but it hasn't come up and the agent.log is full of the same errors.

On Wed, Apr 8, 2020 at 11:31 PM Shareef Jalloq <shareef@jalloq.co.uk> wrote:
Ah hah!  Ok, so I've managed to start it using virsh on the second host but my first host is still dead.

First of all, what are these 56,317 .prob- files that get dumped to the NFS mounts?

Secondly, why doesn't the node mount the NFS directories at boot?  Is that the issue with this particular node?

On Wed, Apr 8, 2020 at 11:12 PM <eevans@digitaldatatechs.com> wrote:

Did you try 'virsh list --inactive'?

Eric Evans

Digital Data Services LLC.

304.660.9080

From: Shareef Jalloq <shareef@jalloq.co.uk>
Sent: Wednesday, April 8, 2020 5:58 PM
To: Strahil Nikolov <hunter86_bg@yahoo.com>
Cc: Ovirt Users <users@ovirt.org>
Subject: [ovirt-users] Re: ovirt-engine unresponsive - how to rescue?

I've now shut down the VMs on one host and rebooted it but the agent service doesn't start.  If I run 'hosted-engine --vm-status' I get:

The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

and indeed if I list the mounts under /rhev/data-center/mnt, only one of the directories is mounted.  I have 3 NFS mounts: one ISO Domain and two Data Domains.  Only one Data Domain has mounted, and it has lots of .prob- files in it.  So why haven't the other NFS exports been mounted?
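
For anyone checking the same thing, a rough sketch of listing which directories under /rhev/data-center/mnt are not actually mounted might look like this. It assumes 'mountpoint' from util-linux is installed; the function name list_unmounted is made up for illustration and is not part of oVirt.

```shell
#!/bin/sh
# Print every directory directly under the given base that is NOT a mount point.
# Hypothetical helper; the default path is the one mentioned in this thread.
list_unmounted() {
    base="${1:-/rhev/data-center/mnt}"
    for dir in "$base"/*/; do
        # Skip the unexpanded glob when the base is empty or missing.
        [ -d "$dir" ] || continue
        dir="${dir%/}"
        if ! mountpoint -q "$dir"; then
            printf '%s\n' "$dir"
        fi
    done
}
```

Running list_unmounted with no argument checks the default path; any directory it prints exists but has nothing mounted on it.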

Manually mounting them doesn't seem to have helped much either.  I can start the broker service but the agent service says no.  Same error as the one in my last email.

Shareef.

On Wed, Apr 8, 2020 at 9:57 PM Shareef Jalloq <shareef@jalloq.co.uk> wrote:

Right, still down.  I've run virsh and it doesn't know anything about the engine vm.

I've restarted the broker and agent services and I still get nothing from 'list' in the virsh shell.

In the logs under /var/log/ovirt-hosted-engine-ha I see lots of errors:

broker.log:

MainThread::INFO::2020-04-08 20:56:20,138::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started

MainThread::INFO::2020-04-08 20:56:20,138::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors

MainThread::INFO::2020-04-08 20:56:20,138::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network

MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine

MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge

MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network

MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load

MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health

MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge

MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine

MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load

MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free

MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain

MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain

MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free

MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health

MainThread::INFO::2020-04-08 20:56:20,143::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors

MainThread::INFO::2020-04-08 20:56:20,197::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage

MainThread::INFO::2020-04-08 20:56:20,197::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server

MainThread::INFO::2020-04-08 20:56:20,414::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server

MainThread::INFO::2020-04-08 20:56:20,628::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain

MainThread::WARNING::2020-04-08 20:56:21,057::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:

(code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',)) 

MainThread::INFO::2020-04-08 20:56:21,901::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started

MainThread::INFO::2020-04-08 20:56:21,901::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors

agent.log:

MainThread::ERROR::2020-04-08 20:57:00,799::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent

MainThread::INFO::2020-04-08 20:57:00,799::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

MainThread::INFO::2020-04-08 20:57:11,144::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.3.6 started

MainThread::INFO::2020-04-08 20:57:11,182::hosted_engine::234::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: ovirt-node-01.phoelex.com

MainThread::INFO::2020-04-08 20:57:11,294::hosted_engine::543::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection

MainThread::INFO::2020-04-08 20:57:11,296::brokerlink::80::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}

MainThread::ERROR::2020-04-08 20:57:11,296::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors

MainThread::ERROR::2020-04-08 20:57:11,297::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent

    return action(he)

  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper

    return he.start_monitoring()

  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring

    self._initialize_broker()

  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker

    m.get('options', {}))

  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor

    ).format(t=type, o=options, e=e)

RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}]

MainThread::ERROR::2020-04-08 20:57:11,297::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent

MainThread::INFO::2020-04-08 20:57:11,297::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

On Wed, Apr 8, 2020 at 6:10 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:

On April 8, 2020 7:47:20 PM GMT+03:00, "Maton, Brett" <matonb@ltresources.co.uk> wrote:
>On the host you tried to restart the engine on:
>
>Add an alias to virsh (authenticates with virsh_auth.conf)
>
>alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
>
>Then run virsh:
>
>virsh
>
>virsh # list
> Id    Name                           State
>----------------------------------------------------
> xx    HostedEngine                   Paused
> xx    **********                     running
> ...
> xx    **********                     running
>
>HostedEngine should be in the list, try and resume the engine:
>
>virsh # resume HostedEngine
>
>On Wed, 8 Apr 2020 at 17:28, Shareef Jalloq <shareef@jalloq.co.uk>
>wrote:
>
>> Thanks!
>>
>> The status hangs due to, I guess, the VM being down....
>>
>> [root@ovirt-node-01 ~]# hosted-engine --vm-start
>> VM exists and is down, cleaning up and restarting
>> VM in WaitForLaunch
>>
>> but this doesn't seem to do anything.  OK, after a while I get a status of
>> it being barfed...
>>
>> --== Host ovirt-node-00.phoelex.com (id: 1) status ==--
>>
>> conf_on_shared_storage             : True
>> Status up-to-date                  : False
>> Hostname                           : ovirt-node-00.phoelex.com
>> Host ID                            : 1
>> Engine status                      : unknown stale-data
>> Score                              : 3400
>> stopped                            : False
>> Local maintenance                  : False
>> crc32                              : 9c4a034b
>> local_conf_timestamp               : 523362
>> Host timestamp                     : 523608
>> Extra metadata (valid at timestamp):
>> metadata_parse_version=1
>> metadata_feature_version=1
>> timestamp=523608 (Wed Apr  8 16:17:11 2020)
>> host-id=1
>> score=3400
>> vm_conf_refresh_time=523362 (Wed Apr  8 16:13:06 2020)
>> conf_on_shared_storage=True
>> maintenance=False
>> state=EngineDown
>> stopped=False
>>
>>
>> --== Host ovirt-node-01.phoelex.com (id: 2) status ==--
>>
>> conf_on_shared_storage             : True
>> Status up-to-date                  : True
>> Hostname                           : ovirt-node-01.phoelex.com
>> Host ID                            : 2
>> Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
>> Score                              : 0
>> stopped                            : False
>> Local maintenance                  : False
>> crc32                              : 5045f2eb
>> local_conf_timestamp               : 1737037
>> Host timestamp                     : 1737283
>> Extra metadata (valid at timestamp):
>> metadata_parse_version=1
>> metadata_feature_version=1
>> timestamp=1737283 (Wed Apr  8 16:16:17 2020)
>> host-id=2
>> score=0
>> vm_conf_refresh_time=1737037 (Wed Apr  8 16:12:11 2020)
>> conf_on_shared_storage=True
>> maintenance=False
>> state=EngineUnexpectedlyDown
>> stopped=False
>>
>> On Wed, Apr 8, 2020 at 5:09 PM Maton, Brett <matonb@ltresources.co.uk> wrote:
>>
>>> First steps, on one of your hosts as root:
>>>
>>> To get information:
>>> hosted-engine --vm-status
>>>
>>> To start the engine:
>>> hosted-engine --vm-start
>>>
>>>
>>> On Wed, 8 Apr 2020 at 17:00, Shareef Jalloq <shareef@jalloq.co.uk> wrote:
>>>
>>>> So my engine has gone down and I can't ssh into it either.  If I try to
>>>> log into the web-ui of the node it is running on, I get redirected because
>>>> the node can't reach the engine.
>>>>
>>>> What are my next steps?
>>>>
>>>> Shareef.
>>>> _______________________________________________
>>>> Users mailing list -- users@ovirt.org
>>>> To unsubscribe send an email to users-leave@ovirt.org
>>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>>> oVirt Code of Conduct:
>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives:
>>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/W7BP57OCIRSW5CDRQWR5MIKJUH3ISLCQ/
>>>>
>>>

This has to be resolved:

Engine status                      : unknown stale-data
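
A quick way to pick out which hosts report stale data is to filter the status output. This is only a sketch: the function name stale_hosts is invented here, and it assumes the output has the layout pasted earlier in this thread, where each "Status up-to-date" line comes just before its "Hostname" line.

```shell
#!/bin/sh
# Read 'hosted-engine --vm-status' style text on stdin and print the hostname
# of every host whose "Status up-to-date" field is False.
# Hypothetical helper; it relies on the field order seen in this thread.
stale_hosts() {
    awk -F' *: *' '
        /^Status up-to-date/ { stale = ($2 ~ /^False/) }
        /^Hostname/          { if (stale) print $2; stale = 0 }
    '
}
```

It would be used as 'hosted-engine --vm-status | stale_hosts'. For the status pasted above, it should print only ovirt-node-00.phoelex.com.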

Run 'hosted-engine --vm-status' again. If it remains the same, restart ovirt-ha-broker.service and ovirt-ha-agent.service.

Verify that the engine's storage is available. Then monitor the broker and agent logs in /var/log/ovirt-hosted-engine-ha.
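
Put together, the sequence could be sketched roughly as below. The wrapper names (run, recheck_hosted_engine) and the EXECUTE variable are invented for this sketch. By default it only prints the commands, so nothing is restarted by accident.

```shell
#!/bin/sh
# Dry run by default: each step is printed, not executed.
# Set EXECUTE=1 (a variable made up for this sketch) to really run them as root.
run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

recheck_hosted_engine() {
    # 1. Look at the current state again.
    run hosted-engine --vm-status
    # 2. Restart the HA services named above.
    run systemctl restart ovirt-ha-broker.service ovirt-ha-agent.service
    # 3. Confirm that both services came back.
    run systemctl --no-pager status ovirt-ha-broker.service ovirt-ha-agent.service
    # 4. Show the most recent broker and agent log lines.
    run tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log /var/log/ovirt-hosted-engine-ha/agent.log
}
```

Calling recheck_hosted_engine prints the four steps; 'EXECUTE=1 recheck_hosted_engine' runs them for real. To keep watching the logs afterwards, swap the last step for 'tail -f'.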

Best Regards,
Strahil Nikolov