OK, let's go through this. I'm looking at the node that at least still has
some VMs running. virsh also tells me that the HostedEngine VM is running
but it's unresponsive and I can't shut it down.
1. All storage domains exist and are mounted.
2. The ha_agent exists:
[root@ovirt-node-01 ovirt-hosted-engine-ha]# ls /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/
dom_md ha_agent images master
3. There are two links:
[root@ovirt-node-01 ovirt-hosted-engine-ha]# ll /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ha_agent/
total 8
lrwxrwxrwx. 1 vdsm kvm 132 Apr 2 14:50 hosted-engine.lockspace -> /var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ffb90b82-42fe-4253-85d5-aaec8c280aaf/90e68791-0c6f-406a-89ac-e0d86c631604
lrwxrwxrwx. 1 vdsm kvm 132 Apr 2 14:50 hosted-engine.metadata -> /var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/2161aed0-7250-4c1d-b667-ac94f60af17e/6b818e33-f80a-48cc-a59c-bba641e027d4
4. The services exist but all seem to have some sort of warning:
a) Apr 08 18:10:55 ovirt-node-01.phoelex.com sanlock[1728]: 2020-04-08 18:10:55 1744152 [36796]: s16 delta_renew long write time 10 sec
b) Mar 23 18:02:59 ovirt-node-01.phoelex.com supervdsmd[29409]: failed to load module nvdimm: libbd_nvdimm.so.2: cannot open shared object file: No such file or directory
c) Apr 09 08:05:13 ovirt-node-01.phoelex.com vdsm[4801]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?
d) Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 22:48:27.134+0000: 29309: warning : qemuGetProcessInfo:1404 : cannot parse process status data
Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 22:48:27.134+0000: 29309: error : virNetDevTapInterfaceStats:764 : internal error: /proc/net/dev: Interface not found
Apr 08 23:09:39 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 23:09:39.844+0000: 29307: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Apr 09 01:05:26 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-09 01:05:26.660+0000: 29307: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
5 & 6. The broker log is continually printing this error:
MainThread::INFO::2020-04-09
08:07:31,438::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run)
ovirt-hosted-engine-ha broker 2.3.6 started
MainThread::DEBUG::2020-04-09
08:07:31,438::broker::55::ovirt_hosted_engine_ha.broker.broker.Broker::(run)
Running broker
MainThread::DEBUG::2020-04-09
08:07:31,438::broker::120::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_monitor)
Starting monitor
MainThread::INFO::2020-04-09
08:07:31,438::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Searching for submonitors in
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2020-04-09
08:07:31,439::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor network
MainThread::INFO::2020-04-09
08:07:31,440::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-04-09
08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor mgmt-bridge
MainThread::INFO::2020-04-09
08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor network
MainThread::INFO::2020-04-09
08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor cpu-load
MainThread::INFO::2020-04-09
08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor engine-health
MainThread::INFO::2020-04-09
08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor mgmt-bridge
MainThread::INFO::2020-04-09
08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-04-09
08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor cpu-load
MainThread::INFO::2020-04-09
08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor mem-free
MainThread::INFO::2020-04-09
08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor storage-domain
MainThread::INFO::2020-04-09
08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor storage-domain
MainThread::INFO::2020-04-09
08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor mem-free
MainThread::INFO::2020-04-09
08:07:31,444::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Loaded submonitor engine-health
MainThread::INFO::2020-04-09
08:07:31,444::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
Finished loading submonitors
MainThread::DEBUG::2020-04-09
08:07:31,444::broker::128::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_storage_broker)
Starting storage broker
MainThread::DEBUG::2020-04-09
08:07:31,444::storage_backends::369::ovirt_hosted_engine_ha.lib.storage_backends::(connect)
Connecting to VDSM
MainThread::DEBUG::2020-04-09
08:07:31,444::util::384::ovirt_hosted_engine_ha.lib.storage_backends::(__log_debug)
Creating a new json-rpc connection to VDSM
Client localhost:54321::DEBUG::2020-04-09
08:07:31,453::concurrent::258::root::(run) START thread <Thread(Client
localhost:54321, started daemon 139992488138496)> (func=<bound method
Reactor.process_requests of <yajsonrpc.betterAsyncore.Reactor object at
0x7f528acabc90>>, args=(), kwargs={})
Client localhost:54321::DEBUG::2020-04-09
08:07:31,459::stompclient::138::yajsonrpc.protocols.stomp.AsyncClient::(_process_connected)
Stomp connection established
MainThread::DEBUG::2020-04-09
08:07:31,467::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
response
MainThread::INFO::2020-04-09
08:07:31,530::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect)
Connecting the storage
MainThread::INFO::2020-04-09
08:07:31,531::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
Connecting storage server
MainThread::DEBUG::2020-04-09
08:07:31,531::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
response
MainThread::DEBUG::2020-04-09
08:07:31,534::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
response
MainThread::DEBUG::2020-04-09
08:07:32,199::storage_server::158::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(_validate_pre_connected_path)
Storage domain a6cea67d-dbfb-45cf-a775-b4d0d47b26f2 is not available
MainThread::INFO::2020-04-09
08:07:32,199::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
Connecting storage server
MainThread::DEBUG::2020-04-09
08:07:32,199::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
response
MainThread::DEBUG::2020-04-09
08:07:32,814::storage_server::363::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
[{u'status': 0, u'id': u'e29cf818-5ee5-46e1-85c1-8aeefa33e95d'}]
MainThread::INFO::2020-04-09
08:07:32,814::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
Refreshing the storage domain
MainThread::DEBUG::2020-04-09
08:07:32,815::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
response
MainThread::DEBUG::2020-04-09
08:07:33,129::storage_server::420::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
Error refreshing storage domain: Command StorageDomain.getStats with args
{'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
(code=350, message=Error in storage domain action:
(u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
MainThread::DEBUG::2020-04-09
08:07:33,130::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending
response
MainThread::DEBUG::2020-04-09
08:07:33,795::storage_backends::208::ovirt_hosted_engine_ha.lib.storage_backends::(_get_sector_size)
Command StorageDomain.getInfo with args {'storagedomainID':
'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
(code=350, message=Error in storage domain action:
(u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
MainThread::WARNING::2020-04-09
08:07:33,795::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__)
Can't connect vdsm storage: Command StorageDomain.getInfo with args
{'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
(code=350, message=Error in storage domain action:
(u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
The UUID it is moaning about is indeed the one that the HA sits on and is
the one I listed the contents of in step 2 above.
So why can't it see this domain?
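With debug logging on, the one interesting entry is easy to lose among the DEBUG lines. A minimal sketch for pulling only the WARNING and ERROR entries out of the broker log and counting each distinct message; it assumes the on-disk log keeps every entry on a single line in the Thread::LEVEL::timestamp::module::line::logger::(function) layout seen above (the wrapping here comes from the mail client), and broker_problems is a hypothetical helper name, not part of ovirt-hosted-engine-ha:

```shell
#!/bin/sh
# broker_problems FILE
# Print each distinct WARNING/ERROR message found in FILE, prefixed by the
# number of times it occurs, most frequent first.
# Assumption: one log entry per line, formatted as
#   Thread::LEVEL::timestamp::module::line::logger::(function) message
broker_problems() {
    grep -E '^[^:]+::(WARNING|ERROR)::' "$1" |
        # Strip everything up to and including "::(function) ",
        # leaving only the message text.
        sed -E 's/^.*::\([^)]*\) //' |
        sort | uniq -c | sort -rn
}
```

Typical use would be: broker_problems /var/log/ovirt-hosted-engine-ha/broker.log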
Thanks, Shareef.
On Thu, Apr 9, 2020 at 6:12 AM Strahil Nikolov <hunter86_bg(a)yahoo.com>
wrote:
On April 9, 2020 1:51:05 AM GMT+03:00, Shareef Jalloq <
shareef(a)jalloq.co.uk> wrote:
>Don't know if this is useful or not, but I just tried to shut down and start
>another VM on one of the hosts and got the following error:
>
>virsh # start scratch
>
>error: Failed to start domain scratch
>
>error: Network not found: no network with matching name
>'vdsm-ovirtmgmt'
>
>Is this not referring to the interface name, since the network is called
>'ovirtmgmt'?
>
>On Wed, Apr 8, 2020 at 11:35 PM Shareef Jalloq <shareef(a)jalloq.co.uk>
>wrote:
>
>> Hmmm, virsh tells me the HE is running but it hasn't come up and the
>> agent.log is full of the same errors.
>>
>> On Wed, Apr 8, 2020 at 11:31 PM Shareef Jalloq <shareef(a)jalloq.co.uk>
>> wrote:
>>
>>> Ah hah! Ok, so I've managed to start it using virsh on the second host
>>> but my first host is still dead.
>>>
>>> First of all, what are these 56,317 .prob- files that get dumped to the
>>> NFS mounts?
>>>
>>> Secondly, why doesn't the node mount the NFS directories at boot? Is
>>> that the issue with this particular node?
>>>
>>> On Wed, Apr 8, 2020 at 11:12 PM <eevans(a)digitaldatatechs.com> wrote:
>>>
>>>> Did you try 'virsh list --inactive'?
>>>>
>>>>
>>>>
>>>> Eric Evans
>>>>
>>>> Digital Data Services LLC.
>>>>
>>>> 304.660.9080
>>>>
>>>>
>>>>
>>>> *From:* Shareef Jalloq <shareef(a)jalloq.co.uk>
>>>> *Sent:* Wednesday, April 8, 2020 5:58 PM
>>>> *To:* Strahil Nikolov <hunter86_bg(a)yahoo.com>
>>>> *Cc:* Ovirt Users <users(a)ovirt.org>
>>>> *Subject:* [ovirt-users] Re: ovirt-engine unresponsive - how to
>rescue?
>>>>
>>>>
>>>>
>>>> I've now shut down the VMs on one host and rebooted it but the agent
>>>> service doesn't start. If I run 'hosted-engine --vm-status' I get:
>>>>
>>>>
>>>>
>>>> The hosted engine configuration has not been retrieved from shared
>>>> storage. Please ensure that ovirt-ha-agent is running and the storage
>>>> server is reachable.
>>>>
>>>>
>>>>
>>>> and indeed if I list the mounts under /rhev/data-center/mnt, only one of
>>>> the directories is mounted. I have 3 NFS mounts, one ISO Domain and two
>>>> Data Domains. Only one Data Domain has mounted and this has lots of .prob
>>>> files in. So why haven't the other NFS exports been mounted?
>>>>
>>>>
>>>>
>>>> Manually mounting them doesn't seem to have helped much either. I can
>>>> start the broker service but the agent service says no. Same error as the
>>>> one in my last email.
>>>>
>>>>
>>>>
>>>> Shareef.
>>>>
>>>>
>>>>
>>>> On Wed, Apr 8, 2020 at 9:57 PM Shareef Jalloq <shareef(a)jalloq.co.uk>
>>>> wrote:
>>>>
>>>> Right, still down. I've run virsh and it doesn't know anything about
>>>> the engine vm.
>>>>
>>>>
>>>>
>>>> I've restarted the broker and agent services and I still get nothing in
>>>> virsh->list.
>>>>
>>>>
>>>>
>>>> In the logs under /var/log/ovirt-hosted-engine-ha I see lots of errors:
>>>>
>>>>
>>>>
>>>> broker.log:
>>>>
>>>>
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,138::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run)
>>>> ovirt-hosted-engine-ha broker 2.3.6 started
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,138::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Searching for submonitors in
>>>>
>/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,138::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor network
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor cpu-load-no-engine
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor mgmt-bridge
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor network
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor cpu-load
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor engine-health
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor mgmt-bridge
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor cpu-load-no-engine
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor cpu-load
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor mem-free
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor storage-domain
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor storage-domain
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor mem-free
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Loaded submonitor engine-health
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,143::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Finished loading submonitors
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,197::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect)
>>>> Connecting the storage
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,197::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>>>> Connecting storage server
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,414::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>>>> Connecting storage server
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:20,628::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>>>> Refreshing the storage domain
>>>>
>>>> MainThread::WARNING::2020-04-08
>>>>
>20:56:21,057::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__)
>>>> Can't connect vdsm storage: Command StorageDomain.getInfo with args
>>>> {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
>>>>
>>>> (code=350, message=Error in storage domain action:
>>>> (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:21,901::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run)
>>>> ovirt-hosted-engine-ha broker 2.3.6 started
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:56:21,901::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors)
>>>> Searching for submonitors in
>>>>
>/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>>>>
>>>>
>>>>
>>>> agent.log:
>>>>
>>>>
>>>>
>>>> MainThread::ERROR::2020-04-08
>>>>
>20:57:00,799::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>>>> Trying to restart agent
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:57:00,799::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
>>>> Agent shutting down
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:57:11,144::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
>>>> ovirt-hosted-engine-ha agent 2.3.6 started
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:57:11,182::hosted_engine::234::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname)
>>>> Found certificate common name: ovirt-node-01.phoelex.com
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:57:11,294::hosted_engine::543::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
>>>> Initializing ha-broker connection
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:57:11,296::brokerlink::80::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor)
>>>> Starting monitor network, options {'tcp_t_address': '', 'network_test':
>>>> 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}
>>>>
>>>> MainThread::ERROR::2020-04-08
>>>>
>20:57:11,296::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
>>>> Failed to start necessary monitors
>>>>
>>>> MainThread::ERROR::2020-04-08
>>>>
>20:57:11,297::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>>>> Traceback (most recent call last):
>>>>
>>>> File
>>>>
>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
>>>> line 131, in _run_agent
>>>>
>>>> return action(he)
>>>>
>>>> File
>>>>
>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
>>>> line 55, in action_proper
>>>>
>>>> return he.start_monitoring()
>>>>
>>>> File
>>>>
>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
>>>> line 432, in start_monitoring
>>>>
>>>> self._initialize_broker()
>>>>
>>>> File
>>>>
>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
>>>> line 556, in _initialize_broker
>>>>
>>>> m.get('options', {}))
>>>>
>>>> File
>>>>
>"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
>>>> line 89, in start_monitor
>>>>
>>>> ).format(t=type, o=options, e=e)
>>>>
>>>> RequestError: brokerlink - failed to start monitor via ovirt-ha-broker:
>>>> [Errno 2] No such file or directory, [monitor: 'network', options:
>>>> {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr':
>>>> '192.168.1.99'}]
>>>>
>>>>
>>>>
>>>> MainThread::ERROR::2020-04-08
>>>>
>20:57:11,297::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
>>>> Trying to restart agent
>>>>
>>>> MainThread::INFO::2020-04-08
>>>>
>20:57:11,297::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
>>>> Agent shutting down
>>>>
>>>>
>>>>
>>>> On Wed, Apr 8, 2020 at 6:10 PM Strahil Nikolov
><hunter86_bg(a)yahoo.com>
>>>> wrote:
>>>>
>>>> On April 8, 2020 7:47:20 PM GMT+03:00, "Maton, Brett" <
>>>> matonb(a)ltresources.co.uk> wrote:
>>>> >On the host you tried to restart the engine on:
>>>> >
>>>> >Add an alias to virsh (authenticates with virsh_auth.conf)
>>>> >
>>>> >alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
>>>> >
>>>> >Then run virsh:
>>>> >
>>>> >virsh
>>>> >
>>>> >virsh # list
>>>> > Id Name State
>>>> >----------------------------------------------------
>>>> > xx HostedEngine Paused
>>>> > xx ********** running
>>>> > ...
>>>> > xx ********** running
>>>> >
>>>> >HostedEngine should be in the list, try and resume the engine:
>>>> >
>>>> >virsh # resume HostedEngine
>>>> >
>>>> >On Wed, 8 Apr 2020 at 17:28, Shareef Jalloq <shareef(a)jalloq.co.uk>
>>>> >wrote:
>>>> >
>>>> >> Thanks!
>>>> >>
>>>> >> The status hangs due to, I guess, the VM being down....
>>>> >>
>>>> >> [root@ovirt-node-01 ~]# hosted-engine --vm-start
>>>> >> VM exists and is down, cleaning up and restarting
>>>> >> VM in WaitForLaunch
>>>> >>
>>>> >> but this doesn't seem to do anything. OK, after a while I get a
>>>> >> status of it being barfed...
>>>> >>
>>>> >> --== Host ovirt-node-00.phoelex.com (id: 1) status ==--
>>>> >>
>>>> >> conf_on_shared_storage : True
>>>> >> Status up-to-date : False
>>>> >> Hostname : ovirt-node-00.phoelex.com
>>>> >> Host ID : 1
>>>> >> Engine status : unknown stale-data
>>>> >> Score : 3400
>>>> >> stopped : False
>>>> >> Local maintenance : False
>>>> >> crc32 : 9c4a034b
>>>> >> local_conf_timestamp : 523362
>>>> >> Host timestamp : 523608
>>>> >> Extra metadata (valid at timestamp):
>>>> >> metadata_parse_version=1
>>>> >> metadata_feature_version=1
>>>> >> timestamp=523608 (Wed Apr 8 16:17:11 2020)
>>>> >> host-id=1
>>>> >> score=3400
>>>> >> vm_conf_refresh_time=523362 (Wed Apr 8 16:13:06 2020)
>>>> >> conf_on_shared_storage=True
>>>> >> maintenance=False
>>>> >> state=EngineDown
>>>> >> stopped=False
>>>> >>
>>>> >>
>>>> >> --== Host ovirt-node-01.phoelex.com (id: 2) status ==--
>>>> >>
>>>> >> conf_on_shared_storage : True
>>>> >> Status up-to-date : True
>>>> >> Hostname : ovirt-node-01.phoelex.com
>>>> >> Host ID : 2
>>>> >> Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
>>>> >> Score : 0
>>>> >> stopped : False
>>>> >> Local maintenance : False
>>>> >> crc32 : 5045f2eb
>>>> >> local_conf_timestamp : 1737037
>>>> >> Host timestamp : 1737283
>>>> >> Extra metadata (valid at timestamp):
>>>> >> metadata_parse_version=1
>>>> >> metadata_feature_version=1
>>>> >> timestamp=1737283 (Wed Apr 8 16:16:17 2020)
>>>> >> host-id=2
>>>> >> score=0
>>>> >> vm_conf_refresh_time=1737037 (Wed Apr 8 16:12:11 2020)
>>>> >> conf_on_shared_storage=True
>>>> >> maintenance=False
>>>> >> state=EngineUnexpectedlyDown
>>>> >> stopped=False
>>>> >>
>>>> >> On Wed, Apr 8, 2020 at 5:09 PM Maton, Brett <matonb(a)ltresources.co.uk>
>>>> >> wrote:
>>>> >>
>>>> >>> First steps, on one of your hosts as root:
>>>> >>>
>>>> >>> To get information:
>>>> >>> hosted-engine --vm-status
>>>> >>>
>>>> >>> To start the engine:
>>>> >>> hosted-engine --vm-start
>>>> >>>
>>>> >>>
>>>> >>> On Wed, 8 Apr 2020 at 17:00, Shareef Jalloq <shareef(a)jalloq.co.uk> wrote:
>>>> >>>
>>>> >>>> So my engine has gone down and I can't ssh into it either. If I try to
>>>> >>>> log into the web-ui of the node it is running on, I get redirected because
>>>> >>>> the node can't reach the engine.
>>>> >>>>
>>>> >>>> What are my next steps?
>>>> >>>>
>>>> >>>> Shareef.
>>>> >>>>
>>>> >>>
>>>>
>>>> This has to be resolved:
>>>>
>>>> Engine status : unknown stale-data
>>>>
>>>> Run again 'hosted-engine --vm-status'. If it remains the same, restart
>>>> ovirt-ha-broker.service & ovirt-ha-agent.service
>>>>
>>>> Verify that the engine's storage is available. Then monitor the broker
>>>> & agent logs in /var/log/ovirt-hosted-engine-ha
>>>>
>>>> Best Regards,
>>>> Strahil Nikolov
>>>>
>>>>
>>>>
>>>>
Hi Shareef,
The activation flow in oVirt is more complex than in plain KVM.
Mounting of the domains happens during the activation of the node (the
HostedEngine activates everything needed).
Focus on the HostedEngine VM.
Is it running properly?
If not, try:
1. Verify that the storage domain exists
2. Check if it has an 'ha_agent' directory
3. Check if the links are OK; if not, you can safely remove the links
4. Next, check that the services are running:
A) sanlock
B) supervdsmd
C) vdsmd
D) libvirtd
5. Increase the log level for broker and agent services:
cd /etc/ovirt-hosted-engine-ha
vim *-log.conf
systemctl restart ovirt-ha-broker ovirt-ha-agent
6. Check what they are complaining about
Keep in mind that the agent will keep throwing errors until the broker stops
doing so (the agent depends on the broker), so the broker must be OK before
proceeding with the agent log.
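Steps 2 and 3 can be scripted roughly as follows. This is only a sketch: check_ha_links is a hypothetical helper name, and it assumes the layout shown earlier in the thread, where ha_agent holds two symlinks named hosted-engine.lockspace and hosted-engine.metadata. It only reports; it does not remove anything.

```shell
#!/bin/sh
# check_ha_links SD_DIR
# SD_DIR is the mounted storage domain directory (the one containing
# dom_md, ha_agent, images and master).
# Reports whether ha_agent exists and whether the two hosted-engine links
# resolve to an existing target. Returns non-zero if anything looks wrong.
check_ha_links() {
    sd_dir=$1
    rc=0
    if [ ! -d "$sd_dir/ha_agent" ]; then
        echo "missing: $sd_dir/ha_agent"
        return 1
    fi
    for name in hosted-engine.lockspace hosted-engine.metadata; do
        link="$sd_dir/ha_agent/$name"
        if [ ! -L "$link" ]; then
            echo "not a symlink: $name"
            rc=1
        elif [ ! -e "$link" ]; then
            # -e follows the link, so failing here means the target is gone.
            echo "dangling: $name -> $(readlink "$link")"
            rc=1
        else
            echo "ok: $name"
        fi
    done
    return $rc
}
```

It would be called with the storage domain path under /rhev/data-center/mnt/, for example the a6cea67d-dbfb-45cf-a775-b4d0d47b26f2 directory listed earlier.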
About the manual VM start, you need 2 things:
1. Define the VM network
# cat vdsm-ovirtmgmt.xml
<network>
<name>vdsm-ovirtmgmt</name>
<uuid>8ded486e-e681-4754-af4b-5737c2b05405</uuid>
<forward mode='bridge'/>
<bridge name='ovirtmgmt'/>
</network>
[root@ovirt1 HostedEngine-RECOVERY]# virsh net-define vdsm-ovirtmgmt.xml
[root@ovirt1 HostedEngine-RECOVERY]# virsh net-start vdsm-ovirtmgmt
2. Get an XML definition, which can be found in the vdsm log. Every VM has
its configuration printed in the vdsm log on the host where it starts.
Save it to a file and then:
A) virsh define myvm.xml
B) virsh start myvm
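Finding the XML by hand in a large vdsm log is tedious, so here is a rough sketch for extracting it. It rests on assumptions that should be checked against the actual log: that the definition appears as a block running from a <domain ...> tag to </domain>, possibly with a log prefix before the opening tag and a suffix after the closing one, and that the VM name appears as <name>NAME</name> inside it. last_domain_xml is a hypothetical helper name.

```shell
#!/bin/sh
# last_domain_xml NAME LOGFILE
# Print the last <domain>...</domain> block in LOGFILE that contains
# <name>NAME</name>. Prints nothing if no such block is found.
last_domain_xml() {
    awk -v name="$1" '
        # Start collecting at the opening <domain ...> tag, dropping any
        # log prefix that precedes it on the same line.
        !inside && match($0, /<domain[ >]/) {
            inside = 1
            buf = ""
            $0 = substr($0, RSTART)
        }
        inside {
            stop = index($0, "</domain>")
            if (stop) {
                # Keep text up to and including the closing tag only.
                buf = buf substr($0, 1, stop + length("</domain>") - 1) "\n"
                inside = 0
                # Remember this block only if it defines the requested VM.
                if (index(buf, "<name>" name "</name>")) last = buf
            } else {
                buf = buf $0 "\n"
            }
        }
        END { printf "%s", last }
    ' "$2"
}
```

For example, last_domain_xml HostedEngine /var/log/vdsm/vdsm.log > myvm.xml (the log path is the usual default and may differ on your host). Review the resulting file before passing it to virsh define.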
It seems there is/was a problem with your NFS shares.
Best Regards,
Strahil Nikolov