On April 9, 2020 11:12:30 AM GMT+03:00, Shareef Jalloq <shareef(a)jalloq.co.uk>
wrote:
OK, let's go through this. I'm looking at the node that at least still has
some VMs running. virsh also tells me that the HostedEngine VM is running
but it's unresponsive and I can't shut it down.
1. All storage domains exist and are mounted.
2. The ha_agent exists:
[root@ovirt-node-01 ovirt-hosted-engine-ha]# ls /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/
dom_md ha_agent images master
3. There are two links:
[root@ovirt-node-01 ovirt-hosted-engine-ha]# ll /rhev/data-center/mnt/nas-01.phoelex.com\:_volume2_vmstore/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ha_agent/
total 8
lrwxrwxrwx. 1 vdsm kvm 132 Apr 2 14:50 hosted-engine.lockspace -> /var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/ffb90b82-42fe-4253-85d5-aaec8c280aaf/90e68791-0c6f-406a-89ac-e0d86c631604
lrwxrwxrwx. 1 vdsm kvm 132 Apr 2 14:50 hosted-engine.metadata -> /var/run/vdsm/storage/a6cea67d-dbfb-45cf-a775-b4d0d47b26f2/2161aed0-7250-4c1d-b667-ac94f60af17e/6b818e33-f80a-48cc-a59c-bba641e027d4
4. The services exist but all seem to have some sort of warning:
a) Apr 08 18:10:55 ovirt-node-01.phoelex.com sanlock[1728]: *2020-04-08 18:10:55 1744152 [36796]: s16 delta_renew long write time 10 sec*
b) Mar 23 18:02:59 ovirt-node-01.phoelex.com supervdsmd[29409]: *failed to load module nvdimm: libbd_nvdimm.so.2: cannot open shared object file: No such file or directory*
c) Apr 09 08:05:13 ovirt-node-01.phoelex.com vdsm[4801]: *ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?*
d) Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 22:48:27.134+0000: 29309: warning : qemuGetProcessInfo:1404 : cannot parse process status data
Apr 08 22:48:27 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 22:48:27.134+0000: 29309: error : virNetDevTapInterfaceStats:764 : internal error: /proc/net/dev: Interface not found
Apr 08 23:09:39 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-08 23:09:39.844+0000: 29307: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
Apr 09 01:05:26 ovirt-node-01.phoelex.com libvirtd[29307]: 2020-04-09 01:05:26.660+0000: 29307: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
5 & 6. The broker log is continually printing this error:
MainThread::INFO::2020-04-09 08:07:31,438::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
MainThread::DEBUG::2020-04-09 08:07:31,438::broker::55::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Running broker
MainThread::DEBUG::2020-04-09 08:07:31,438::broker::120::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_monitor) Starting monitor
MainThread::INFO::2020-04-09 08:07:31,438::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2020-04-09 08:07:31,439::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-04-09 08:07:31,440::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-04-09 08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-04-09 08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-04-09 08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-04-09 08:07:31,441::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-04-09 08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-04-09 08:07:31,442::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-04-09 08:07:31,443::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-04-09 08:07:31,444::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-04-09 08:07:31,444::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
MainThread::DEBUG::2020-04-09 08:07:31,444::broker::128::ovirt_hosted_engine_ha.broker.broker.Broker::(_get_storage_broker) Starting storage broker
MainThread::DEBUG::2020-04-09 08:07:31,444::storage_backends::369::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting to VDSM
MainThread::DEBUG::2020-04-09 08:07:31,444::util::384::ovirt_hosted_engine_ha.lib.storage_backends::(__log_debug) Creating a new json-rpc connection to VDSM
Client localhost:54321::DEBUG::2020-04-09 08:07:31,453::concurrent::258::root::(run) START thread <Thread(Client localhost:54321, started daemon 139992488138496)> (func=<bound method Reactor.process_requests of <yajsonrpc.betterAsyncore.Reactor object at 0x7f528acabc90>>, args=(), kwargs={})
Client localhost:54321::DEBUG::2020-04-09 08:07:31,459::stompclient::138::yajsonrpc.protocols.stomp.AsyncClient::(_process_connected) Stomp connection established
MainThread::DEBUG::2020-04-09 08:07:31,467::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::INFO::2020-04-09 08:07:31,530::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
MainThread::INFO::2020-04-09 08:07:31,531::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::DEBUG::2020-04-09 08:07:31,531::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:31,534::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:32,199::storage_server::158::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(_validate_pre_connected_path) Storage domain a6cea67d-dbfb-45cf-a775-b4d0d47b26f2 is not available
MainThread::INFO::2020-04-09 08:07:32,199::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::DEBUG::2020-04-09 08:07:32,199::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:32,814::storage_server::363::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) [{u'status': 0, u'id': u'e29cf818-5ee5-46e1-85c1-8aeefa33e95d'}]
MainThread::INFO::2020-04-09 08:07:32,814::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::DEBUG::2020-04-09 08:07:32,815::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:33,129::storage_server::420::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Error refreshing storage domain: Command StorageDomain.getStats with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed: (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
MainThread::DEBUG::2020-04-09 08:07:33,130::stompclient::294::jsonrpc.AsyncoreClient::(send) Sending response
MainThread::DEBUG::2020-04-09 08:07:33,795::storage_backends::208::ovirt_hosted_engine_ha.lib.storage_backends::(_get_sector_size) Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed: (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
MainThread::WARNING::2020-04-09 08:07:33,795::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed: (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
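With DEBUG logging on, the interesting broker entries are easy to lose among the submonitor and "Sending response" lines. Here is a minimal Python 3 sketch that keeps only entries at WARNING or above. The field layout in the regular expression is inferred from the entries in this thread, not taken from the ovirt-hosted-engine-ha sources, and the names `ENTRY`, `SEVERITY` and `noteworthy` are made up for illustration.

```python
import re

# Assumed layout, inferred from the log entries in this thread:
# thread::LEVEL::timestamp::module::line::logger::(function) message
ENTRY = re.compile(
    r"^(?P<thread>.+?)::(?P<level>[A-Z]+)::"
    r"(?P<time>\d{4}-\d{2}-\d{2} [\d:,]+)::"
    r"(?P<module>\w+)::(?P<line>\d+)::(?P<logger>[\w.]+)::"
    r"\((?P<func>\w+)\) ?(?P<msg>.*)$"
)

SEVERITY = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}


def noteworthy(lines, minimum="WARNING"):
    """Return (time, level, function, message) for entries at or above `minimum`."""
    threshold = SEVERITY[minimum]
    found = []
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match is None:
            # Continuation lines (for example traceback frames) are skipped.
            continue
        if SEVERITY.get(match["level"], 0) >= threshold:
            found.append(
                (match["time"], match["level"], match["func"], match["msg"])
            )
    return found
```

To use it, open broker.log under /var/log/ovirt-hosted-engine-ha and pass the file object to `noteworthy()`. This assumes each entry sits on a single line in the file itself, unlike the wrapped copies in this mail.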
The UUID it is moaning about is indeed the one that the HA sits on and is
the one I listed the contents of in step 2 above.
So why can't it see this domain?
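One thing the `ll` output in step 3 does not show is whether the targets under /var/run/vdsm/storage actually exist. A link is listed the same way whether or not its target is there. Here is a minimal Python 3 sketch that reports dangling links in a directory. The function name `dangling_links` is made up for illustration, and it should be run as a user that can stat the targets (root, for example).

```python
import os


def dangling_links(directory):
    """Return {link name: target} for symlinks whose target does not exist."""
    broken = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists() follows the link, so it is False for a dangling one.
        if os.path.islink(path) and not os.path.exists(path):
            broken[name] = os.readlink(path)
    return broken
```

Pointing it at the ha_agent directory from step 3 would return an empty dict if both hosted-engine.lockspace and hosted-engine.metadata resolve.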
Thanks, Shareef.
On Thu, Apr 9, 2020 at 6:12 AM Strahil Nikolov <hunter86_bg(a)yahoo.com>
wrote:
> On April 9, 2020 1:51:05 AM GMT+03:00, Shareef Jalloq <shareef(a)jalloq.co.uk> wrote:
> >Don't know if this is useful or not, but I just tried to shut down and
> >start another VM on one of the hosts and get the following error:
> >
> >virsh # start scratch
> >
> >error: Failed to start domain scratch
> >
> >error: Network not found: no network with matching name
> >'vdsm-ovirtmgmt'
> >
> >Is this not referring to the interface name, as the network is called
> >'ovirtmgmt'?
> >
> >On Wed, Apr 8, 2020 at 11:35 PM Shareef Jalloq <shareef(a)jalloq.co.uk>
> >wrote:
> >
> >> Hmmm, virsh tells me the HE is running but it hasn't come up and the
> >> agent.log is full of the same errors.
> >>
> >> On Wed, Apr 8, 2020 at 11:31 PM Shareef Jalloq <shareef(a)jalloq.co.uk>
> >> wrote:
> >>
> >>> Ah hah! Ok, so I've managed to start it using virsh on the second host
> >>> but my first host is still dead.
> >>>
> >>> First of all, what are these 56,317 .prob- files that get dumped to the
> >>> NFS mounts?
> >>>
> >>> Secondly, why doesn't the node mount the NFS directories at boot? Is
> >>> that the issue with this particular node?
> >>>
> >>> On Wed, Apr 8, 2020 at 11:12 PM <eevans(a)digitaldatatechs.com> wrote:
> >>>
> >>>> Did you try virsh list --inactive
> >>>>
> >>>>
> >>>>
> >>>> Eric Evans
> >>>>
> >>>> Digital Data Services LLC.
> >>>>
> >>>> 304.660.9080
> >>>>
> >>>>
> >>>>
> >>>> *From:* Shareef Jalloq <shareef(a)jalloq.co.uk>
> >>>> *Sent:* Wednesday, April 8, 2020 5:58 PM
> >>>> *To:* Strahil Nikolov <hunter86_bg(a)yahoo.com>
> >>>> *Cc:* Ovirt Users <users(a)ovirt.org>
> >>>> *Subject:* [ovirt-users] Re: ovirt-engine unresponsive - how to rescue?
> >>>>
> >>>> I've now shut down the VMs on one host and rebooted it but the agent
> >>>> service doesn't start. If I run 'hosted-engine --vm-status' I get:
> >>>>
> >>>> The hosted engine configuration has not been retrieved from shared
> >>>> storage. Please ensure that ovirt-ha-agent is running and the storage
> >>>> server is reachable.
> >>>>
> >>>> and indeed if I list the mounts under /rhev/data-center/mnt, only one of
> >>>> the directories is mounted. I have 3 NFS mounts, one ISO Domain and two
> >>>> Data Domains. Only one Data Domain has mounted and this has lots of .prob
> >>>> files in. So why haven't the other NFS exports been mounted?
> >>>>
> >>>> Manually mounting them doesn't seem to have helped much either. I can
> >>>> start the broker service but the agent service says no. Same error as the
> >>>> one in my last email.
> >>>>
> >>>>
> >>>>
> >>>> Shareef.
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Apr 8, 2020 at 9:57 PM Shareef Jalloq <shareef(a)jalloq.co.uk>
> >>>> wrote:
> >>>>
> >>>> Right, still down. I've run virsh and it doesn't know anything about
> >>>> the engine vm.
> >>>>
> >>>> I've restarted the broker and agent services and I still get nothing in
> >>>> virsh->list.
> >>>>
> >>>> In the logs under /var/log/ovirt-hosted-engine-ha I see lots of errors:
> >>>>
> >>>>
> >>>>
> >>>> broker.log:
> >>>>
> >>>>
> >>>>
> >>>> MainThread::INFO::2020-04-08 20:56:20,138::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
> >>>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> >>>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> >>>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> >>>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> >>>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> >>>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> >>>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> >>>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> >>>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> >>>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> >>>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> >>>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
> >>>> MainThread::INFO::2020-04-08 20:56:20,197::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
> >>>> MainThread::INFO::2020-04-08 20:56:20,197::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
> >>>> MainThread::INFO::2020-04-08 20:56:20,414::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
> >>>> MainThread::INFO::2020-04-08 20:56:20,628::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
> >>>> MainThread::WARNING::2020-04-08 20:56:21,057::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed: (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
> >>>> MainThread::INFO::2020-04-08 20:56:21,901::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
> >>>> MainThread::INFO::2020-04-08 20:56:21,901::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> >>>>
> >>>>
> >>>>
> >>>> agent.log:
> >>>>
> >>>>
> >>>>
> >>>> MainThread::ERROR::2020-04-08 20:57:00,799::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
> >>>> MainThread::INFO::2020-04-08 20:57:00,799::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
> >>>> MainThread::INFO::2020-04-08 20:57:11,144::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.3.6 started
> >>>> MainThread::INFO::2020-04-08 20:57:11,182::hosted_engine::234::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: ovirt-node-01.phoelex.com
> >>>> MainThread::INFO::2020-04-08 20:57:11,294::hosted_engine::543::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
> >>>> MainThread::INFO::2020-04-08 20:57:11,296::brokerlink::80::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}
> >>>> MainThread::ERROR::2020-04-08 20:57:11,296::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
> >>>> MainThread::ERROR::2020-04-08 20:57:11,297::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
> >>>>     return action(he)
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
> >>>>     return he.start_monitoring()
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
> >>>>     self._initialize_broker()
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker
> >>>>     m.get('options', {}))
> >>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor
> >>>>     ).format(t=type, o=options, e=e)
> >>>> RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}]
> >>>>
> >>>> MainThread::ERROR::2020-04-08 20:57:11,297::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
> >>>> MainThread::INFO::2020-04-08 20:57:11,297::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
> >>>>
> >>>>
> >>>> On Wed, Apr 8, 2020 at 6:10 PM Strahil Nikolov <hunter86_bg(a)yahoo.com>
> >>>> wrote:
> >>>>
> >>>> On April 8, 2020 7:47:20 PM GMT+03:00, "Maton, Brett" <matonb(a)ltresources.co.uk> wrote:
> >>>> >On the host you tried to restart the engine on:
> >>>> >
> >>>> >Add an alias to virsh (authenticates with virsh_auth.conf)
> >>>> >
> >>>> >alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
> >>>> >
> >>>> >Then run virsh:
> >>>> >
> >>>> >virsh
> >>>> >
> >>>> >virsh # list
> >>>> > Id    Name                           State
> >>>> >----------------------------------------------------
> >>>> > xx    HostedEngine                   Paused
> >>>> > xx    **********                     running
> >>>> > ...
> >>>> > xx    **********                     running
> >>>> >
> >>>> >HostedEngine should be in the list, try and resume the engine:
> >>>> >
> >>>> >virsh # resume HostedEngine
> >>>> >
> >>>> >On Wed, 8 Apr 2020 at 17:28, Shareef Jalloq <shareef(a)jalloq.co.uk>
> >>>> >wrote:
> >>>> >
> >>>> >> Thanks!
> >>>> >>
> >>>> >> The status hangs due to, I guess, the VM being down....
> >>>> >>
> >>>> >> [root@ovirt-node-01 ~]# hosted-engine --vm-start
> >>>> >> VM exists and is down, cleaning up and restarting
> >>>> >> VM in WaitForLaunch
> >>>> >>
> >>>> >> but this doesn't seem to do anything. OK, after a while I get a
> >>>> >> status of it being barfed...
> >>>> >>
> >>>> >> --== Host ovirt-node-00.phoelex.com (id: 1) status ==--
> >>>> >>
> >>>> >> conf_on_shared_storage             : True
> >>>> >> Status up-to-date                  : False
> >>>> >> Hostname                           : ovirt-node-00.phoelex.com
> >>>> >> Host ID                            : 1
> >>>> >> Engine status                      : unknown stale-data
> >>>> >> Score                              : 3400
> >>>> >> stopped                            : False
> >>>> >> Local maintenance                  : False
> >>>> >> crc32                              : 9c4a034b
> >>>> >> local_conf_timestamp               : 523362
> >>>> >> Host timestamp                     : 523608
> >>>> >> Extra metadata (valid at timestamp):
> >>>> >>     metadata_parse_version=1
> >>>> >>     metadata_feature_version=1
> >>>> >>     timestamp=523608 (Wed Apr 8 16:17:11 2020)
> >>>> >>     host-id=1
> >>>> >>     score=3400
> >>>> >>     vm_conf_refresh_time=523362 (Wed Apr 8 16:13:06 2020)
> >>>> >>     conf_on_shared_storage=True
> >>>> >>     maintenance=False
> >>>> >>     state=EngineDown
> >>>> >>     stopped=False
> >>>> >>
> >>>> >>
> >>>> >> --== Host ovirt-node-01.phoelex.com (id: 2) status ==--
> >>>> >>
> >>>> >> conf_on_shared_storage             : True
> >>>> >> Status up-to-date                  : True
> >>>> >> Hostname                           : ovirt-node-01.phoelex.com
> >>>> >> Host ID                            : 2
> >>>> >> Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
> >>>> >> Score                              : 0
> >>>> >> stopped                            : False
> >>>> >> Local maintenance                  : False
> >>>> >> crc32                              : 5045f2eb
> >>>> >> local_conf_timestamp               : 1737037
> >>>> >> Host timestamp                     : 1737283
> >>>> >> Extra metadata (valid at timestamp):
> >>>> >>     metadata_parse_version=1
> >>>> >>     metadata_feature_version=1
> >>>> >>     timestamp=1737283 (Wed Apr 8 16:16:17 2020)
> >>>> >>     host-id=2
> >>>> >>     score=0
> >>>> >>     vm_conf_refresh_time=1737037 (Wed Apr 8 16:12:11 2020)
> >>>> >>     conf_on_shared_storage=True
> >>>> >>     maintenance=False
> >>>> >>     state=EngineUnexpectedlyDown
> >>>> >>     stopped=False
> >>>> >>
> >>>> >> On Wed, Apr 8, 2020 at 5:09 PM Maton, Brett <matonb(a)ltresources.co.uk>
> >>>> >> wrote:
> >>>> >>
> >>>> >>> First steps, on one of your hosts as root:
> >>>> >>>
> >>>> >>> To get information:
> >>>> >>> hosted-engine --vm-status
> >>>> >>>
> >>>> >>> To start the engine:
> >>>> >>> hosted-engine --vm-start
> >>>> >>>
> >>>> >>>
> >>>> >>> On Wed, 8 Apr 2020 at 17:00, Shareef Jalloq <shareef(a)jalloq.co.uk> wrote:
> >>>> >>>
> >>>> >>>> So my engine has gone down and I can't ssh into it either. If I try to
> >>>> >>>> log into the web-ui of the node it is running on, I get redirected because
> >>>> >>>> the node can't reach the engine.
> >>>> >>>>
> >>>> >>>> What are my next steps?
> >>>> >>>>
> >>>> >>>> Shareef.
> >>>> >>>> _______________________________________________
> >>>> >>>> Users mailing list -- users(a)ovirt.org
> >>>> >>>> To unsubscribe send an email to users-leave(a)ovirt.org
> >>>> >>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> >>>> >>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> >>>> >>>> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/W7BP57OCIRS...
> >>>> >>>>
> >>>>
> >>>> This has to be resolved:
> >>>>
> >>>> Engine status : unknown stale-data
> >>>>
> >>>> Run 'hosted-engine --vm-status' again. If it remains the same, restart
> >>>> ovirt-ha-broker.service & ovirt-ha-agent.service.
> >>>>
> >>>> Verify that the engine's storage is available. Then monitor the broker
> >>>> & agent logs in /var/log/ovirt-hosted-engine-ha
> >>>>
> >>>> Best Regards,
> >>>> Strahil Nikolov
> >>>>
> >>>>
> >>>>
> >>>>
>
> Hi Shareef,
>
> The activation flow in oVirt is more complex than in plain KVM.
> Mounting of the domains happens during the activation of the node (the
> HostedEngine is activating everything needed).
>
> Focus on the HostedEngine VM.
> Is it running properly?
>
> If not, try:
> 1. Verify that the storage domain exists
> 2. Check if it has the 'ha_agent' directory
> 3. Check if the links are OK; if not, you can safely remove the links
>
> 4. Next check the services are running:
> A) sanlock
> B) supervdsmd
> C) vdsmd
> D) libvirtd
>
> 5. Increase the log level for broker and agent services:
>
> cd /etc/ovirt-hosted-engine-ha
> vim *-log.conf
>
> systemctl restart ovirt-ha-broker ovirt-ha-agent
>
> 6. Check what they are complaining about
> Keep in mind that the agent will keep throwing errors until the broker
> stops doing it (the agent depends on the broker), so the broker must be OK
> before proceeding with the agent log.
>
> About the manual VM start, you need 2 things:
>
> 1. Define the VM network
> # cat vdsm-ovirtmgmt.xml
> <network>
>   <name>vdsm-ovirtmgmt</name>
>   <uuid>8ded486e-e681-4754-af4b-5737c2b05405</uuid>
>   <forward mode='bridge'/>
>   <bridge name='ovirtmgmt'/>
> </network>
>
> [root@ovirt1 HostedEngine-RECOVERY]# virsh net-define vdsm-ovirtmgmt.xml
>
> 2. Get an XML definition, which can be found in the vdsm log. Every VM
> has its configuration printed out in the vdsm log at start-up, on the
> host it starts on.
> Save to file and then:
> A) virsh define myvm.xml
> B) virsh start myvm
>
> It seems there is/was a problem with your NFS shares.
>
>
> Best Regards,
> Strahil Nikolov
>
Hey Shareef,
Check if there are any files or folders not owned by vdsm:kvm. Something like this:
find . \( -not -user 36 -o -not -group 36 \) -print
Also check if vdsm can access the images in the '<vol-mount-point>/images'
directories.
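
If find is awkward to run across several mounts, the same ownership check can be sketched in Python 3. This is only an illustration: `wrong_owner` is a made-up name, and 36:36 is taken from the vdsm:kvm IDs used above. It flags a path when either the owner or the group differs, and it does not follow symlinks.

```python
import os

VDSM_UID = 36  # vdsm user, as used in the find command above
KVM_GID = 36   # kvm group


def wrong_owner(root, uid=VDSM_UID, gid=KVM_GID):
    """Return sorted paths under `root` (inclusive) not owned by uid:gid."""
    suspects = []

    def check(path):
        info = os.lstat(path)  # lstat: inspect a link itself, not its target
        if info.st_uid != uid or info.st_gid != gid:
            suspects.append(path)

    check(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            check(os.path.join(dirpath, name))
    return sorted(suspects)
```

An empty list for the storage domain directory means ownership is fine and the problem lies elsewhere. Anything it returns is a candidate for chown 36:36.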
Best Regards,
Strahil Nikolov