Don't know if this is useful or not, but I just tried to shut down and start
another VM on one of the hosts and got the following error:
virsh # start scratch
error: Failed to start domain scratch
error: Network not found: no network with matching name 'vdsm-ovirtmgmt'
Is this not referring to the interface name, given that the network is called
'ovirtmgmt'?
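
My understanding, which is an assumption and not something confirmed in this thread, is that VDSM registers each oVirt logical network with libvirt under a 'vdsm-' prefix. If so, 'vdsm-ovirtmgmt' would be the libvirt-side name of the ovirtmgmt network, not an interface name. Below is a minimal sketch for checking whether that network is defined on a host by parsing the text printed by 'virsh net-list --all'. The helper names, the default prefix and the sample output are all hypothetical.

```python
def parse_net_list(output):
    """Return {network_name: state} from the text of `virsh net-list --all`."""
    networks = {}
    seen_separator = False
    for line in output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if set(stripped) == {"-"}:
            # The dashed rule separates the header row from the data rows.
            seen_separator = True
            continue
        if not seen_separator:
            continue  # still in the header
        fields = stripped.split()
        if len(fields) >= 2:
            networks[fields[0]] = fields[1]
    return networks


def libvirt_network_missing(output, ovirt_network="ovirtmgmt", prefix="vdsm-"):
    """True if the prefixed network name is absent from the listing.

    The 'vdsm-' prefix is an assumption about how VDSM names libvirt networks.
    """
    return (prefix + ovirt_network) not in parse_net_list(output)


# Hypothetical output from a host where only libvirt's default network exists.
SAMPLE = """\
 Name      State    Autostart   Persistent
--------------------------------------------
 default   active   yes         yes
"""

if __name__ == "__main__":
    print(parse_net_list(SAMPLE))
    print(libvirt_network_missing(SAMPLE))
```

If the prefixed network really is missing from the listing, that would point at the host's VDSM network setup, not at the VM definition.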
On Wed, Apr 8, 2020 at 11:35 PM Shareef Jalloq <shareef(a)jalloq.co.uk> wrote:
Hmmm, virsh tells me the HE is running but it hasn't come up and the
agent.log is full of the same errors.
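
To see which errors the agent keeps repeating, the ERROR entries can be counted. This is a rough sketch that assumes each entry in the real file is on one line, in the 'thread::LEVEL::timestamp::module::line::logger::(function) message' layout visible in the excerpts quoted in this thread. The function name and the sample list are hypothetical.

```python
import re
from collections import Counter

# Layout inferred from the log excerpts quoted in this thread.
LOG_LINE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]*)\) (?P<message>.*)$"
)


def summarize_errors(lines):
    """Count ERROR entries by (function, message); skip non-matching lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[(match.group("func"), match.group("message"))] += 1
    return counts


# Entries taken from the agent.log excerpt in this thread, one per line.
SAMPLE_LOG = [
    "MainThread::ERROR::2020-04-08 20:57:00,799::agent::145::"
    "ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent",
    "MainThread::INFO::2020-04-08 20:57:00,799::agent::89::"
    "ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down",
    "MainThread::ERROR::2020-04-08 20:57:11,296::hosted_engine::559::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) "
    "Failed to start necessary monitors",
    "MainThread::ERROR::2020-04-08 20:57:11,297::agent::145::"
    "ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent",
]

if __name__ == "__main__":
    for (func, message), count in summarize_errors(SAMPLE_LOG).most_common():
        print(count, func, message)
```

On a host, the same function can be given an open file handle for /var/log/ovirt-hosted-engine-ha/agent.log. Traceback continuation lines do not match the pattern and are skipped.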
On Wed, Apr 8, 2020 at 11:31 PM Shareef Jalloq <shareef(a)jalloq.co.uk>
wrote:
> Ah hah! Ok, so I've managed to start it using virsh on the second host
> but my first host is still dead.
>
> First of all, what are these 56,317 .prob- files that get dumped to the
> NFS mounts?
>
> Secondly, why doesn't the node mount the NFS directories at boot? Is
> that the issue with this particular node?
>
> On Wed, Apr 8, 2020 at 11:12 PM <eevans(a)digitaldatatechs.com> wrote:
>
>> Did you try virsh list --inactive?
>>
>>
>>
>> Eric Evans
>>
>> Digital Data Services LLC.
>>
>> 304.660.9080
>>
>>
>>
>> *From:* Shareef Jalloq <shareef(a)jalloq.co.uk>
>> *Sent:* Wednesday, April 8, 2020 5:58 PM
>> *To:* Strahil Nikolov <hunter86_bg(a)yahoo.com>
>> *Cc:* Ovirt Users <users(a)ovirt.org>
>> *Subject:* [ovirt-users] Re: ovirt-engine unresponsive - how to rescue?
>>
>>
>>
>> I've now shut down the VMs on one host and rebooted it but the agent
>> service doesn't start. If I run 'hosted-engine --vm-status' I get:
>>
>>
>>
>> The hosted engine configuration has not been retrieved from shared
>> storage. Please ensure that ovirt-ha-agent is running and the storage
>> server is reachable.
>>
>>
>>
>> and indeed if I list the mounts under /rhev/data-center/mnt, only one of
>> the directories is mounted. I have 3 NFS mounts, one ISO Domain and two
>> Data Domains. Only one Data Domain has mounted and this has lots of .prob
>> files in. So why haven't the other NFS exports been mounted?
>>
>>
>>
>> Manually mounting them doesn't seem to have helped much either. I can
>> start the broker service but the agent service says no. Same error as the
>> one in my last email.
>>
>>
>>
>> Shareef.
>>
>>
>>
>> On Wed, Apr 8, 2020 at 9:57 PM Shareef Jalloq <shareef(a)jalloq.co.uk>
>> wrote:
>>
>> Right, still down. I've run virsh and it doesn't know anything about
>> the engine vm.
>>
>>
>>
>> I've restarted the broker and agent services and I still get nothing in
>> virsh->list.
>>
>>
>>
>> In the logs under /var/log/ovirt-hosted-engine-ha I see lots of errors:
>>
>>
>>
>> broker.log:
>>
>>
>>
>> MainThread::INFO::2020-04-08 20:56:20,138::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>> MainThread::INFO::2020-04-08 20:56:20,138::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
>> MainThread::INFO::2020-04-08 20:56:20,140::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
>> MainThread::INFO::2020-04-08 20:56:20,141::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
>> MainThread::INFO::2020-04-08 20:56:20,142::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
>> MainThread::INFO::2020-04-08 20:56:20,143::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
>> MainThread::INFO::2020-04-08 20:56:20,197::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
>> MainThread::INFO::2020-04-08 20:56:20,197::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
>> MainThread::INFO::2020-04-08 20:56:20,414::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
>> MainThread::INFO::2020-04-08 20:56:20,628::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
>> MainThread::WARNING::2020-04-08 20:56:21,057::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command StorageDomain.getInfo with args {'storagedomainID': 'a6cea67d-dbfb-45cf-a775-b4d0d47b26f2'} failed:
>> (code=350, message=Error in storage domain action: (u'sdUUID=a6cea67d-dbfb-45cf-a775-b4d0d47b26f2',))
>> MainThread::INFO::2020-04-08 20:56:21,901::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
>> MainThread::INFO::2020-04-08 20:56:21,901::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
>>
>>
>>
>> agent.log:
>>
>>
>>
>> MainThread::ERROR::2020-04-08 20:57:00,799::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
>> MainThread::INFO::2020-04-08 20:57:00,799::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
>> MainThread::INFO::2020-04-08 20:57:11,144::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.3.6 started
>> MainThread::INFO::2020-04-08 20:57:11,182::hosted_engine::234::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: ovirt-node-01.phoelex.com
>> MainThread::INFO::2020-04-08 20:57:11,294::hosted_engine::543::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
>> MainThread::INFO::2020-04-08 20:57:11,296::brokerlink::80::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}
>> MainThread::ERROR::2020-04-08 20:57:11,296::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
>> MainThread::ERROR::2020-04-08 20:57:11,297::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
>>     return action(he)
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
>>     return he.start_monitoring()
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
>>     self._initialize_broker()
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker
>>     m.get('options', {}))
>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor
>>     ).format(t=type, o=options, e=e)
>> RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': '', 'network_test': 'dns', 'tcp_t_port': '', 'addr': '192.168.1.99'}]
>>
>> MainThread::ERROR::2020-04-08 20:57:11,297::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
>> MainThread::INFO::2020-04-08 20:57:11,297::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
>>
>>
>>
>> On Wed, Apr 8, 2020 at 6:10 PM Strahil Nikolov <hunter86_bg(a)yahoo.com>
>> wrote:
>>
>> On April 8, 2020 7:47:20 PM GMT+03:00, "Maton, Brett" <
>> matonb(a)ltresources.co.uk> wrote:
>> >On the host you tried to restart the engine on:
>> >
>> >Add an alias to virsh (authenticates with virsh_auth.conf)
>> >
>> >alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
>> >
>> >Then run virsh:
>> >
>> >virsh
>> >
>> >virsh # list
>> > Id Name State
>> >----------------------------------------------------
>> > xx HostedEngine Paused
>> > xx ********** running
>> > ...
>> > xx ********** running
>> >
>> >HostedEngine should be in the list, try and resume the engine:
>> >
>> >virsh # resume HostedEngine
>> >
>> >On Wed, 8 Apr 2020 at 17:28, Shareef Jalloq <shareef(a)jalloq.co.uk>
>> >wrote:
>> >
>> >> Thanks!
>> >>
>> >> The status hangs due to, I guess, the VM being down....
>> >>
>> >> [root@ovirt-node-01 ~]# hosted-engine --vm-start
>> >> VM exists and is down, cleaning up and restarting
>> >> VM in WaitForLaunch
>> >>
>> >> but this doesn't seem to do anything. OK, after a while I get a status of
>> >> it being barfed...
>> >>
>> >> --== Host ovirt-node-00.phoelex.com (id: 1) status ==--
>> >>
>> >> conf_on_shared_storage             : True
>> >> Status up-to-date                  : False
>> >> Hostname                           : ovirt-node-00.phoelex.com
>> >> Host ID                            : 1
>> >> Engine status                      : unknown stale-data
>> >> Score                              : 3400
>> >> stopped                            : False
>> >> Local maintenance                  : False
>> >> crc32                              : 9c4a034b
>> >> local_conf_timestamp               : 523362
>> >> Host timestamp                     : 523608
>> >> Extra metadata (valid at timestamp):
>> >>     metadata_parse_version=1
>> >>     metadata_feature_version=1
>> >>     timestamp=523608 (Wed Apr 8 16:17:11 2020)
>> >>     host-id=1
>> >>     score=3400
>> >>     vm_conf_refresh_time=523362 (Wed Apr 8 16:13:06 2020)
>> >>     conf_on_shared_storage=True
>> >>     maintenance=False
>> >>     state=EngineDown
>> >>     stopped=False
>> >>
>> >>
>> >> --== Host ovirt-node-01.phoelex.com (id: 2) status ==--
>> >>
>> >> conf_on_shared_storage             : True
>> >> Status up-to-date                  : True
>> >> Hostname                           : ovirt-node-01.phoelex.com
>> >> Host ID                            : 2
>> >> Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
>> >> Score                              : 0
>> >> stopped                            : False
>> >> Local maintenance                  : False
>> >> crc32                              : 5045f2eb
>> >> local_conf_timestamp               : 1737037
>> >> Host timestamp                     : 1737283
>> >> Extra metadata (valid at timestamp):
>> >>     metadata_parse_version=1
>> >>     metadata_feature_version=1
>> >>     timestamp=1737283 (Wed Apr 8 16:16:17 2020)
>> >>     host-id=2
>> >>     score=0
>> >>     vm_conf_refresh_time=1737037 (Wed Apr 8 16:12:11 2020)
>> >>     conf_on_shared_storage=True
>> >>     maintenance=False
>> >>     state=EngineUnexpectedlyDown
>> >>     stopped=False
>> >>
>> >> On Wed, Apr 8, 2020 at 5:09 PM Maton, Brett <matonb(a)ltresources.co.uk>
>> >> wrote:
>> >>
>> >>> First steps, on one of your hosts as root:
>> >>>
>> >>> To get information:
>> >>> hosted-engine --vm-status
>> >>>
>> >>> To start the engine:
>> >>> hosted-engine --vm-start
>> >>>
>> >>>
>> >>> On Wed, 8 Apr 2020 at 17:00, Shareef Jalloq <shareef(a)jalloq.co.uk> wrote:
>> >>>
>> >>>> So my engine has gone down and I can't ssh into it either. If I try to
>> >>>> log into the web-ui of the node it is running on, I get redirected because
>> >>>> the node can't reach the engine.
>> >>>>
>> >>>> What are my next steps?
>> >>>>
>> >>>> Shareef.
>> >>>
>>
>> This has to be resolved:
>>
>> Engine status : unknown stale-data
>>
>> Run again 'hosted-engine --vm-status'. If it remains the same, restart
>> ovirt-ha-broker.service & ovirt-ha-agent.service
>>
>> Verify that the engine's storage is available. Then monitor the broker
>> & agent logs in /var/log/ovirt-hosted-engine-ha
>>
>> Best Regards,
>> Strahil Nikolov
>>
>>
>>
>>
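
The 'stale-data' check suggested above can also be scripted. This is only a sketch: it assumes the 'hosted-engine --vm-status' text has the same layout as the output pasted earlier in this thread, and the function names and the trimmed sample are hypothetical.

```python
import re

# Header layout as it appears in the status output quoted in this thread.
HOST_HEADER = re.compile(r"^--== Host (?P<name>\S+) \(id: (?P<id>\d+)\) status ==--$")


def parse_vm_status(output):
    """Return {hostname: {field: value}} from `hosted-engine --vm-status` text."""
    hosts = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        header = HOST_HEADER.match(line)
        if header:
            current = hosts.setdefault(header.group("name"), {})
            continue
        # Only "key : value" rows are kept; metadata lines use "key=value".
        if current is None or " : " not in line:
            continue
        key, value = line.split(" : ", 1)
        current[key.strip()] = value.strip()
    return hosts


def stale_hosts(hosts):
    """Hosts whose status is not up to date or whose engine status is stale."""
    return sorted(
        name
        for name, fields in hosts.items()
        if fields.get("Status up-to-date") != "True"
        or "stale-data" in fields.get("Engine status", "")
    )


# Trimmed copy of the status output quoted earlier in this thread.
SAMPLE = """\
--== Host ovirt-node-00.phoelex.com (id: 1) status ==--

Status up-to-date                  : False
Engine status                      : unknown stale-data
Score                              : 3400

--== Host ovirt-node-01.phoelex.com (id: 2) status ==--

Status up-to-date                  : True
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
Score                              : 0
"""

if __name__ == "__main__":
    print(stale_hosts(parse_vm_status(SAMPLE)))
```

A host that is still listed after the broker and agent services have been restarted is probably the one whose storage connection needs checking first.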