hosted-engine engine crash - Users - oVirt List Archives

hosted-engine engine crash

Mark Gagnon

14 Jul 2016

2:13 a.m.

Hi, we have a setup with 3 hosted-engine nodes using 2 NFS3 shares, on which the engine keeps crashing every few days. Looking at the VDSM logs, it looks like a storage problem, but I'm wondering why they don't restart the engine.

Mark Gagnon

14 Jul

2:20 a.m.

Sorry, hit send by accident. More details:

When I notice that the engine is down, if I type hosted-engine --vm-status on any host, it hangs and then writes a bunch of output saying it's down. If I type hosted-engine --vm-start on one of the hosts (any of them), it just starts and gets back to business.

hosted-engine --vm-status result:

ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '3d67cf89-92de-428d-9714-e02aceae281e'}: Connection timed out

Here are some logs from vdsm.log:

Thread-98649::WARNING::2016-07-13 22:54:04,418::fileSD::749::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/engine.domain.com:_var_lib_exports_iso
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 735, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 534, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 419, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-63::ERROR::2016-07-13 22:54:04,418::sdc::145::Storage.StorageDomainCache::(_findDomain) domain bd73cb0f-bb9c-432a-90ee-a32757a8bc10 not found
Thread-98498::ERROR::2016-07-13 22:50:33,895::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
Thread-98498::ERROR::2016-07-13 22:50:33,895::API::1871::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1851, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '3d67cf89-92de-428d-9714-e02aceae281e'}: Connection timed out

Thanks for your input, and even if it's a storage problem, if it's to happen, how can I force it to restart the engine? At first I thought it was a split-brain issue, so I added a 3rd host, but I still have the same problem.

On Wed, Jul 13, 2016 at 11:13 PM, Mark Gagnon <rhubarbe@gmail.com> wrote:

Hi, we have a setup with 3 hosted-engine nodes using 2 NFS3 shares, on which the engine keeps crashing every few days.

Looking at the VDSM logs, it looks like a storage problem, but I'm wondering why they don't restart the engine.
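
To see how often and when the storage timeouts occur, the vdsm.log lines quoted in this message can be tallied with a small script. The following is a minimal sketch, not part of VDSM: the field layout (thread, level, timestamp, module, line number, logger, function, message, separated by `::`) is an assumption inferred only from the lines quoted here, and the names `LINE_RE`, `parse_line`, and `count_problems` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, inferred only from the log lines quoted in this thread:
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::"
    r"(?P<logger>[^:]+)::\((?P<function>[^)]*)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for a log record line, or None for any
    other line (for example traceback lines)."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like "2016-07-13 22:54:04,418" (milliseconds after the comma).
    record["time"] = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S,%f")
    return record


def count_problems(lines, levels=("WARNING", "ERROR")):
    """Count records per (level, logger) pair, ignoring unparsed lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None and record["level"] in levels:
            counts[(record["level"], record["logger"])] += 1
    return counts
```

Passing an open file object for vdsm.log to `count_problems` would give a per-logger count of warnings and errors, and the `time` field of each parsed record can be compared with the times at which the engine went down.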

Mark Gagnon

1:59 p.m.

More details: the hosts and the engine are running CentOS 7 with oVirt 3.6. The NFS shares are on separate servers (not on the hosts).
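
Since the NFS shares live on separate servers, one way to narrow down the timeouts is to check from each host whether a mount path answers within a bounded time. The sketch below is only an illustration under stated assumptions: `path_responds` is a hypothetical helper, not an oVirt or VDSM API, and it runs `os.stat` in a daemon thread so that a hung NFS mount cannot block the caller.

```python
import os
import threading


def path_responds(path, timeout=5.0):
    """Return True if os.stat(path) succeeds within `timeout` seconds.

    The stat call runs in a daemon thread: if the mount hangs, the thread
    stays blocked in the background, but this function still returns False
    once the timeout expires.
    """
    result = {}

    def worker():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            # Missing path, stale handle, permission error, and so on.
            result["ok"] = False

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    return result.get("ok", False)
```

For example, `path_responds("/rhev/data-center/mnt/engine.domain.com:_var_lib_exports_iso")` would probe the mount path that appears in the vdsm.log warning; the mount path of the hosted-engine storage domain itself is not shown in this thread and would have to be substituted.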

Milan Zamazal

15 Jul

10:03 a.m.

Mark Gagnon <rhubarbe@gmail.com> writes:

even if it's a storage problem, if it's to happen, how can I force it to restart the engine?

Hi Mark, it indeed looks like a storage problem. Unfortunately, there's very little that can be done when storage is broken. I don't think there is any better option than to restart the Engine manually as you describe, once the storage is working again.
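
The manual procedure described here (wait until storage works again, then run hosted-engine --vm-start on one host) could be scripted roughly as follows. This is a hedged sketch, not an oVirt-provided tool: `start_engine_when_storage_ok` is a hypothetical name, `probe` is any caller-supplied zero-argument function that returns True when storage responds, and treating a zero exit code from `hosted-engine --vm-start` as success is an assumption.

```python
import subprocess
import time


def start_engine_when_storage_ok(probe, run=subprocess.run, retries=10,
                                 delay=30.0, sleep=time.sleep):
    """Wait for storage, then start the engine VM once.

    probe:  hypothetical zero-argument callable, True when storage responds.
    run:    command runner, subprocess.run by default (injectable for tests).
    Returns True if the start command was issued and exited with code 0
    (assumed to mean success), False if storage never came back.
    """
    for attempt in range(retries):
        if probe():
            # The command itself is the one used manually in this thread.
            completed = run(["hosted-engine", "--vm-start"])
            return completed.returncode == 0
        if attempt < retries - 1:
            sleep(delay)
    return False
```

The script is meant to be launched by an operator on a single host after noticing that the engine is down, mirroring the manual restart; it deliberately does not try to detect the outage itself, because the output format of hosted-engine --vm-status is not shown in this thread.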
