hosted-engine engine crash

Hi, we have a setup of 3 hosted-engine nodes using 2 NFS3 shares, on which the engine keeps crashing every few days. Looking at the VDSM logs, it looks like a storage problem, but I'm wondering why the hosts don't restart the engine?

Sorry, hit send by accident. More details:

When I notice that the engine is down, if I type hosted-engine --vm-status on any host, it hangs and then writes a bunch of stuff saying it's down. If I type hosted-engine --vm-start on any one of the hosts, it just starts and gets back to business.

hosted-engine --vm-status result:

ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '3d67cf89-92de-428d-9714-e02aceae281e'}: Connection timed out

Here are some logs from vdsm.log:

Thread-98649::WARNING::2016-07-13 22:54:04,418::fileSD::749::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/engine.domain.com:_var_lib_exports_iso
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 735, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 534, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 419, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-63::ERROR::2016-07-13 22:54:04,418::sdc::145::Storage.StorageDomainCache::(_findDomain) domain bd73cb0f-bb9c-432a-90ee-a32757a8bc10 not found
Thread-98498::ERROR::2016-07-13 22:50:33,895::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
Thread-98498::ERROR::2016-07-13 22:50:33,895::API::1871::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1851, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '3d67cf89-92de-428d-9714-e02aceae281e'}: Connection timed out

Thanks for your input, and even if it's a storage problem, if it's to happen, how can I force it to restart the engine? At first I thought it was a split-brain issue so I added a 3rd host, but I still have the same problem.

On Wed, Jul 13, 2016 at 11:13 PM, Mark Gagnon <rhubarbe@gmail.com> wrote:
Hi, we have a setup of 3 hosted-engine nodes using 2 NFS3 shares, on which the engine keeps crashing every few days.
Looking at the VDSM logs, it looks like a storage problem, but I'm wondering why the hosts don't restart the engine?
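
To see how often and when the storage timeouts occur, the vdsm.log entries can be filtered by level and timestamp. The following is a minimal sketch, assuming the "::"-separated layout seen in the log lines quoted in this thread (other VDSM versions may format lines differently). The names parse_line and problem_records are made up for this illustration and are not part of VDSM.

```python
import re
from datetime import datetime

# Pattern inferred only from the vdsm.log lines quoted in this thread:
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]+)\) (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for a log header line, or None for anything else
    (traceback lines, blank lines, unknown formats)."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    # The millisecond part after the comma is parsed by %f.
    record["time"] = datetime.strptime(record.pop("ts"), "%Y-%m-%d %H:%M:%S,%f")
    record["lineno"] = int(record["lineno"])
    return record


def problem_records(lines, levels=("WARNING", "ERROR")):
    """Yield parsed records whose level is in `levels`, in input order."""
    for line in lines:
        record = parse_line(line)
        if record is not None and record["level"] in levels:
            yield record
```

For example, iterating over problem_records(open("/var/log/vdsm/vdsm.log")) and printing record["time"] and record["msg"] gives a timeline that can be compared with the NFS servers' own logs. The log path here is the usual default and is an assumption about this installation.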

More details: the hosts and the engine are running CentOS 7 with oVirt 3.6. The NFS shares are on separate servers (not on the hosts).

Mark Gagnon <rhubarbe@gmail.com> writes:
even if it's a storage problem, if it's to happen, how can I force it to restart the engine?
Hi Mark, it indeed looks like a storage problem. Unfortunately, there's very little that can be done when storage is broken. I don't think there is any better option than to restart the Engine manually as you describe, once the storage is working again.
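
The manual restart described above could be wrapped in a small check script. This is only a sketch under stated assumptions, not an oVirt-provided mechanism: it uses the two commands mentioned in this thread (hosted-engine --vm-status and hosted-engine --vm-start), requires Python 3.5 or later for subprocess.run, and assumes that the status output contains the text '"vm": "up"' when the engine VM is running (the UP_MARKER value is an assumption and should be checked against real output). The function names are hypothetical.

```python
import subprocess

STATUS_CMD = ["hosted-engine", "--vm-status"]
START_CMD = ["hosted-engine", "--vm-start"]

# ASSUMPTION: substring expected in the status output when the engine VM
# is up. Verify against the real output of your version before relying on it.
UP_MARKER = '"vm": "up"'


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, output). ok is False on a non-zero exit
    code, on a timeout, or if the command cannot be executed at all."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out after %s seconds" % timeout_s
    except OSError as exc:
        return False, str(exc)
    return proc.returncode == 0, proc.stdout


def check_and_restart(run=run_with_timeout, start_on_status_failure=False,
                      status_timeout_s=60, start_timeout_s=120):
    """Query the engine status and issue a start if the VM is not up.

    Returns one of: "engine-up", "status-failed", "start-issued",
    "start-failed". When the status command itself fails (for example
    because storage still times out), nothing is started unless
    start_on_status_failure is True.
    """
    ok, output = run(STATUS_CMD, status_timeout_s)
    if ok and UP_MARKER in output:
        return "engine-up"
    if not ok and not start_on_status_failure:
        return "status-failed"
    started, _ = run(START_CMD, start_timeout_s)
    return "start-issued" if started else "start-failed"
```

Calling check_and_restart() periodically on one host would attempt a start only when the status query succeeds and shows no running engine VM. Starting while the status is unknown is left as an explicit opt-in, since the storage may still be unavailable at that point.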
Participants (2):
- Mark Gagnon
- Milan Zamazal