Hi,
I have the following two components:
1.) A freshly installed VM host (oVirt Node 4.4.0 release ISO)
2.) Three storage hosts, also freshly installed from the oVirt Node 4.4.0
release ISO
The storage hosts have been successfully installed with Gluster (through
Cockpit). They have two volumes, both of which I can mount and
read/write from a client.
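(For reference, the client-side check was nothing fancy, just a plain FUSE
mount roughly like the following; the mount point and test file are only
examples:)

  mount -t glusterfs storagehost1:/engine /mnt/gluster-test
  touch /mnt/gluster-test/testfile && rm /mnt/gluster-test/testfile
  umount /mnt/gluster-test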
On the VM host, I ran "hosted-engine --deploy" (no backups imported).
When prompted for storage, I answered "glusterfs" and specified
"storagehost1:/engine" as the storage for the HE deployment. For mount
options, I specified
"backup-volfile-servers=storagehost2:/engine:storagehost3:/engine".
(These are not the real hostnames, but all of them resolve via internal DNS.)
Everything seemed to work fine; I could also see the "engine" volume being
populated with data, and at some point I was able to ping the HE and log in
via SSH. When the setup proceeded to the health check, however, it failed
and the whole process was aborted :-(
"hosted-engine --vm-status" reported "failed liveliness check" when it
was reachable via SSH. At some point the engine went down and, to my
surprise, shows a grub prompt after the restart when doing a
"hosted-engine --console".
[ INFO ] TASK [ovirt.hosted_engine_setup : Check engine VM health]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 180, "changed": true,
"cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.160595",
"end": "2020-06-12 17:50:05.675774", "rc": 0, "start": "2020-06-12 17:50:05.515179",
"stderr": "", "stderr_lines": [],
"stdout": "{\"1\": {\"host-id\": 1, \"host-ts\": 11528, \"score\": 3400,
\"engine-status\": {\"vm\": \"up\", \"health\": \"bad\", \"detail\": \"Powering down\",
\"reason\": \"failed liveliness check\"}, \"hostname\": \"vmhost\",
\"maintenance\": false, \"stopped\": false, \"crc32\": \"2c447835\",
\"conf_on_shared_storage\": true, \"local_conf_timestamp\": 11528,
\"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\n
timestamp=11528 (Fri Jun 12 17:49:57 2020)\\nhost-id=1\\nscore=3400\\n
vm_conf_refresh_time=11528 (Fri Jun 12 17:49:57 2020)\\n
conf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStop\\n
stopped=False\\ntimeout=Thu Jan 1 04:12:48 1970\\n\",
\"live-data\": true}, \"global_maintenance\": false}",
"stdout_lines": ["{\"1\": {\"host-id\": 1, \"host-ts\": 11528, \"score\": 3400,
\"engine-status\": {\"vm\": \"up\", \"health\": \"bad\", \"detail\": \"Powering down\",
\"reason\": \"failed liveliness check\"}, \"hostname\": \"vmhost\",
\"maintenance\": false, \"stopped\": false, \"crc32\": \"2c447835\",
\"conf_on_shared_storage\": true, \"local_conf_timestamp\": 11528,
\"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\n
timestamp=11528 (Fri Jun 12 17:49:57 2020)\\nhost-id=1\\nscore=3400\\n
vm_conf_refresh_time=11528 (Fri Jun 12 17:49:57 2020)\\n
conf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStop\\n
stopped=False\\ntimeout=Thu Jan 1 04:12:48 1970\\n\",
\"live-data\": true}, \"global_maintenance\": false}"]}
A second attempt failed at exactly the same stage.
I can see the following in the setup log:
ovirt-hosted-engine-setup-20200612151212-j9zwd2.log:
2020-06-12 17:33:18,314+0200 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils
ansible_utils._process_output:103 {'msg': 'non-zero return code',
'cmd': ['hosted-engine', '--reinitialize-lockspace', '--force'], 'stdout': '',
'stderr': 'Traceback (most recent call last):\n
File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main\n "__main__", mod_spec)\n
File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code\n exec(code, run_globals)\n
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/reinitialize_lockspace.py", line 30, in <module>\n ha_cli.reset_lockspace(force)\n
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 286, in reset_lockspace\n stats = broker.get_stats_from_storage()\n
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 148, in get_stats_from_storage\n result = self._proxy.get_stats()\n
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__\n return self.__send(self.__name, args)\n
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request\n verbose=self.__verbose\n
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request\n return self.single_request(host, handler, request_body, verbose)\n
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request\n http_conn = self.send_request(host, handler, request_body, verbose)\n
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request\n self.send_content(connection, request_body)\n
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content\n connection.endheaders(request_body)\n
File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders\n self._send_output(message_body, encode_chunked=encode_chunked)\n
File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output\n self.send(msg)\n
File "/usr/lib64/python3.6/http/client.py", line 974, in send\n self.connect()\n
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect\n self.sock.connect(base64.b16decode(self.host))\n
FileNotFoundError: [Errno 2] No such file or directory',
'rc': 1, 'start': '2020-06-12 17:33:17.982995', 'end': '2020-06-12 17:33:18.164197',
'delta': '0:00:00.181202', 'changed': True,
'invocation': {'module_args': {'_raw_params': 'hosted-engine --reinitialize-lockspace --force',
'warn': True, '_uses_shell': False, 'stdin_add_newline': True, 'strip_empty_ends': True,
'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}},
'stdout_lines': [],
'stderr_lines': ['Traceback (most recent call last):',
' File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main', ' "__main__", mod_spec)',
' File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code', ' exec(code, run_globals)',
' File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/reinitialize_lockspace.py", line 30, in <module>', ' ha_cli.reset_lockspace(force)',
' File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 286, in reset_lockspace', ' stats = broker.get_stats_from_storage()',
' File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 148, in get_stats_from_storage', ' result = self._proxy.get_stats()',
' File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__', ' return self.__send(self.__name, args)',
' File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request', ' verbose=self.__verbose',
' File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request', ' return self.single_request(host, handler, request_body, verbose)',
' File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request', ' http_conn = self.send_request(host, handler, request_body, verbose)',
' File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request', ' self.send_content(connection, request_body)',
' File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content', ' connection.endheaders(request_body)',
' File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders', ' self._send_output(message_body, encode_chunked=encode_chunked)',
' File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output', ' self.send(msg)',
' File "/usr/lib64/python3.6/http/client.py", line 974, in send', ' self.connect()',
' File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect', ' self.sock.connect(base64.b16decode(self.host))',
'FileNotFoundError: [Errno 2] No such file or directory'],
'_ansible_no_log': False, 'attempts': 5}
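If I read the traceback correctly, the reinitialize-lockspace step fails
because the client cannot connect to the ovirt-ha-broker Unix socket
(self.host is a b16-encoded socket path), so I assume the broker was not
running or had not created its socket at that point. Something like the
following should show whether that is the case (the socket directory is
from memory and may differ on 4.4):

  systemctl status ovirt-ha-broker ovirt-ha-agent
  ls -l /var/run/ovirt-hosted-engine-ha/    # broker socket should live here, if I remember correctly
  journalctl -u ovirt-ha-broker --since "2020-06-12 17:30"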
Bonus question: I can see the following in the gluster volume log on the
storage host:
[2020-06-12 17:30:38.368773] E [MSGID: 113072]
[posix-inode-fd-ops.c:1976:posix_writev] 0-engine-posix: write failed:
offset 0, [Invalid argument]
[2020-06-12 17:30:38.368832] E [MSGID: 115067]
[server-rpc-fops_v2.c:1372:server4_writev_cbk] 0-engine-server: 190586:
WRITEV 4 (baf5fbaa-0dc0-4ee4-bfd9-fa98f8853827), client:
CTX_ID:be1aca7a-5ba7-453d-94fd-d24f8c33511e-GRAPH_ID:0-PID:121217-HOST:vmhost-PC_NAME:engine-client-0-RECON_NO:-0,
error-xlator: engine-posix [Invalid argument]
Entries like these appear every few minutes (not hundreds per minute).
Since Gluster is completely new to me, I'm still trying to find out
whether this might be related.
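In case the "Invalid argument" writes matter, this is how I would dump the
volume configuration for comparison; the option names in the grep are just
ones I have seen mentioned in oVirt/Gluster threads about O_DIRECT writes,
so treat them as a guess rather than a diagnosis:

  gluster volume info engine
  gluster volume get engine all | grep -iE 'remote-dio|strict-o-direct|shard'

I have not changed anything from what the Cockpit wizard configured.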
Thanks for any input!
Best regards
Oliver