4.4.7 gluster quorum problem

Hi guys, one strange thing is happening that I cannot understand. Today I installed the latest version, 4.4.7, on CentOS Stream, replica 3, via Cockpit, with an internal LAN for sync. Everything seems OK; if I reboot all 3 nodes together it is also OK. But if I reboot one node (and confirm the node was rebooted through the web UI), the bricks (engine and data) stay down on that node. The logs show nothing explicit about the situation except "Server quorum regained for volume data. Starting local bricks" in glusterd.log. After "systemctl restart glusterd" on that node, the bricks go down on another node; after "systemctl restart glusterd" on that second node, everything is OK again. Where should I look? Some log errors I found:

---------------------------------------------
broker.log:

statusStorageThread::ERROR::2021-07-12 22:17:02,899::storage_broker::223::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(put_stats) Failed to write metadata for ho$
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 215, in put_stats
    f = os.open(path, direct_flag | os.O_WRONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/run/vdsm/storage/53b068c1-beb8-4048-a766-3a4e71ded624/d3df7eb6-d453-439a-8436-d3694d4b5179/de18b2cc-a4e1-4afc-9b5a-6063$
StatusStorageThread::ERROR::2021-07-12 22:17:02,899::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 215, in put_stats
    f = os.open(path, direct_flag | os.O_WRONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/run/vdsm/storage/53b068c1-beb8-4048-a766-3a4e71ded624/d3df7eb6-d453-439a-8436-d3694d4b5179/de18b2cc-a4e1-4afc-9b5a-6063$
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 86, in run
    entry.data
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 225, in put_stats
    .format(str(e)))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: failed to write metadata: [Errno 2] No such file or directory: '/run/vdsm/storage/53b068c1-beb8-4048-a766-3a4e71ded624/d3df7e$
StatusStorageThread::ERROR::2021-07-12 22:17:02,899::storage_broker::223::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(put_stats) Failed to write metadata for ho$
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 215, in put_stats
    f = os.open(path, direct_flag | os.O_WRONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/run/vdsm/storage/53b068c1-beb8-4048-a766-3a4e71ded624/d3df7eb6-d453-439a-8436-d3694d4b5179/de18b2cc-a4e1-4afc-9b5a-6063$
StatusStorageThread::ERROR::2021-07-12 22:17:02,899::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 215, in put_stats
    f = os.open(path, direct_flag | os.O_WRONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/run/vdsm/storage/53b068c1-beb8-4048-a766-3a4e71ded624/d3df7eb6-d453-439a-8436-d3694d4b5179/de18b2cc-a4e1-4afc-9b5a-6063$
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 86, in run
    entry.data
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 225, in put_stats
    .format(str(e)))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: failed to write metadata: [Errno 2] No such file or directory: '/run/vdsm/storage/53b068c1-beb8-4048-a766-3a4e71ded624/d3df7e$
StatusStorageThread::ERROR::2021-07-12 22:17:02,902::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the $
StatusStorageThread::ERROR::2021-07-12 22:17:02,902::storage_broker::173::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata fro$
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/run/vdsm/storage/53b068c1-beb8-4048-a766-3a4e71ded624/d3df7eb6-d453-439a-8436-d3694d4b5179/de18b2cc-a4e1-4afc-9b5a-6063$
StatusStorageThread::ERROR::2021-07-12 22:17:02,902::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/run/vdsm/storage/53b068c1-beb8-4048-a766-3a4e71ded624/d3df7eb6-d453-439a-8436-d3694d4b5179/de18b2cc-a4e1-4afc-9b5a-6063$
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 175, in get_raw_stats
    .format(str(e)))

-------------------------------------------------------
supervdsm.log:

ainProcess|jsonrpc/0::DEBUG::2021-07-12 22:22:13,264::commands::211::root::(execCmd) /usr/bin/taskset --cpu-list 0-19 /usr/bin/systemd-run --scope --slice=vdsm-glusterfs /usr/b$
MainProcess|jsonrpc/0::DEBUG::2021-07-12 22:22:15,083::commands::224::root::(execCmd) FAILED: <err> = b'Running scope as unit: run-r91d6411af8114090aa28933d562fa473.scope\nMount$
MainProcess|jsonrpc/0::ERROR::2021-07-12 22:22:15,083::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) Error in mount
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_server.py", line 97, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_server.py", line 135, in mount
    cgroup=cgroup)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 280, in _mount
    _runcmd(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 308, in _runcmd
    raise MountError(cmd, rc, out, err)
vdsm.storage.mount.MountError: Command ['/usr/bin/systemd-run', '--scope', '--slice=vdsm-glusterfs', '/usr/bin/mount', '-t', 'glusterfs', '-o', 'backup-volfile-servers=cluster2.$
netlink/events::DEBUG::2021-07-12 22:22:15,131::concurrent::261::root::(run) FINISH thread <Thread(netlink/events, stopped daemon 139867781396224)>
MainProcess|jsonrpc/4::DEBUG::2021-07-12 22:22:15,134::supervdsm_server::102::SuperVdsm.ServerCallback::(wrapper) return network_caps with {'networks': {'ovirtmgmt': {'ports': [$

---------------------------------------------------------
vdsm.log:

2021-07-12 22:17:08,718+0200 INFO (jsonrpc/7) [api.host] FINISH getStats return={'status': {'code': 0, 'message': 'Done'}, 'info': (suppressed)} from=::1,34946 (api:54)
2021-07-12 22:17:09,491+0200 ERROR (monitor/53b068c) [storage.Monitor] Error checking domain 53b068c1-beb8-4048-a766-3a4e71ded624 (monitor:451)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 432, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 712, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 244, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 510, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in _sendCommand
    raise OSError(errcode, errstr)
FileNotFoundError: [Errno 2] No such file or directory
2021-07-12 22:17:09,619+0200 INFO (jsonrpc/0) [api.virt] START getStats() from=::1,34946, vmId=9167f682-3c82-4237-93bd-53f0ad32ffba (api:48)
2021-07-12 22:17:09,620+0200 INFO (jsonrpc/0) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': '9167f682-3c82-4237-93bd-53f0ad32ffba'} (api:129)
2021-07-12 22:17:09,620+0200 INFO (jsonrpc/0) [api.virt] FINISH getStats return={'status': {'code': 1, 'message': "Virtual machine does not exist: {'vmId': '9167f682-3c82-4237-$
2021-07-12 22:17:09,620+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 1) in 0.00 seconds (__init__:312)
2021-07-12 22:17:10,034+0200 INFO (jsonrpc/3) [vdsm.api] START repoStats(domains=['53b068c1-beb8-4048-a766-3a4e71ded624']) from=::1,34946, task_id=4e823c98-f95b-45f7-ad64-90f82$
2021-07-12 22:17:10,034+0200 INFO (jsonrpc/3) [vdsm.api] FINISH repoStats return={'53b068c1-beb8-4048-a766-3a4e71ded624': {'code': 2001, 'lastCheck': '0.5', 'delay': '0', 'vali$
2021-07-12 22:17:10,403+0200 INFO (health) [health] LVM cache hit ratio: 12.50% (hits: 1 misses: 7) (health:131)
2021-07-12 22:17:10,472+0200 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:74)
2021-07-12 22:17:10,472+0200 INFO (MainThread) [root] Stopping DHCP monitor. (dhcp_monitor:106)
2021-07-12 22:17:10,473+0200 INFO (ioprocess/11056) [IOProcessClient] (53b068c1-beb8-4048-a766-3a4e71ded624) Poll error 16 on fd 74 (__init__:176)
2021-07-12 22:17:10,473+0200 INFO (ioprocess/11056) [IOProcessClient] (53b068c1-beb8-4048-a766-3a4e71ded624) ioprocess was terminated by signal 15 (__init__:200)
2021-07-12 22:17:10,476+0200 INFO (ioprocess/19109) [IOProcessClient] (e10cbd59-d32e-4b69-a4c1-d213e7bd8973) Poll error 16 on fd 75 (__init__:176)
2021-07-12 22:17:10,476+0200 INFO (ioprocess/19109) [IOProcessClient] (e10cbd59-d32e-4b69-a4c1-d213e7bd8973) ioprocess was terminated by signal 15 (__init__:200)
2021-07-12 22:17:10,513+0200 INFO (ioprocess/44046) [IOProcess] (e10cbd59-d32e-4b69-a4c1-d213e7bd8973) Starting ioprocess (__init__:465)
2021-07-12 22:17:10,513+0200 INFO (ioprocess/44045) [IOProcess] (53b068c1-beb8-4048-a766-3a4e71ded624) Starting ioprocess (__init__:465)
2021-07-12 22:17:10,519+0200 WARN (periodic/0) [root] Failed to retrieve Hosted Engine HA info: timed out (api:198)
2021-07-12 22:17:10,611+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/cluster1.int:_data/e10cbd59-d32e-4b69-a4c1-d213e7bd8973/dom$
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 507, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/glusterSD/cluster1.int:_data/e10cbd59-d32e-4b69-a4c1-d213e7bd8973/dom_md/metada$
2021-07-12 22:17:10,860+0200 INFO (MainThread) [root] Stopping Bond monitor. (bond_monitor:53)

Thanks in advance, best regards.
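A minimal sketch of the commands normally used to inspect this state from the rebooted node (the volume names engine and data come from the post above; the rest is generic gluster CLI, not a prescribed procedure):

    # peer membership and brick/PID view for both volumes
    gluster peer status
    gluster volume status engine
    gluster volume status data

    # pending heals show whether the surviving bricks kept writing
    gluster volume heal engine info summary
    gluster volume heal data info summary

    # server-quorum options currently in effect
    gluster volume get data cluster.server-quorum-type
    gluster volume get data cluster.server-quorum-ratio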

After a reboot of node1, the bricks must always come up. Most probably VDO had to recover for a longer period, blocking the bricks from coming up in time. Investigate this issue before rebooting another host.

Best Regards,
Strahil Nikolov
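If VDO really is the cause, a rough way to check whether it was still recovering when glusterd tried to start the local bricks (assuming the standard VDO tooling is installed; the volume name vdo_gluster is only a placeholder):

    # VDO operating mode; anything other than "normal" means the bricks' backing store was not ready yet
    vdo status --name=vdo_gluster | grep -i 'operating mode'
    vdostats --human-readable

    # did glusterd actually try to start the local bricks after quorum returned?
    systemctl status glusterd
    grep -i 'Server quorum' /var/log/glusterfs/glusterd.log | tail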

Hi Strahil, I installed another replica 3 setup and got the same problem:

Broadcast message from systemd-journald@cluster1.dus (Tue 2021-07-20 16:56:24 CEST):
gluster_bricks-engine-engine[6051]: [2021-07-20 14:56:24.985428] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix: health-check failed, going down
Broadcast message from systemd-journald@cluster1.dus (Tue 2021-07-20 16:56:24 CEST):
gluster_bricks-engine-engine[6051]: [2021-07-20 14:56:24.985672] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix: still alive! -> SIGTERM
Message from syslogd@cluster1 at Jul 20 16:56:24 ...
gluster_bricks-engine-engine[6051]:[2021-07-20 14:56:24.985428] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix: health-check failed, going down
Message from syslogd@cluster1 at Jul 20 16:56:24 ...
gluster_bricks-engine-engine[6051]:[2021-07-20 14:56:24.985672] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix: still alive! -> SIGTERM

grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-20 14:56:24.985199] W [MSGID: 113075] [posix-helpers.c:2135:posix_fs_health_check] 0-engine-

It seems to be this one:
https://github.com/gluster/glusterfs/issues/1168
https://www.mail-archive.com/gluster-users@gluster.org/msg36948.html

I tried setting storage.health-check-timeout to 30, but nothing changed: all 3 bricks are red for both engine and data. The VMs started correctly, though, and Cockpit says all bricks are up and OK. I think the problem is in the monitoring service and in how the gluster status data is refreshed, but I do not know how oVirt controls it. Any suggestions? Thanks in advance, best regards
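For reference, the option discussed above is set per volume; this is only a sketch of the knob being tried (30 is the value tested here, 0 disables the timeout on the posix health check, as in the next message):

    # current value of the health-check timeout on both volumes
    gluster volume get engine storage.health-check-timeout
    gluster volume get data storage.health-check-timeout

    # raise it, or set 0 to stop the timeout from killing the brick
    gluster volume set engine storage.health-check-timeout 30
    gluster volume set data storage.health-check-timeout 0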

Hi Strahil, it seems that setting storage.health-check-timeout to 0 in the gluster volume config resolves this issue. I use Dell servers with a PERC 740 controller with battery-backed internal cache, and the disk cache is completely disabled. Maybe this configuration is the problem. Now I have another warning, a red exclamation point on one of the 3 hosts with the message "Unavailable due to HA score". I know what it means but do not know how to resolve it. Are these things related? Thanks again.

Most probably your engine failed on that host. What is the output of 'hosted-engine --vm-status'? If for some reason the engine died there and the system is now back OK, you can always put the hypervisor into maintenance mode (via hosted-engine) and, after a minute, bring it back online (again with hosted-engine).

Best Regards,
Strahil Nikolov
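The sequence Strahil describes, run on the affected host, would look roughly like this (standard hosted-engine CLI, nothing cluster-specific assumed):

    # check agent state and HA score on all hosted-engine hosts
    hosted-engine --vm-status

    # put this host into local maintenance, wait a bit, then bring it back
    hosted-engine --set-maintenance --mode=local
    sleep 60
    hosted-engine --set-maintenance --mode=none

    # the HA score recovers once the HA agent and broker report healthy again
    systemctl status ovirt-ha-agent ovirt-ha-broker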
participants (2)
- radchenko.anatoliy@gmail.com
- Strahil Nikolov