
Ok, I finally fixed my oVirt engine, and it's trying to run... but it turns out it's seeing 60-90% iowait, and it's about to be killed off by the HA agent because it can't get fully online in time. I did make the recommended changes about locks, and they don't seem to have helped any. The third server (with the bad brick) is powered off at this point, so I have two servers, each with replicas of all volumes, and the third server not even responding to pings... --Jim On Wed, May 30, 2018 at 1:53 AM, Jim Kusznir <jim@palousetech.com> wrote:
Ok, made it through this one finally. It turns out there's a little log file, sanlock.log, that had the magic info in it: the ...dom_md/ids file got damaged in the gluster crash. I had to recover that from a backup, then run hosted-engine --reinitialize-lockspace --force, and then it started working again. Now, for some reason, my hosted engine starts up but has no networking, and hosted-engine --console doesn't work either... Still looking into this now.
On Wed, May 30, 2018 at 1:29 AM, Jim Kusznir <jim@palousetech.com> wrote:
So, right now I think this message from vdsm.log captures my blocking problem:
2018-05-30 01:27:22,326-0700 INFO  (jsonrpc/5) [vdsm.api] FINISH repoStats return={u'c0acdefb-7d16-48ec-9d76-659b8fe33e2a': {'code': 0, 'actual': True, 'version': 4, 'acquired': False, 'delay': '0.00142076', 'lastCheck': '1.1', 'valid': True}} from=::1,60684, task_id=7da641b6-9a98-4db7-aafb-04ff84228cf5 (api:52)
2018-05-30 01:27:22,326-0700 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getStorageRepoStats succeeded in 0.00 seconds (__init__:573)
2018-05-30 01:27:23,411-0700 ERROR (vm/50392390) [virt.vm] (vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2872, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Failed to acquire lock: No space left on device
2018-05-30 01:27:23,411-0700 INFO  (vm/50392390) [virt.vm] (vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1683)
The strange thing is I can't figure out which device has no space left. The only hits on Google talk about deleting the __DIRECT_IO_TEST__ file in the storage repo, which I tried, but it did not help. Perhaps one of the files in the engine volume stores locks, and that file is full or corrupt?
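One possible reading of the error (an assumption, not confirmed by these logs): "No space left on device" is just the standard text for errno 28 (ENOSPC), which libvirt is relaying from sanlock's lease acquisition. sanlock reportedly surfaces ENOSPC when it cannot acquire a lease slot in the lockspace (e.g. a damaged ids file), so a full filesystem isn't necessarily involved. A minimal sketch showing the message is only the errno string:

```python
import errno
import os

# "No space left on device" is the standard strerror text for errno 28
# (ENOSPC). sanlock can reportedly return ENOSPC when lease acquisition
# fails in the lockspace, not only when a disk is actually full.
print(errno.ENOSPC)               # 28
print(os.strerror(errno.ENOSPC))  # No space left on device
```

That would be consistent with the later finding in this thread that the ...dom_md/ids file (the sanlock lockspace) was the real problem.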
--Jim
On Wed, May 30, 2018 at 12:25 AM, Jim Kusznir <jim@palousetech.com> wrote:
I tried to capture a consistent log slice across the cluster from agent.log. Several systems claim they are starting the engine, but virsh never shows any sign of it. They also report that a UUID, which I'm pretty sure is not associated with my engine, is not found.
--------- ovirt1: --------- MainThread::INFO::2018-05-30 00:18:11,628::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineForceStop-EngineDown) sent? sent [root@ovirt1 vdsm]# tail -n 20 ../ovirt-hosted-engine-ha/agent.log MainThread::INFO::2018-05-30 00:14:02,154::hosted_engine::4 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400) MainThread::INFO::2018-05-30 00:14:02,173::states::776::ovi rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Another host already took over.. MainThread::INFO::2018-05-30 00:14:02,174::state_decorators ::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineForceStop'> MainThread::INFO::2018-05-30 00:14:02,533::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineForceStop) sent? 
sent MainThread::INFO::2018-05-30 00:14:02,843::ovf_store::151:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929 MainThread::INFO::2018-05-30 00:18:09,236::hosted_engine::4 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineForceStop (score: 3400) MainThread::INFO::2018-05-30 00:18:09,236::hosted_engine::4 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ovirt3.nwfiber.com (id: 3, score: 3400) MainThread::INFO::2018-05-30 00:18:09,253::hosted_engine::1007:: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff` MainThread::INFO::2018-05-30 00:18:11,159::hosted_engine::1012:: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout: MainThread::INFO::2018-05-30 00:18:11,160::hosted_engine::1013:: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr: MainThread::ERROR::2018-05-30 00:18:11,160::hosted_engine::1021:: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost MainThread::INFO::2018-05-30 00:18:11,628::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineForceStop-EngineDown) sent? 
sent MainThread::INFO::2018-05-30 00:22:17,692::hosted_engine::4 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400) MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1 69::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False} MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host ovirt2.nwfiber.com (id 2): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4348 (Wed May 30 00:16:46 2018)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4349 (Wed May 30 00:16:47 2018)\nconf_on_shared_storage= True\nmaintenance=False\nstate=EngineDown\nstopped=False\n', 'hostname': 'ovirt2.nwfiber.com', 'alive': True, 'host-id': 2, 'engine-status': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'down_unexpected', 'detail': 'Down'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': '5ee0a557', 'local_conf_timestamp': 4349, 'host-ts': 4348} MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host ovirt3.nwfiber.com (id 3): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=100036 (Wed May 30 00:16:58 2018)\nhost-id=3\nscore=3400\nvm_conf_refresh_time=100037 (Wed May 30 00:16:58 2018)\nconf_on_shared_storage= True\nmaintenance=False\nstate=EngineStart\nstopped=False\n', 'hostname': 'ovirt3.nwfiber.com', 'alive': True, 'host-id': 3, 'engine-status': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'down_unexpected', 'detail': 'Down'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': 'e2fadf31', 'local_conf_timestamp': 100037, 'host-ts': 100036} MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1 
77::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'bridge': True, 'mem-free': 30186.0, 'maintenance': False, 'cpu-load': 0.0308, 'gateway': 1.0, 'storage-domain': True} MainThread::INFO::2018-05-30 00:22:17,714::states::508::ovi rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM MainThread::INFO::2018-05-30 00:22:18,309::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? sent MainThread::INFO::2018-05-30 00:22:18,624::ovf_store::151:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
---------- ovirt2 ---------- MainThread::INFO::2018-05-30 00:08:34,534::hosted_engine::4 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0) MainThread::INFO::2018-05-30 00:08:34,534::hosted_engine::4 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ovirt3.nwfiber.com (id: 3, score: 3400) MainThread::INFO::2018-05-30 00:08:34,545::states::678::ovi rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down, local host does not have best score MainThread::INFO::2018-05-30 00:12:40,627::states::693::ovi rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Wed May 30 00:04:28 2018 MainThread::INFO::2018-05-30 00:12:40,628::hosted_engine::4 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0) MainThread::INFO::2018-05-30 00:12:40,628::hosted_engine::4 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ovirt1.nwfiber.com (id: 1, score: 3400) MainThread::INFO::2018-05-30 00:12:40,643::states::678::ovi rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down, local host does not have best score MainThread::INFO::2018-05-30 00:12:40,827::ovf_store::151:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929 MainThread::INFO::2018-05-30 00:16:46,635::states::693::ovi rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Wed May 30 00:04:28 2018 MainThread::INFO::2018-05-30 00:16:46,636::hosted_engine::4 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 
0) MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1 69::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False} MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host ovirt1.nwfiber.com (id 1): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=7614 (Wed May 30 00:09:54 2018)\nhost-id=1\nscore=3400\nvm_conf_refresh_time=7614 (Wed May 30 00:09:54 2018)\nconf_on_shared_storage= True\nmaintenance=False\nstate=EngineStarting\nstopped=False\n', 'hostname': 'ovirt1.nwfiber.com', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': '1fa13e84', 'local_conf_timestamp': 7614, 'host-ts': 7614} MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host ovirt3.nwfiber.com (id 3): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=99541 (Wed May 30 00:08:43 2018)\nhost-id=3\nscore=3400\nvm_conf_refresh_time=99542 (Wed May 30 00:08:43 2018)\nconf_on_shared_storage= True\nmaintenance=False\nstate=EngineForceStop\nstopped=False\n', 'hostname': 'ovirt3.nwfiber.com', 'alive': True, 'host-id': 3, 'engine-status': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'down', 'detail': 'Down'}, 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': '56f87e2c', 'local_conf_timestamp': 99542, 'host-ts': 99541} MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1 77::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'down_unexpected', 'detail': 'Down'}, 'bridge': True, 'mem-free': 30891.0, 'maintenance': 
False, 'cpu-load': 0.0186, 'gateway': 1.0, 'storage-domain': True} MainThread::INFO::2018-05-30 00:16:46,919::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUnexpectedlyDown-EngineDown) sent? sent MainThread::INFO::2018-05-30 00:20:52,918::hosted_engine::4 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400) MainThread::INFO::2018-05-30 00:20:52,918::hosted_engine::4 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ovirt1.nwfiber.com (id: 1, score: 3400) MainThread::INFO::2018-05-30 00:20:52,930::states::508::ovi rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM MainThread::INFO::2018-05-30 00:20:53,203::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? sent MainThread::INFO::2018-05-30 00:20:53,393::ovf_store::151:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
-------- ovirt3 ------- MainThread::INFO::2018-05-30 00:21:07,343::hosted_engine::9 29::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start` MainThread::INFO::2018-05-30 00:21:08,107::hosted_engine::9 35::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9 36::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: Command VM.getStats with args {'vmID': '50392390-4f89-435c-bf3b-8254b58a4ef7'} failed: (code=1, message=Virtual machine does not exist: {'vmId': u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9 48::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost MainThread::INFO::2018-05-30 00:21:08,413::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent [root@ovirt3 images]# tail -n 20 /var/log/ovirt-hosted-engine-h a/agent.log MainThread::INFO::2018-05-30 00:12:51,792::hosted_engine::1012:: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout: MainThread::INFO::2018-05-30 00:12:51,792::hosted_engine::1013:: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr: MainThread::ERROR::2018-05-30 00:12:51,793::hosted_engine::1021:: ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost MainThread::INFO::2018-05-30 00:12:52,157::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineForceStop-EngineDown) sent? sent MainThread::INFO::2018-05-30 00:16:58,237::hosted_engine::4 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400) MainThread::INFO::2018-05-30 00:16:58,256::states::508::ovi rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM MainThread::INFO::2018-05-30 00:16:58,595::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? 
sent MainThread::INFO::2018-05-30 00:16:58,816::ovf_store::151:: ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929 MainThread::INFO::2018-05-30 00:21:05,082::hosted_engine::4 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStart (score: 3400) MainThread::INFO::2018-05-30 00:21:05,083::hosted_engine::4 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ovirt1.nwfiber.com (id: 1, score: 3400) MainThread::INFO::2018-05-30 00:21:05,125::hosted_engine::9 68::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Ensuring VDSM state is clear for engine VM MainThread::INFO::2018-05-30 00:21:05,141::hosted_engine::9 88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Cleaning state for non-running VM MainThread::INFO::2018-05-30 00:21:07,343::hosted_engine::9 80::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Vdsm state for VM clean MainThread::INFO::2018-05-30 00:21:07,343::hosted_engine::9 29::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start` MainThread::INFO::2018-05-30 00:21:08,107::hosted_engine::9 35::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9 36::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: Command VM.getStats with args {'vmID': '50392390-4f89-435c-bf3b-8254b58a4ef7'} failed: (code=1, message=Virtual machine does not exist: {'vmId': u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9 48::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost MainThread::INFO::2018-05-30 00:21:08,413::brokerlink::68:: ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? sent
---------- Note that virsh does not show the engine started on this machine...
On Wed, May 30, 2018 at 12:16 AM, Jim Kusznir <jim@palousetech.com> wrote:
So, on ovirt2, I ran hosted-engine --vm-status and it took over 2 minutes to return, but two of the three machines had current (non-stale) data:
[root@ovirt2 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : ovirt1.nwfiber.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : f562054b
local_conf_timestamp               : 7368
Host timestamp                     : 7367
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=7367 (Wed May 30 00:05:47 2018)
        host-id=1
        score=3400
        vm_conf_refresh_time=7368 (Wed May 30 00:05:48 2018)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStart
        stopped=False
--== Host 2 status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.nwfiber.com
Host ID                            : 2
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 6822f86a
local_conf_timestamp               : 3610
Host timestamp                     : 3610
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3610 (Wed May 30 00:04:28 2018)
        host-id=2
        score=0
        vm_conf_refresh_time=3610 (Wed May 30 00:04:28 2018)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Wed Dec 31 17:09:30 1969
--== Host 3 status ==--
conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt3.nwfiber.com
Host ID                            : 3
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 006512a3
local_conf_timestamp               : 99294
Host timestamp                     : 99294
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=99294 (Wed May 30 00:04:36 2018)
        host-id=3
        score=3400
        vm_conf_refresh_time=99294 (Wed May 30 00:04:36 2018)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStarting
        stopped=False
---------------
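The choice of where to try starting the engine follows from the metadata above: a host is only a reasonable candidate if its status data is up to date and its score is the best available. A simplified sketch of that selection (not the HA agent's actual code; field values copied from the --vm-status output above):

```python
# Simplified candidate selection: ignore hosts with stale data or a zero
# score, then take the highest-scoring remainder.
hosts = [
    {"name": "ovirt1.nwfiber.com", "score": 3400, "up_to_date": False},
    {"name": "ovirt2.nwfiber.com", "score": 0,    "up_to_date": True},
    {"name": "ovirt3.nwfiber.com", "score": 3400, "up_to_date": True},
]

candidates = [h for h in hosts if h["up_to_date"] and h["score"] > 0]
best = max(candidates, key=lambda h: h["score"])
print(best["name"])  # ovirt3.nwfiber.com
```

With ovirt1 stale and ovirt2 at score 0, ovirt3 is the only sensible place to attempt a start, which matches what was tried below.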
I saw that the HA score was 0 on ovirt2 (where I was focusing my efforts), while ovirt3 had a good HA score, so I tried to launch the engine there:
[root@ovirt3 ff3d0185-d120-40ee-9792-ebbd5caaca8b]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': '50392390-4f89-435c-bf3b-8254b58a4ef7'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
ovirt1 was returning the "ha-agent or storage not online" message, but now it's hanging (probably for 2 minutes), so it may actually return status this time. I figure if it can't return status, there's no point in trying to start the VM there.
On Wed, May 30, 2018 at 12:08 AM, Jim Kusznir <jim@palousetech.com> wrote:
The agent.log was sent minutes ago in another message. Here's the gluster client log for the engine volume:
[2018-05-30 06:31:12.343709] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-engine-replicate-0: performing metadata selfheal on 3a430e50-48cb-463c-8657-aab5458ac155
[2018-05-30 06:31:12.348210] W [MSGID: 108015] [afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-engine-replicate-0: expunging file 117c5234-e6d6-4f03-809b-a4ef6a6fac4d/hosted-engine.metadata (38a277cd-113d-4ca8-8f48-fadc944d3272) on engine-client-1
[2018-05-30 06:31:12.360602] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: Completed metadata selfheal on 3a430e50-48cb-463c-8657-aab5458ac155. sources=[0] 2 sinks=1
[2018-05-30 06:31:12.379169] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: Completed entry selfheal on 117c5234-e6d6-4f03-809b-a4ef6a6fac4d. sources=[0] 2 sinks=1
[2018-05-30 06:32:54.756729] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-engine-replicate-0: performing metadata selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f
[2018-05-30 06:32:54.763175] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: Completed metadata selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f. sources=[0] 2 sinks=1
[2018-05-30 06:37:06.581538] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: Completed data selfheal on 06c557a9-b550-4e25-b190-804e50cea65b. sources=[0] 2 sinks=1
[2018-05-30 06:39:55.739302] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f. sources=[0] 2 sinks=1
[2018-05-30 06:44:04.911825] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f. sources=[0] 2 sinks=1
[2018-05-30 06:58:05.123854] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-engine-replicate-0: performing metadata selfheal on 03ca7a59-e8ee-4a70-bc68-47e972aeb801
[2018-05-30 06:58:05.131929] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: Completed metadata selfheal on 03ca7a59-e8ee-4a70-bc68-47e972aeb801. sources=[0] 2 sinks=1
[2018-05-30 07:01:42.583991] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f. sources=[0] 2 sinks=1
[2018-05-30 07:05:48.777982] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f. sources=[0] 2 sinks=1
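The heal log above is all informational selfheal completions (no errors), which looks like healing in progress after the outage rather than failures. A quick sketch (hypothetical helper) tallying completed heals per file gfid from such lines, to see which files are being healed repeatedly:

```python
import re
from collections import Counter

# Matches "Completed <kind> selfheal on <gfid>" in glusterfs client logs.
HEAL = re.compile(r"Completed (\w+) selfheal on ([0-9a-f-]{36})")

def heal_counts(lines):
    """Count completed selfheals per (kind, gfid)."""
    counts = Counter()
    for line in lines:
        m = HEAL.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

sample = [
    "[2018-05-30 06:39:55.739302] I [MSGID: 108026] "
    "[afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: "
    "Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f. "
    "sources=[0] 2 sinks=1",
    "[2018-05-30 06:44:04.911825] I [MSGID: 108026] "
    "[afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0: "
    "Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f. "
    "sources=[0] 2 sinks=1",
]
print(heal_counts(sample))
```

A gfid that keeps reappearing (like 2ecf0f33-... above) is a file being actively written while healing, e.g. the hosted-engine metadata or sanlock lease area.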
-------- broker log: --------
Thread-1::INFO::2018-05-30 00:07:13,241::ping::60::ping.Ping::(action) Successfully pinged 192.168.8.1
Listener::INFO::2018-05-30 00:07:14,344::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:07:19,356::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Thread-5::INFO::2018-05-30 00:07:22,081::engine_health::202::engine_health.EngineHealth::(_result_from_stats) VM not running on this host, status Down
Listener::INFO::2018-05-30 00:07:24,369::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:07:29,382::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:07:34,393::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:07:39,406::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:07:44,419::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:07:49,431::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:07:54,444::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:07:59,452::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Thread-2::INFO::2018-05-30 00:08:03,736::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Listener::INFO::2018-05-30 00:08:04,460::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:08:09,468::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Thread-4::INFO::2018-05-30 00:08:11,461::cpu_load_no_engine::116::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) VM stats do not contain cpu usage. VM might be down.
Thread-4::INFO::2018-05-30 00:08:11,461::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.0169, engine=0.0000, non-engine=0.0169
Listener::INFO::2018-05-30 00:08:14,481::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Listener::INFO::2018-05-30 00:08:19,490::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
Thread-5::INFO::2018-05-30 00:08:22,175::engine_health::202::engine_health.EngineHealth::(_result_from_stats) VM not running on this host, status Down
Thread-1::INFO::2018-05-30 00:08:22,436::ping::60::ping.Ping::(action) Successfully pinged 192.168.8.1
Listener::INFO::2018-05-30 00:08:24,503::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
On Wed, May 30, 2018 at 12:03 AM, Sahina Bose <sabose@redhat.com> wrote:
[Adding gluster-users back]
Nothing amiss with volume info and status. Can you check the agent.log and broker.log - will be under /var/log/ovirt-hosted-engine-ha/
Also the gluster client logs - under /var/log/glusterfs/rhev-data-center-mnt-glusterSD<volume>.log
On Wed, May 30, 2018 at 12:08 PM, Jim Kusznir <jim@palousetech.com> wrote:
> I believe the gluster data store for the engine is up and working > correctly. The rest are not mounted, as the engine hasn't managed to start > correctly yet. I did perform the copy-a-junk-file into the data store, > then check to ensure its there on another host, then delete that and see > that it disappeared on the first host; it passed that test. Here's the > info and status. (I have NOT performed the steps that Krutika and > Ravishankar suggested yet, as I don't have my data volumes working again > yet. > > [root@ovirt2 images]# gluster volume info > > Volume Name: data > Type: Replicate > Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick2/data > Brick2: ovirt2.nwfiber.com:/gluster/brick2/data > Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter) > Options Reconfigured: > diagnostics.count-fop-hits: on > diagnostics.latency-measurement: on > changelog.changelog: on > geo-replication.ignore-pid-check: on > geo-replication.indexing: on > server.allow-insecure: on > performance.readdir-ahead: on > performance.quick-read: off > performance.read-ahead: off > performance.io-cache: off > performance.stat-prefetch: off > cluster.eager-lock: enable > network.remote-dio: enable > cluster.quorum-type: auto > cluster.server-quorum-type: server > storage.owner-uid: 36 > storage.owner-gid: 36 > features.shard: on > features.shard-block-size: 512MB > performance.low-prio-threads: 32 > cluster.data-self-heal-algorithm: full > cluster.locking-scheme: granular > cluster.shd-wait-qlength: 10000 > cluster.shd-max-threads: 8 > network.ping-timeout: 30 > user.cifs: off > nfs.disable: on > performance.strict-o-direct: on > > Volume Name: data-hdd > Type: Replicate > Volume ID: d342a3ab-16f3-49f0-bbcf-f788be8ac5f1 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 3 = 3 > Transport-type: tcp > Bricks: > Brick1: 
172.172.1.11:/gluster/brick3/data-hdd > Brick2: 172.172.1.12:/gluster/brick3/data-hdd > Brick3: 172.172.1.13:/gluster/brick3/data-hdd > Options Reconfigured: > diagnostics.count-fop-hits: on > diagnostics.latency-measurement: on > network.ping-timeout: 30 > server.allow-insecure: on > storage.owner-gid: 36 > storage.owner-uid: 36 > user.cifs: off > features.shard: on > cluster.shd-wait-qlength: 10000 > cluster.shd-max-threads: 8 > cluster.locking-scheme: granular > cluster.data-self-heal-algorithm: full > cluster.server-quorum-type: server > cluster.quorum-type: auto > cluster.eager-lock: enable > network.remote-dio: enable > performance.low-prio-threads: 32 > performance.stat-prefetch: off > performance.io-cache: off > performance.read-ahead: off > performance.quick-read: off > changelog.changelog: on > geo-replication.ignore-pid-check: on > geo-replication.indexing: on > transport.address-family: inet > performance.readdir-ahead: on > > Volume Name: engine > Type: Replicate > Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine > Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine > Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter) > Options Reconfigured: > changelog.changelog: on > geo-replication.ignore-pid-check: on > geo-replication.indexing: on > performance.readdir-ahead: on > performance.quick-read: off > performance.read-ahead: off > performance.io-cache: off > performance.stat-prefetch: off > cluster.eager-lock: enable > network.remote-dio: off > cluster.quorum-type: auto > cluster.server-quorum-type: server > storage.owner-uid: 36 > storage.owner-gid: 36 > features.shard: on > features.shard-block-size: 512MB > performance.low-prio-threads: 32 > cluster.data-self-heal-algorithm: full > cluster.locking-scheme: granular > cluster.shd-wait-qlength: 10000 > cluster.shd-max-threads: 6 > 
> network.ping-timeout: 30
> user.cifs: off
> nfs.disable: on
> performance.strict-o-direct: on
>
> Volume Name: iso
> Type: Replicate
> Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso
> Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso
> Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter)
> Options Reconfigured:
> performance.strict-o-direct: on
> nfs.disable: on
> user.cifs: off
> network.ping-timeout: 30
> cluster.shd-max-threads: 6
> cluster.shd-wait-qlength: 10000
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> performance.low-prio-threads: 32
> features.shard-block-size: 512MB
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> network.remote-dio: off
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> performance.readdir-ahead: on
>
>
> [root@ovirt2 images]# gluster volume status
> Status of volume: data
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick ovirt1.nwfiber.com:/gluster/brick2/da
> ta                                          49152     0          Y       3226
> Brick ovirt2.nwfiber.com:/gluster/brick2/da
> ta                                          49152     0          Y       2967
> Brick ovirt3.nwfiber.com:/gluster/brick2/da
> ta                                          49152     0          Y       2554
> Self-heal Daemon on localhost               N/A       N/A        Y       4818
> Self-heal Daemon on ovirt3.nwfiber.com      N/A       N/A        Y       17853
> Self-heal Daemon on ovirt1.nwfiber.com      N/A       N/A        Y       4771
>
> Task Status of Volume data
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: data-hdd
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 172.172.1.11:/gluster/brick3/data-hdd 49153     0          Y       3232
> Brick 172.172.1.12:/gluster/brick3/data-hdd 49153     0          Y       2976
> Brick 172.172.1.13:/gluster/brick3/data-hdd N/A       N/A        N       N/A
> NFS Server on localhost                     N/A       N/A        N       N/A
> Self-heal Daemon on localhost               N/A       N/A        Y       4818
> NFS Server on ovirt3.nwfiber.com            N/A       N/A        N       N/A
> Self-heal Daemon on ovirt3.nwfiber.com      N/A       N/A        Y       17853
> NFS Server on ovirt1.nwfiber.com            N/A       N/A        N       N/A
> Self-heal Daemon on ovirt1.nwfiber.com      N/A       N/A        Y       4771
>
> Task Status of Volume data-hdd
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: engine
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick ovirt1.nwfiber.com:/gluster/brick1/en
> gine                                        49154     0          Y       3239
> Brick ovirt2.nwfiber.com:/gluster/brick1/en
> gine                                        49154     0          Y       2982
> Brick ovirt3.nwfiber.com:/gluster/brick1/en
> gine                                        49154     0          Y       2578
> Self-heal Daemon on localhost               N/A       N/A        Y       4818
> Self-heal Daemon on ovirt3.nwfiber.com      N/A       N/A        Y       17853
> Self-heal Daemon on ovirt1.nwfiber.com      N/A       N/A        Y       4771
>
> Task Status of Volume engine
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: iso
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick ovirt1.nwfiber.com:/gluster/brick4/is
> o                                           49155     0          Y       3247
> Brick ovirt2.nwfiber.com:/gluster/brick4/is
> o                                           49155     0          Y       2990
> Brick ovirt3.nwfiber.com:/gluster/brick4/is
> o                                           49155     0          Y       2580
> Self-heal Daemon on localhost               N/A       N/A        Y       4818
> Self-heal Daemon on ovirt3.nwfiber.com      N/A       N/A        Y       17853
> Self-heal Daemon on ovirt1.nwfiber.com      N/A       N/A        Y       4771
>
> Task Status of Volume iso
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> On Tue, May 29, 2018 at 11:30 PM, Sahina Bose <sabose@redhat.com> wrote:
>
>> On Wed, May 30, 2018 at 10:42 AM, Jim Kusznir <jim@palousetech.com>
>> wrote:
>>
>>> hosted-engine --deploy failed (would not come up on my existing
>>> gluster storage). However, I realized no changes were written to my
>>> existing storage. So, I went back to trying to get my old engine
>>> running.
>>>
>>> hosted-engine --vm-status is now taking a very long time (5+ minutes)
>>> to return, and it returns stale information everywhere. I thought
>>> perhaps the lockspace was corrupt, so I tried to clean that and the
>>> metadata, but both are failing (--clean-metadata has hung and I can't
>>> even ctrl-c out of it).
>>>
>>> How can I reinitialize all the lockspace/metadata safely? There is no
>>> engine or VMs running currently....
>>
>> I think the first thing to make sure is that your storage is up and
>> running. So can you mount the gluster volumes and access the contents
>> there?
>> Please provide the gluster volume info and gluster volume status of
>> the volumes that you're using.
>>
>>> --Jim
>>>
>>> On Tue, May 29, 2018 at 9:33 PM, Jim Kusznir <jim@palousetech.com>
>>> wrote:
>>>
>>>> Well, things went from bad to very, very bad....
>>>>
>>>> It appears that during one of the 2-minute lockups, the fencing
>>>> agents decided that another node in the cluster was down. As a
>>>> result, 2 of the 3 nodes were simultaneously reset by the fencing
>>>> agent. After the nodes came back up, the engine would not start.
>>>> All running VMs (including VMs on the 3rd node, which was not
>>>> rebooted) crashed.
>>>>
>>>> I've now been working for about 3 hours trying to get the engine to
>>>> come up. I don't know why it won't start. hosted-engine --vm-start
>>>> says it's starting, but it doesn't start (virsh doesn't show any VMs
>>>> running).
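[Editor's note: the `gluster volume status` output earlier in this thread flags the failed data-hdd brick with "N" in the Online column. That column can be scanned mechanically; below is a minimal sketch. The sample rows are abridged from the data-hdd status above, and it assumes one row per brick, whereas real gluster output wraps long brick paths across two lines, so wrapped lines would need to be re-joined first.]

```shell
# Flag bricks whose "Online" column is "N" in `gluster volume status` output.
# Sample rows abridged from the data-hdd status in this thread.
status_sample='Brick 172.172.1.11:/gluster/brick3/data-hdd 49153 0 Y 3232
Brick 172.172.1.12:/gluster/brick3/data-hdd 49153 0 Y 2976
Brick 172.172.1.13:/gluster/brick3/data-hdd N/A N/A N N/A'

# Fields: "Brick" name tcp-port rdma-port online pid -> online flag is $(NF-1)
offline_bricks=$(printf '%s\n' "$status_sample" |
    awk '$1 == "Brick" && $(NF-1) == "N" { print $2 }')

echo "Offline bricks: ${offline_bricks:-none}"
# prints: Offline bricks: 172.172.1.13:/gluster/brick3/data-hdd
```

[On a live host the sample would be replaced by piping `gluster volume status` itself through the same filter.]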
>>>> I'm currently running --deploy, as I had run out of options for
>>>> anything else I could come up with. I hope this will allow me to
>>>> re-import all my existing VMs and allow me to start them back up
>>>> after everything comes back up.
>>>>
>>>> I do have an unverified geo-rep backup; I don't know if it is a good
>>>> backup (there were several prior messages to this list, but I didn't
>>>> get replies to my questions. It was running in what I believed to be
>>>> a "strange" way, and the data directories are larger than their
>>>> source).
>>>>
>>>> I'll see if my --deploy works, and if not, I'll be back with another
>>>> message/help request.
>>>>
>>>> When the dust settles and I'm at least minimally functional again, I
>>>> really want to understand why all these technologies designed to
>>>> offer redundancy conspired to reduce uptime and create failures
>>>> where there weren't any otherwise. I thought that hosted engine, 3
>>>> ovirt servers, and glusterfs with minimum replica 2+arb or replica 3
>>>> should have offered strong resilience against server failure or disk
>>>> failure, and should have prevented / recovered from data corruption.
>>>> Instead, all of the above happened (once I get my cluster back up, I
>>>> still have to try to recover my webserver VM, which won't boot due
>>>> to corrupt XFS journal issues created during the gluster crashes). I
>>>> think a lot of these issues were rooted in the upgrade from 4.1 to
>>>> 4.2.
>>>>
>>>> --Jim
>>>>
>>>> On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir <jim@palousetech.com>
>>>> wrote:
>>>>
>>>>> I also finally found the following in my system log on one server:
>>>>>
>>>>> [10679.524491] INFO: task glusterclogro:14933 blocked for more than 120 seconds.
>>>>> [10679.525826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> [10679.527144] glusterclogro D ffff97209832bf40 0 14933 >>>>> 1 0x00000080 >>>>> [10679.527150] Call Trace: >>>>> [10679.527161] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [10679.527218] [<ffffffffc060e388>] >>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs] >>>>> [10679.527225] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [10679.527254] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 >>>>> [xfs] >>>>> [10679.527260] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [10679.527268] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [10679.527271] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20 >>>>> [10679.527275] [<ffffffffb992082f>] >>>>> system_call_fastpath+0x1c/0x21 >>>>> [10679.527279] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [10679.527283] INFO: task glusterposixfsy:14941 blocked for more >>>>> than 120 seconds. >>>>> [10679.528608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [10679.529956] glusterposixfsy D ffff972495f84f10 0 14941 >>>>> 1 0x00000080 >>>>> [10679.529961] Call Trace: >>>>> [10679.529966] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [10679.530003] [<ffffffffc060e388>] >>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs] >>>>> [10679.530008] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [10679.530038] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 >>>>> [xfs] >>>>> [10679.530042] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [10679.530046] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [10679.530050] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20 >>>>> [10679.530054] [<ffffffffb992082f>] >>>>> system_call_fastpath+0x1c/0x21 >>>>> [10679.530058] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [10679.530062] INFO: task glusteriotwr13:15486 blocked for more >>>>> than 120 seconds. >>>>> [10679.531805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. 
>>>>> [10679.533732] glusteriotwr13 D ffff9720a83f0000 0 15486 >>>>> 1 0x00000080 >>>>> [10679.533738] Call Trace: >>>>> [10679.533747] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [10679.533799] [<ffffffffc060e388>] >>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs] >>>>> [10679.533806] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [10679.533846] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 >>>>> [xfs] >>>>> [10679.533852] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [10679.533858] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [10679.533863] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20 >>>>> [10679.533868] [<ffffffffb992082f>] >>>>> system_call_fastpath+0x1c/0x21 >>>>> [10679.533873] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [10919.512757] INFO: task glusterclogro:14933 blocked for more >>>>> than 120 seconds. >>>>> [10919.514714] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [10919.516663] glusterclogro D ffff97209832bf40 0 14933 >>>>> 1 0x00000080 >>>>> [10919.516677] Call Trace: >>>>> [10919.516690] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [10919.516696] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>>> [10919.516703] [<ffffffffb951cc04>] ? blk_finish_plug+0x14/0x40 >>>>> [10919.516768] [<ffffffffc05e9224>] ? >>>>> _xfs_buf_ioapply+0x334/0x460 [xfs] >>>>> [10919.516774] [<ffffffffb991432d>] >>>>> wait_for_completion+0xfd/0x140 >>>>> [10919.516782] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [10919.516821] [<ffffffffc05eb0a3>] ? _xfs_buf_read+0x23/0x40 >>>>> [xfs] >>>>> [10919.516859] [<ffffffffc05eafa9>] >>>>> xfs_buf_submit_wait+0xf9/0x1d0 [xfs] >>>>> [10919.516902] [<ffffffffc061b279>] ? 
>>>>> xfs_trans_read_buf_map+0x199/0x400 [xfs] >>>>> [10919.516940] [<ffffffffc05eb0a3>] _xfs_buf_read+0x23/0x40 >>>>> [xfs] >>>>> [10919.516977] [<ffffffffc05eb1b9>] xfs_buf_read_map+0xf9/0x160 >>>>> [xfs] >>>>> [10919.517022] [<ffffffffc061b279>] >>>>> xfs_trans_read_buf_map+0x199/0x400 [xfs] >>>>> [10919.517057] [<ffffffffc05c8d04>] xfs_da_read_buf+0xd4/0x100 >>>>> [xfs] >>>>> [10919.517091] [<ffffffffc05c8d53>] xfs_da3_node_read+0x23/0xd0 >>>>> [xfs] >>>>> [10919.517126] [<ffffffffc05c9fee>] >>>>> xfs_da3_node_lookup_int+0x6e/0x2f0 [xfs] >>>>> [10919.517160] [<ffffffffc05d5a1d>] >>>>> xfs_dir2_node_lookup+0x4d/0x170 [xfs] >>>>> [10919.517194] [<ffffffffc05ccf5d>] xfs_dir_lookup+0x1bd/0x1e0 >>>>> [xfs] >>>>> [10919.517233] [<ffffffffc05fd8d9>] xfs_lookup+0x69/0x140 [xfs] >>>>> [10919.517271] [<ffffffffc05fa018>] xfs_vn_lookup+0x78/0xc0 >>>>> [xfs] >>>>> [10919.517278] [<ffffffffb9425cf3>] lookup_real+0x23/0x60 >>>>> [10919.517283] [<ffffffffb9426702>] __lookup_hash+0x42/0x60 >>>>> [10919.517288] [<ffffffffb942d519>] SYSC_renameat2+0x3a9/0x5a0 >>>>> [10919.517296] [<ffffffffb94d3753>] ? >>>>> selinux_file_free_security+0x23/0x30 >>>>> [10919.517304] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [10919.517309] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [10919.517313] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [10919.517318] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [10919.517323] [<ffffffffb942e58e>] SyS_renameat2+0xe/0x10 >>>>> [10919.517328] [<ffffffffb942e5ce>] SyS_rename+0x1e/0x20 >>>>> [10919.517333] [<ffffffffb992082f>] >>>>> system_call_fastpath+0x1c/0x21 >>>>> [10919.517339] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [11159.496095] INFO: task glusteriotwr9:15482 blocked for more >>>>> than 120 seconds. >>>>> [11159.497546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. 
>>>>> [11159.498978] glusteriotwr9 D ffff971fa0fa1fa0 0 15482 >>>>> 1 0x00000080 >>>>> [11159.498984] Call Trace: >>>>> [11159.498995] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50 >>>>> [11159.498999] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11159.499003] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>>> [11159.499056] [<ffffffffc05dd9b7>] ? >>>>> xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs] >>>>> [11159.499082] [<ffffffffc05dd43e>] ? >>>>> xfs_iext_bno_to_irec+0x8e/0xd0 [xfs] >>>>> [11159.499090] [<ffffffffb92f7a12>] ? ktime_get_ts64+0x52/0xf0 >>>>> [11159.499093] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50 >>>>> [11159.499097] [<ffffffffb991348d>] >>>>> io_schedule_timeout+0xad/0x130 >>>>> [11159.499101] [<ffffffffb9913528>] io_schedule+0x18/0x20 >>>>> [11159.499104] [<ffffffffb9911f11>] bit_wait_io+0x11/0x50 >>>>> [11159.499107] [<ffffffffb9911ac1>] __wait_on_bit_lock+0x61/0xc0 >>>>> [11159.499113] [<ffffffffb9393634>] __lock_page+0x74/0x90 >>>>> [11159.499118] [<ffffffffb92bc210>] ? >>>>> wake_bit_function+0x40/0x40 >>>>> [11159.499121] [<ffffffffb9394154>] __find_lock_page+0x54/0x70 >>>>> [11159.499125] [<ffffffffb9394e85>] >>>>> grab_cache_page_write_begin+0x55/0xc0 >>>>> [11159.499130] [<ffffffffb9484b76>] iomap_write_begin+0x66/0x100 >>>>> [11159.499135] [<ffffffffb9484edf>] iomap_write_actor+0xcf/0x1d0 >>>>> [11159.499140] [<ffffffffb9484e10>] ? iomap_write_end+0x80/0x80 >>>>> [11159.499144] [<ffffffffb94854e7>] iomap_apply+0xb7/0x150 >>>>> [11159.499149] [<ffffffffb9485621>] >>>>> iomap_file_buffered_write+0xa1/0xe0 >>>>> [11159.499153] [<ffffffffb9484e10>] ? 
iomap_write_end+0x80/0x80 >>>>> [11159.499182] [<ffffffffc05f025d>] >>>>> xfs_file_buffered_aio_write+0x12d/0x2c0 [xfs] >>>>> [11159.499213] [<ffffffffc05f057d>] >>>>> xfs_file_aio_write+0x18d/0x1b0 [xfs] >>>>> [11159.499217] [<ffffffffb941a533>] do_sync_write+0x93/0xe0 >>>>> [11159.499222] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0 >>>>> [11159.499225] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0 >>>>> [11159.499230] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [11159.499234] [<ffffffffb992082f>] >>>>> system_call_fastpath+0x1c/0x21 >>>>> [11159.499238] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [11279.488720] INFO: task xfsaild/dm-10:1134 blocked for more >>>>> than 120 seconds. >>>>> [11279.490197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11279.491665] xfsaild/dm-10 D ffff9720a8660fd0 0 1134 >>>>> 2 0x00000000 >>>>> [11279.491671] Call Trace: >>>>> [11279.491682] [<ffffffffb92a3a2e>] ? >>>>> try_to_del_timer_sync+0x5e/0x90 >>>>> [11279.491688] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.491744] [<ffffffffc060de36>] _xfs_log_force+0x1c6/0x2c0 >>>>> [xfs] >>>>> [11279.491750] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [11279.491783] [<ffffffffc0619fec>] ? xfsaild+0x16c/0x6f0 [xfs] >>>>> [11279.491817] [<ffffffffc060df5c>] xfs_log_force+0x2c/0x70 >>>>> [xfs] >>>>> [11279.491849] [<ffffffffc0619e80>] ? >>>>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs] >>>>> [11279.491880] [<ffffffffc0619fec>] xfsaild+0x16c/0x6f0 [xfs] >>>>> [11279.491913] [<ffffffffc0619e80>] ? >>>>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs] >>>>> [11279.491919] [<ffffffffb92bb161>] kthread+0xd1/0xe0 >>>>> [11279.491926] [<ffffffffb92bb090>] ? >>>>> insert_kthread_work+0x40/0x40 >>>>> [11279.491932] [<ffffffffb9920677>] >>>>> ret_from_fork_nospec_begin+0x21/0x21 >>>>> [11279.491936] [<ffffffffb92bb090>] ? 
>>>>> insert_kthread_work+0x40/0x40 >>>>> [11279.491976] INFO: task glusterclogfsyn:14934 blocked for more >>>>> than 120 seconds. >>>>> [11279.493466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11279.494952] glusterclogfsyn D ffff97209832af70 0 14934 >>>>> 1 0x00000080 >>>>> [11279.494957] Call Trace: >>>>> [11279.494979] [<ffffffffc0309839>] ? >>>>> __split_and_process_bio+0x2e9/0x520 [dm_mod] >>>>> [11279.494983] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.494987] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>>> [11279.494997] [<ffffffffc0309d98>] ? >>>>> dm_make_request+0x128/0x1a0 [dm_mod] >>>>> [11279.495001] [<ffffffffb991348d>] >>>>> io_schedule_timeout+0xad/0x130 >>>>> [11279.495005] [<ffffffffb99145ad>] >>>>> wait_for_completion_io+0xfd/0x140 >>>>> [11279.495010] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [11279.495016] [<ffffffffb951e574>] >>>>> blkdev_issue_flush+0xb4/0x110 >>>>> [11279.495049] [<ffffffffc06064b9>] >>>>> xfs_blkdev_issue_flush+0x19/0x20 [xfs] >>>>> [11279.495079] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 >>>>> [xfs] >>>>> [11279.495086] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [11279.495090] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [11279.495094] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20 >>>>> [11279.495098] [<ffffffffb992082f>] >>>>> system_call_fastpath+0x1c/0x21 >>>>> [11279.495102] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [11279.495105] INFO: task glusterposixfsy:14941 blocked for more >>>>> than 120 seconds. >>>>> [11279.496606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11279.498114] glusterposixfsy D ffff972495f84f10 0 14941 >>>>> 1 0x00000080 >>>>> [11279.498118] Call Trace: >>>>> [11279.498134] [<ffffffffc0309839>] ? 
>>>>> __split_and_process_bio+0x2e9/0x520 [dm_mod] >>>>> [11279.498138] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.498142] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>>> [11279.498152] [<ffffffffc0309d98>] ? >>>>> dm_make_request+0x128/0x1a0 [dm_mod] >>>>> [11279.498156] [<ffffffffb991348d>] >>>>> io_schedule_timeout+0xad/0x130 >>>>> [11279.498160] [<ffffffffb99145ad>] >>>>> wait_for_completion_io+0xfd/0x140 >>>>> [11279.498165] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [11279.498169] [<ffffffffb951e574>] >>>>> blkdev_issue_flush+0xb4/0x110 >>>>> [11279.498202] [<ffffffffc06064b9>] >>>>> xfs_blkdev_issue_flush+0x19/0x20 [xfs] >>>>> [11279.498231] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 >>>>> [xfs] >>>>> [11279.498238] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [11279.498242] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [11279.498246] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20 >>>>> [11279.498250] [<ffffffffb992082f>] >>>>> system_call_fastpath+0x1c/0x21 >>>>> [11279.498254] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [11279.498257] INFO: task glusteriotwr1:14950 blocked for more >>>>> than 120 seconds. >>>>> [11279.499789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11279.501343] glusteriotwr1 D ffff97208b6daf70 0 14950 >>>>> 1 0x00000080 >>>>> [11279.501348] Call Trace: >>>>> [11279.501353] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.501390] [<ffffffffc060e388>] >>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs] >>>>> [11279.501396] [<ffffffffb92cf1b0>] ? 
wake_up_state+0x20/0x20 >>>>> [11279.501428] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 >>>>> [xfs] >>>>> [11279.501432] [<ffffffffb944ef3f>] generic_write_sync+0x4f/0x70 >>>>> [11279.501461] [<ffffffffc05f0545>] >>>>> xfs_file_aio_write+0x155/0x1b0 [xfs] >>>>> [11279.501466] [<ffffffffb941a533>] do_sync_write+0x93/0xe0 >>>>> [11279.501471] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0 >>>>> [11279.501475] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0 >>>>> [11279.501479] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [11279.501483] [<ffffffffb992082f>] >>>>> system_call_fastpath+0x1c/0x21 >>>>> [11279.501489] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [11279.501493] INFO: task glusteriotwr4:14953 blocked for more >>>>> than 120 seconds. >>>>> [11279.503047] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11279.504630] glusteriotwr4 D ffff972499f2bf40 0 14953 >>>>> 1 0x00000080 >>>>> [11279.504635] Call Trace: >>>>> [11279.504640] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.504676] [<ffffffffc060e388>] >>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs] >>>>> [11279.504681] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [11279.504710] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 >>>>> [xfs] >>>>> [11279.504714] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [11279.504718] [<ffffffffb992076f>] ? >>>>> system_call_after_swapgs+0xbc/0x160 >>>>> [11279.504722] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20 >>>>> [11279.504725] [<ffffffffb992082f>] >>>>> system_call_fastpath+0x1c/0x21 >>>>> [11279.504730] [<ffffffffb992077b>] ? >>>>> system_call_after_swapgs+0xc8/0x160 >>>>> [12127.466494] perf: interrupt took too long (8263 > 8150), >>>>> lowering kernel.perf_event_max_sample_rate to 24000 >>>>> >>>>> -------------------- >>>>> I think this is the cause of the massive ovirt performance >>>>> issues irrespective of gluster volume. 
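[Editor's note: hung-task warnings like the ones above usually appear alongside sustained iowait, and iowait can be quantified directly from two `/proc/stat` snapshots. A minimal, self-contained sketch follows; the two `cpu` lines are invented numbers for illustration, whereas on a real host you would read the first line of /proc/stat twice with a sleep in between.]

```shell
# Estimate iowait% between two /proc/stat "cpu" samples.
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
s1='cpu 1000 0 500 8000 500 0 0 0 0 0'   # made-up first sample
s2='cpu 1100 0 550 8200 1150 0 0 0 0 0'  # made-up second sample

iowait_pct=$(printf '%s\n%s\n' "$s1" "$s2" | awk '
    NR == 1 { for (i = 2; i <= NF; i++) t0 += $i; w0 = $6 }
    NR == 2 { for (i = 2; i <= NF; i++) t1 += $i; w1 = $6
              printf "%.0f", 100 * (w1 - w0) / (t1 - t0) }')

echo "iowait: ${iowait_pct}%"
# prints: iowait: 65%
```

[Sustained values like the 60-90% iowait mentioned at the top of this thread mean the CPUs are mostly parked waiting on storage rather than doing work.]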
At the time this happened, I was >>>>> also ssh'ed into the host, and was doing some rpm query commands. I had >>>>> just run rpm -qa |grep glusterfs (to verify what version was actually >>>>> installed), and that command took almost 2 minutes to return! Normally it >>>>> takes less than 2 seconds. That is all pure local SSD IO, too.... >>>>> >>>>> I'm no expert, but it's my understanding that any time software >>>>> causes these kinds of issues, it's a serious bug in the software, even if >>>>> it's mishandled exceptions. Is this correct? >>>>> >>>>> --Jim >>>>> >>>>> On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir < >>>>> jim@palousetech.com> wrote: >>>>> >>>>>> I think this is the profile information for one of the volumes >>>>>> that lives on the SSDs and is fully operational with no down/problem disks: >>>>>> >>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data info >>>>>> Brick: ovirt2.nwfiber.com:/gluster/brick2/data >>>>>> ---------------------------------------------- >>>>>> Cumulative Stats: >>>>>> Block Size: 256b+ 512b+ >>>>>> 1024b+ >>>>>> No. of Reads: 983 2696 >>>>>> 1059 >>>>>> No. of Writes: 0 1113 >>>>>> 302 >>>>>> >>>>>> Block Size: 2048b+ 4096b+ >>>>>> 8192b+ >>>>>> No. of Reads: 852 88608 >>>>>> 53526 >>>>>> No. of Writes: 522 812340 >>>>>> 76257 >>>>>> >>>>>> Block Size: 16384b+ 32768b+ >>>>>> 65536b+ >>>>>> No. of Reads: 54351 241901 >>>>>> 15024 >>>>>> No. of Writes: 21636 8656 >>>>>> 8976 >>>>>> >>>>>> Block Size: 131072b+ >>>>>> No. of Reads: 524156 >>>>>> No. of Writes: 296071 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. 
of >>>>>> calls Fop >>>>>> --------- ----------- ----------- ----------- >>>>>> ------------ ---- >>>>>> 0.00 0.00 us 0.00 us 0.00 us >>>>>> 4189 RELEASE >>>>>> 0.00 0.00 us 0.00 us 0.00 us >>>>>> 1257 RELEASEDIR >>>>>> 0.00 46.19 us 12.00 us 187.00 us >>>>>> 69 FLUSH >>>>>> 0.00 147.00 us 78.00 us 367.00 us >>>>>> 86 REMOVEXATTR >>>>>> 0.00 223.46 us 24.00 us 1166.00 us >>>>>> 149 READDIR >>>>>> 0.00 565.34 us 76.00 us 3639.00 us >>>>>> 88 FTRUNCATE >>>>>> 0.00 263.28 us 20.00 us 28385.00 us >>>>>> 228 LK >>>>>> 0.00 98.84 us 2.00 us 880.00 us >>>>>> 1198 OPENDIR >>>>>> 0.00 91.59 us 26.00 us 10371.00 us >>>>>> 3853 STATFS >>>>>> 0.00 494.14 us 17.00 us 193439.00 us >>>>>> 1171 GETXATTR >>>>>> 0.00 299.42 us 35.00 us 9799.00 us >>>>>> 2044 READDIRP >>>>>> 0.00 1965.31 us 110.00 us 382258.00 us >>>>>> 321 XATTROP >>>>>> 0.01 113.40 us 24.00 us 61061.00 us >>>>>> 8134 STAT >>>>>> 0.01 755.38 us 57.00 us 607603.00 us >>>>>> 3196 DISCARD >>>>>> 0.05 2690.09 us 58.00 us 2704761.00 us >>>>>> 3206 OPEN >>>>>> 0.10 119978.25 us 97.00 us 9406684.00 us >>>>>> 154 SETATTR >>>>>> 0.18 101.73 us 28.00 us 700477.00 us >>>>>> 313379 FSTAT >>>>>> 0.23 1059.84 us 25.00 us 2716124.00 us >>>>>> 38255 LOOKUP >>>>>> 0.47 1024.11 us 54.00 us 6197164.00 us >>>>>> 81455 FXATTROP >>>>>> 1.72 2984.00 us 15.00 us 37098954.00 us >>>>>> 103020 FINODELK >>>>>> 5.92 44315.32 us 51.00 us 24731536.00 us >>>>>> 23957 FSYNC >>>>>> 13.27 2399.78 us 25.00 us 22089540.00 us >>>>>> 991005 READ >>>>>> 37.00 5980.43 us 52.00 us 22099889.00 us >>>>>> 1108976 WRITE >>>>>> 41.04 5452.75 us 13.00 us 22102452.00 us >>>>>> 1349053 INODELK >>>>>> >>>>>> Duration: 10026 seconds >>>>>> Data Read: 80046027759 bytes >>>>>> Data Written: 44496632320 bytes >>>>>> >>>>>> Interval 1 Stats: >>>>>> Block Size: 256b+ 512b+ >>>>>> 1024b+ >>>>>> No. of Reads: 983 2696 >>>>>> 1059 >>>>>> No. of Writes: 0 838 >>>>>> 185 >>>>>> >>>>>> Block Size: 2048b+ 4096b+ >>>>>> 8192b+ >>>>>> No. 
of Reads: 852 85856 >>>>>> 51575 >>>>>> No. of Writes: 382 705802 >>>>>> 57812 >>>>>> >>>>>> Block Size: 16384b+ 32768b+ >>>>>> 65536b+ >>>>>> No. of Reads: 52673 232093 >>>>>> 14984 >>>>>> No. of Writes: 13499 4908 >>>>>> 4242 >>>>>> >>>>>> Block Size: 131072b+ >>>>>> No. of Reads: 460040 >>>>>> No. of Writes: 6411 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. of >>>>>> calls Fop >>>>>> --------- ----------- ----------- ----------- >>>>>> ------------ ---- >>>>>> 0.00 0.00 us 0.00 us 0.00 us >>>>>> 2093 RELEASE >>>>>> 0.00 0.00 us 0.00 us 0.00 us >>>>>> 1093 RELEASEDIR >>>>>> 0.00 53.38 us 26.00 us 111.00 us >>>>>> 16 FLUSH >>>>>> 0.00 145.14 us 78.00 us 367.00 us >>>>>> 71 REMOVEXATTR >>>>>> 0.00 190.96 us 114.00 us 298.00 us >>>>>> 71 SETATTR >>>>>> 0.00 213.38 us 24.00 us 1145.00 us >>>>>> 90 READDIR >>>>>> 0.00 263.28 us 20.00 us 28385.00 us >>>>>> 228 LK >>>>>> 0.00 101.76 us 2.00 us 880.00 us >>>>>> 1093 OPENDIR >>>>>> 0.01 93.60 us 27.00 us 10371.00 us >>>>>> 3090 STATFS >>>>>> 0.02 537.47 us 17.00 us 193439.00 us >>>>>> 1038 GETXATTR >>>>>> 0.03 297.44 us 35.00 us 9799.00 us >>>>>> 1990 READDIRP >>>>>> 0.03 2357.28 us 110.00 us 382258.00 us >>>>>> 253 XATTROP >>>>>> 0.04 385.93 us 58.00 us 47593.00 us >>>>>> 2091 OPEN >>>>>> 0.04 114.86 us 24.00 us 61061.00 us >>>>>> 7715 STAT >>>>>> 0.06 444.59 us 57.00 us 333240.00 us >>>>>> 3053 DISCARD >>>>>> 0.42 316.24 us 25.00 us 290728.00 us >>>>>> 29823 LOOKUP >>>>>> 0.73 257.92 us 54.00 us 344812.00 us >>>>>> 63296 FXATTROP >>>>>> 1.37 98.30 us 28.00 us 67621.00 us >>>>>> 313172 FSTAT >>>>>> 1.58 2124.69 us 51.00 us 849200.00 us >>>>>> 16717 FSYNC >>>>>> 5.73 162.46 us 52.00 us 748492.00 us >>>>>> 794079 WRITE >>>>>> 7.19 2065.17 us 16.00 us 37098954.00 us >>>>>> 78381 FINODELK >>>>>> 36.44 886.32 us 25.00 us 2216436.00 us >>>>>> 925421 READ >>>>>> 46.30 1178.04 us 13.00 us 1700704.00 us >>>>>> 884635 INODELK >>>>>> >>>>>> Duration: 7485 seconds >>>>>> Data Read: 71250527215 bytes >>>>>> Data Written: 
5119903744 bytes >>>>>> >>>>>> Brick: ovirt3.nwfiber.com:/gluster/brick2/data >>>>>> ---------------------------------------------- >>>>>> Cumulative Stats: >>>>>> Block Size: 1b+ >>>>>> No. of Reads: 0 >>>>>> No. of Writes: 3264419 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. of >>>>>> calls Fop >>>>>> --------- ----------- ----------- ----------- >>>>>> ------------ ---- >>>>>> 0.00 0.00 us 0.00 us 0.00 us >>>>>> 90 FORGET >>>>>> 0.00 0.00 us 0.00 us 0.00 us >>>>>> 9462 RELEASE >>>>>> 0.00 0.00 us 0.00 us 0.00 us >>>>>> 4254 RELEASEDIR >>>>>> 0.00 50.52 us 13.00 us 190.00 us >>>>>> 71 FLUSH >>>>>> 0.00 186.97 us 87.00 us 713.00 us >>>>>> 86 REMOVEXATTR >>>>>> 0.00 79.32 us 33.00 us 189.00 us >>>>>> 228 LK >>>>>> 0.00 220.98 us 129.00 us 513.00 us >>>>>> 86 SETATTR >>>>>> 0.01 259.30 us 26.00 us 2632.00 us >>>>>> 137 READDIR >>>>>> 0.02 322.76 us 145.00 us 2125.00 us >>>>>> 321 XATTROP >>>>>> 0.03 109.55 us 2.00 us 1258.00 us >>>>>> 1193 OPENDIR >>>>>> 0.05 70.21 us 21.00 us 431.00 us >>>>>> 3196 DISCARD >>>>>> 0.05 169.26 us 21.00 us 2315.00 us >>>>>> 1545 GETXATTR >>>>>> 0.12 176.85 us 63.00 us 2844.00 us >>>>>> 3206 OPEN >>>>>> 0.61 303.49 us 90.00 us 3085.00 us >>>>>> 9633 FSTAT >>>>>> 2.44 305.66 us 28.00 us 3716.00 us >>>>>> 38230 LOOKUP >>>>>> 4.52 266.22 us 55.00 us 53424.00 us >>>>>> 81455 FXATTROP >>>>>> 6.96 1397.99 us 51.00 us 64822.00 us >>>>>> 23889 FSYNC >>>>>> 16.48 84.74 us 25.00 us 6917.00 us >>>>>> 932592 WRITE >>>>>> 30.16 106.90 us 13.00 us 3920189.00 us >>>>>> 1353046 INODELK >>>>>> 38.55 1794.52 us 14.00 us 16210553.00 us >>>>>> 103039 FINODELK >>>>>> >>>>>> Duration: 66562 seconds >>>>>> Data Read: 0 bytes >>>>>> Data Written: 3264419 bytes >>>>>> >>>>>> Interval 1 Stats: >>>>>> Block Size: 1b+ >>>>>> No. of Reads: 0 >>>>>> No. of Writes: 794080 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls        Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us           2093     RELEASE
      0.00       0.00 us       0.00 us       0.00 us           1093  RELEASEDIR
      0.00      70.31 us      26.00 us     125.00 us             16       FLUSH
      0.00     193.10 us     103.00 us     713.00 us             71 REMOVEXATTR
      0.01     227.32 us     133.00 us     513.00 us             71     SETATTR
      0.01      79.32 us      33.00 us     189.00 us            228          LK
      0.01     259.83 us      35.00 us    1138.00 us             89     READDIR
      0.03     318.26 us     145.00 us    2047.00 us            253     XATTROP
      0.04     112.67 us       3.00 us    1258.00 us           1093     OPENDIR
      0.06     167.98 us      23.00 us    1951.00 us           1014    GETXATTR
      0.08      70.97 us      22.00 us     431.00 us           3053     DISCARD
      0.13     183.78 us      66.00 us    2844.00 us           2091        OPEN
      1.01     303.82 us      90.00 us    3085.00 us           9610       FSTAT
      3.27     316.59 us      30.00 us    3716.00 us          29820      LOOKUP
      5.83     265.79 us      59.00 us   53424.00 us          63296    FXATTROP
      7.95    1373.89 us      51.00 us   64822.00 us          16717       FSYNC
     23.17     851.99 us      14.00 us 16210553.00 us         78555    FINODELK
     24.04      87.44 us      27.00 us    6917.00 us         794081       WRITE
     34.36     111.91 us      14.00 us  984871.00 us         886790     INODELK

    Duration: 7485 seconds
   Data Read: 0 bytes
Data Written: 794080 bytes

-----------------------
Here is the data from the volume that is backed by the SSHDs and has one
failed disk:

[root@ovirt2 yum.repos.d]# gluster volume profile data-hdd info
Brick: 172.172.1.12:/gluster/brick3/data-hdd
--------------------------------------------
Cumulative Stats:
   Block Size:       256b+        512b+       1024b+
 No. of Reads:        1702           86           16
No. of Writes:           0          767           71

   Block Size:      2048b+       4096b+       8192b+
 No. of Reads:          19        51841         2049
No. of Writes:          76        60668        35727

   Block Size:     16384b+      32768b+      65536b+
 No. of Reads:        1744          639         1088
No. of Writes:        8524         2410         1285

   Block Size:    131072b+
 No. of Reads:      771999
No. of Writes:       29584

 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls        Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us           2902     RELEASE
      0.00       0.00 us       0.00 us       0.00 us           1517  RELEASEDIR
      0.00     197.00 us     197.00 us     197.00 us              1   FTRUNCATE
      0.00      70.24 us      16.00 us     758.00 us             51       FLUSH
      0.00     143.93 us      82.00 us     305.00 us             57 REMOVEXATTR
      0.00     178.63 us     105.00 us     712.00 us             60     SETATTR
      0.00      67.30 us      19.00 us     572.00 us            555          LK
      0.00     322.80 us      23.00 us    4673.00 us            138     READDIR
      0.00     336.56 us     106.00 us   11994.00 us            237     XATTROP
      0.00      84.70 us      28.00 us    1071.00 us           3469      STATFS
      0.01     387.75 us       2.00 us  146017.00 us           1467     OPENDIR
      0.01     148.59 us      21.00 us   64374.00 us           4454        STAT
      0.02     783.02 us      16.00 us   93502.00 us           1902    GETXATTR
      0.03    1516.10 us      17.00 us  210690.00 us           1364     ENTRYLK
      0.03    2555.47 us     300.00 us  674454.00 us           1064    READDIRP
      0.07      85.74 us      19.00 us   68340.00 us          62849       FSTAT
      0.07    1978.12 us      59.00 us  202596.00 us           2729        OPEN
      0.22     708.57 us      15.00 us  394799.00 us          25447      LOOKUP
      5.94    2331.74 us      15.00 us 1099530.00 us         207534    FINODELK
      7.31    8311.75 us      58.00 us 1800216.00 us          71668    FXATTROP
     12.49    7735.19 us      51.00 us 3595513.00 us         131642       WRITE
     17.70     957.08 us      16.00 us 13700466.00 us       1508160     INODELK
     24.55    2546.43 us      26.00 us 5077347.00 us         786060        READ
     31.56   49699.15 us      47.00 us 3746331.00 us          51777       FSYNC

    Duration: 10101 seconds
   Data Read: 101562897361 bytes
Data Written: 4834450432 bytes

Interval 0 Stats:
   Block Size:       256b+        512b+       1024b+
 No. of Reads:        1702           86           16
No. of Writes:           0          767           71

   Block Size:      2048b+       4096b+       8192b+
 No. of Reads:          19        51841         2049
No. of Writes:          76        60668        35727

   Block Size:     16384b+      32768b+      65536b+
 No. of Reads:        1744          639         1088
No. of Writes:        8524         2410         1285

   Block Size:    131072b+
 No. of Reads:      771999
No. of Writes:       29584

 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls        Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us           2902     RELEASE
      0.00       0.00 us       0.00 us       0.00 us           1517  RELEASEDIR
      0.00     197.00 us     197.00 us     197.00 us              1   FTRUNCATE
      0.00      70.24 us      16.00 us     758.00 us             51       FLUSH
      0.00     143.93 us      82.00 us     305.00 us             57 REMOVEXATTR
      0.00     178.63 us     105.00 us     712.00 us             60     SETATTR
      0.00      67.30 us      19.00 us     572.00 us            555          LK
      0.00     322.80 us      23.00 us    4673.00 us            138     READDIR
      0.00     336.56 us     106.00 us   11994.00 us            237     XATTROP
      0.00      84.70 us      28.00 us    1071.00 us           3469      STATFS
      0.01     387.75 us       2.00 us  146017.00 us           1467     OPENDIR
      0.01     148.59 us      21.00 us   64374.00 us           4454        STAT
      0.02     783.02 us      16.00 us   93502.00 us           1902    GETXATTR
      0.03    1516.10 us      17.00 us  210690.00 us           1364     ENTRYLK
      0.03    2555.47 us     300.00 us  674454.00 us           1064    READDIRP
      0.07      85.73 us      19.00 us   68340.00 us          62849       FSTAT
      0.07    1978.12 us      59.00 us  202596.00 us           2729        OPEN
      0.22     708.57 us      15.00 us  394799.00 us          25447      LOOKUP
      5.94    2334.57 us      15.00 us 1099530.00 us         207534    FINODELK
      7.31    8311.49 us      58.00 us 1800216.00 us          71668    FXATTROP
     12.49    7735.32 us      51.00 us 3595513.00 us         131642       WRITE
     17.71     957.08 us      16.00 us 13700466.00 us       1508160     INODELK
     24.56    2546.42 us      26.00 us 5077347.00 us         786060        READ
     31.54   49651.63 us      47.00 us 3746331.00 us          51777       FSYNC

    Duration: 10101 seconds
   Data Read: 101562897361 bytes
Data Written: 4834450432 bytes


On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir <jim@palousetech.com> wrote:

Thank you for your response.

I have 4 gluster volumes. 3 are replica 2 + arbiter; the replica bricks are on ovirt1 and ovirt2, with the arbiter on ovirt3. The 4th volume is replica 3, with a brick on all three ovirt machines.

The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD (the same in all three machines). On ovirt3, the SSHD has reported hard IO failures, and that brick is offline. However, the other two replicas are fully operational (although they still show contents in the heal info command that won't go away, but that may be the case until I replace the failed disk).

What is bothering me is that ALL 4 gluster volumes are showing horrible performance issues. At this point, with the bad disk completely offlined, I would expect gluster to perform at normal speed, but that is definitely not the case.

I've also noticed that the performance hits seem to come in waves: things work acceptably (but slowly) for a while, then suddenly it's as if all disk IO on all volumes (including non-gluster local OS disk volumes for the hosts) pauses for about 30 seconds, then IO resumes again. During those times, I start getting "VM not responding" and "host not responding" notices, as well as the applications having major issues.
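[Editor's note: dumps like the ones above come from gluster's built-in profiler, which is driven entirely from the CLI. Below is a hedged sketch of capturing a sample and of a small filter that pulls out the worst fops by %-latency; the helper names (profile_sample, top_fops) and the sampling window are mine, not from the thread.]

```shell
# profile_sample: start the built-in profiler on a volume, let it collect
# for a while, dump the per-brick stats, then switch the profiler off.
profile_sample() {
  vol="$1"
  secs="${2:-60}"   # sampling window in seconds; adjust to taste
  gluster volume profile "$vol" start || return 1
  sleep "$secs"
  gluster volume profile "$vol" info
  gluster volume profile "$vol" stop
}

# top_fops: print the five worst file operations by %-latency. A fop data
# row starts with a decimal %-latency figure and ends with the fop name.
top_fops() {
  awk '$1 ~ /^[0-9]+\.[0-9][0-9]$/ && $NF ~ /^[A-Z]+$/ { print $1, $NF }' |
    sort -rn | head -5
}

# e.g.: profile_sample data-hdd 120 | top_fops
```

Run against the data-hdd output above, a filter like this would surface FSYNC, READ, and INODELK as the dominant latency contributors, which matches the tables.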
I've shut down most of my VMs and am down to just my essential core VMs (I shed about 75% of them). I am still experiencing the same issues.

Am I correct in believing that once the failed disk was taken offline, performance should return to normal?

On Tue, May 29, 2018 at 1:27 PM, Alex K <rightkicktech@gmail.com> wrote:

I would check the status of the disks and the accessibility of the mount points where your gluster volumes reside.

On Tue, May 29, 2018, 22:28 Jim Kusznir <jim@palousetech.com> wrote:

On one ovirt server, I'm now seeing these messages:
[56474.239725] blk_update_request: 63 callbacks suppressed
[56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
[56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
[56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
[56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
[56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
[56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
[56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
[56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
[56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
[56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048

On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <jim@palousetech.com> wrote:

I see this in the messages log on ovirt3 (my 3rd machine, the one upgraded to 4.2):

May 29 11:54:41 ovirt3 ovs-vsctl:
ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
(This appears a lot.)

In the ssh session on that host, I also found some sysv warnings about the backing disk for one of the gluster volumes (the straight replica 3 one). The glusterfs process for that disk on that machine went offline. It's my understanding that it should continue to work with the other two machines while I attempt to replace that disk, right? Attempted writes (touching an empty file) can take 15 seconds, though repeating the same write later is much faster.

Gluster generates a bunch of different log files; I don't know which ones you want, or from which machine(s).

How do I do "volume profiling"?

Thanks!

On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <sabose@redhat.com> wrote:

Do you see errors reported in the mount logs for the volume? If so, could you attach the logs? Any issues with your underlying disks? Can you also attach the output of volume profiling?

On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <jim@palousetech.com> wrote:

Ok, things have gotten MUCH worse this morning.
I'm getting random errors from VMs; right now, about a third of my VMs have been paused due to storage issues, and most of the remaining VMs are not performing well.

At this point, I am in full EMERGENCY mode, as my production services are now impacted, and I'm getting calls coming in with problems...

I'd greatly appreciate help... VMs are running VERY slowly (when they run at all), and they are steadily getting worse. I don't know why. I was seeing CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a time (while the VM became unresponsive, and any Linux VMs I was logged into were giving me the CPU-stuck messages from my original post). Is all this storage related?

I also have two different gluster volumes for VM storage, and only one had the issues, but now VMs on both are being affected at the same time and in the same way.

--Jim

On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <sabose@redhat.com> wrote:

[Adding gluster-users to look at the heal issue]

On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <jim@palousetech.com> wrote:

Hello:

I've been having some cluster and gluster performance issues lately. I also found that my cluster was out of date and was trying to apply updates (hoping to fix some of these), and discovered the ovirt 4.1 repos were taken completely offline. So, I was forced to begin an upgrade to 4.2.
According to the docs I found/read, I needed only to add the new repo, do a yum update, and reboot to be good on my hosts (plus the yum update and engine-setup on my hosted engine). Things seemed to work relatively well, except for a gluster sync issue that showed up.

My cluster is a 3-node hyperconverged cluster. I upgraded the hosted engine first, then node 3. When node 3 came back up, for some reason one of my gluster volumes would not sync. Here's sample output:

[root@ovirt3 ~]# gluster volume heal data-hdd info
Brick 172.172.1.11:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
Status: Connected
Number of entries: 8

Brick 172.172.1.12:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
Status: Connected
Number of entries: 8

Brick 172.172.1.13:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
Status: Connected
Number of entries: 8

---------
It's been in this state for a couple of days now, and bandwidth monitoring shows no appreciable data moving. I've repeatedly triggered a full heal from all three nodes in the cluster. It's always the same files that need healing.

When running gluster volume heal data-hdd statistics, I sometimes see different information, but always some number of "heal failed" entries.
It shows 0 for split brain.

I'm not quite sure what to do. I suspect it may be due to nodes 1 and 2 still being on the older ovirt/gluster release, but I'm afraid to upgrade and reboot them until I have a good gluster sync (I don't want to create a split-brain issue). How do I proceed with this?

Second issue: I've been experiencing VERY POOR performance on most of my VMs, to the tune that logging into a Windows 10 VM via remote desktop can take 5 minutes, and launching QuickBooks inside said VM can easily take 10 minutes. On some Linux VMs, I get random messages like this:

Message from syslogd@unifi at May 28 20:39:23 ...
kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mongod:14766]

(The process and PID are often different.)

I'm not quite sure what to do about this either. My initial thought was to upgrade everything to current and see if it's still there, but I cannot move forward with that until my gluster is healed...

Thanks!
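[Editor's note: since the same eight entries keep reappearing, one way to tell whether a re-triggered heal is making progress is to poll the entry counts over time. A hedged sketch follows; the helper name count_heal_entries is mine, while the heal commands are the standard gluster CLI.]

```shell
# count_heal_entries: sum the "Number of entries:" lines across all bricks
# in "gluster volume heal <vol> info" output.
count_heal_entries() {
  awk '/^Number of entries:/ { n += $NF } END { print n + 0 }'
}

# Re-trigger a full heal, then watch whether the total shrinks:
#   gluster volume heal data-hdd full
#   while sleep 30; do
#     gluster volume heal data-hdd info | count_heal_entries
#   done
```

For the heal-info output shown in this thread (three bricks, 8 entries each), the filter would report 24; a healthy heal should walk that number down to 0.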
--Jim

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOTIRQF/
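[Editor's note: following up on Alex's suggestion earlier in the thread (check disk status and the accessibility of the brick mount points), a minimal sketch of those checks. The helper name check_brick is mine; the brick path /gluster/brick3 is from this thread, and smartctl assumes smartmontools is installed.]

```shell
# check_brick: verify a brick directory is on a live, mounted filesystem.
check_brick() {
  mnt="$1"
  if mountpoint -q "$mnt"; then
    echo "$mnt: mounted"
  else
    echo "$mnt: NOT mounted"
  fi
}

# On each host, something like:
#   check_brick /gluster/brick3
#   smartctl -H /dev/sdX          # SMART health of the backing disk
#   dmesg | grep -i 'I/O error'   # kernel errors like the dm-2 ones above
```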