OK, made it through this one finally. It turns out there's a little log
file, "sanlock.log", that had the magic info in it: the ...dom_md/ids file
got messed up in the gluster crash. I had to recover that file from a
backup and then run hosted-engine --reinitialize-lockspace --force, and
then it started working again. Now, for some reason, my hosted engine
starts up but has no networking, and hosted-engine --console doesn't work
either... Still looking into this now.
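
In case it helps anyone else who ends up here, the recovery amounted to
roughly the sketch below. The backup path and storage-domain UUID are
placeholders, and wrapping the repair in global maintenance is my own
precaution rather than anything from the docs:

  # keep the HA agents from fighting the repair
  hosted-engine --set-maintenance --mode=global
  systemctl stop ovirt-ha-agent ovirt-ha-broker

  # restore the damaged ids file from backup (paths are illustrative only)
  cp /backup/hosted-engine/ids \
     /rhev/data-center/mnt/glusterSD/<server>:_engine/<sd-uuid>/dom_md/ids

  # rebuild the sanlock lockspace, then bring everything back
  hosted-engine --reinitialize-lockspace --force
  systemctl start ovirt-ha-broker ovirt-ha-agent
  hosted-engine --set-maintenance --mode=none
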
On Wed, May 30, 2018 at 1:29 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
So, right now I think this message from vdsm.log captures my blocking problem:
2018-05-30 01:27:22,326-0700 INFO (jsonrpc/5) [vdsm.api] FINISH repoStats
return={u'c0acdefb-7d16-48ec-9d76-659b8fe33e2a': {'code': 0,
'actual':
True, 'version': 4, 'acquired': False, 'delay':
'0.00142076', 'lastCheck':
'1.1', 'valid': True}} from=::1,60684,
task_id=7da641b6-9a98-4db7-aafb-04ff84228cf5
(api:52)
2018-05-30 01:27:22,326-0700 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC
call Host.getStorageRepoStats succeeded in 0.00 seconds (__init__:573)
2018-05-30 01:27:23,411-0700 ERROR (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') The vm start process failed
(vm:943)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in
_startUnderlyingVm
self._run()
File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2872, in
_run
dom.createWithFlags(flags)
File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py",
line 130, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line
92, in wrapper
return func(inst, *args, **kwargs)
File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in
createWithFlags
if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed',
dom=self)
libvirtError: Failed to acquire lock: No space left on device
2018-05-30 01:27:23,411-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') Changed state to Down:
Failed to acquire lock: No space left on device (code=1) (vm:1683)
The strange thing is that I can't figure out which device has no space
left. The only hits on Google talk about deleting the __DIRECT_IO_TEST__
file in the storage repo; I tried that, but it did not help. Perhaps one of
the files in the engine volume stores the locks, and that file is full or
corrupt?
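
Since the error doesn't name a device, my plan is to check the obvious
places next (just the checks I can think of - my guess is that the "device"
here is sanlock's lockspace rather than a real filesystem):

  df -h                              # confirm no filesystem is actually out of space
  tail -n 50 /var/log/sanlock.log    # sanlock logs the underlying lockspace error here
  sanlock client status              # show which lockspaces/resources sanlock holds
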
--Jim
On Wed, May 30, 2018 at 12:25 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
> I tried to capture a consistent log slice across the cluster from
> agent.log. Several systems claim they are starting the engine, but virsh
> never shows any sign of it. They also report that a UUID (which I'm pretty
> sure is not affiliated with my engine) is not found.
>
> ---------
> ovirt1:
> ---------
> MainThread::INFO::2018-05-30 00:18:11,628::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineForceStop-EngineDown) sent? sent
> [root@ovirt1 vdsm]# tail -n 20 ../ovirt-hosted-engine-ha/agent.log
> MainThread::INFO::2018-05-30 00:14:02,154::hosted_engine::4
> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Current state EngineStarting (score: 3400)
> MainThread::INFO::2018-05-30 00:14:02,173::states::776::ovi
> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Another
> host already took over..
> MainThread::INFO::2018-05-30 00:14:02,174::state_decorators
> ::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
> Timeout cleared while transitioning <class
'ovirt_hosted_engine_ha.agent.states.EngineStarting'>
> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineForceStop'>
> MainThread::INFO::2018-05-30 00:14:02,533::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineStarting-EngineForceStop) sent?
> sent
> MainThread::INFO::2018-05-30 00:14:02,843::ovf_store::151::
> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-
> 809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
> MainThread::INFO::2018-05-30 00:18:09,236::hosted_engine::4
> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Current state EngineForceStop (score: 3400)
> MainThread::INFO::2018-05-30 00:18:09,236::hosted_engine::4
> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Best remote host
ovirt3.nwfiber.com (id: 3, score: 3400)
> MainThread::INFO::2018-05-30 00:18:09,253::hosted_engine::1007::
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
> Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff`
> MainThread::INFO::2018-05-30 00:18:11,159::hosted_engine::1012::
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
> stdout:
> MainThread::INFO::2018-05-30 00:18:11,160::hosted_engine::1013::
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
> stderr:
> MainThread::ERROR::2018-05-30 00:18:11,160::hosted_engine::1021::
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
> Engine VM stopped on localhost
> MainThread::INFO::2018-05-30 00:18:11,628::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineForceStop-EngineDown) sent? sent
> MainThread::INFO::2018-05-30 00:22:17,692::hosted_engine::4
> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Current state EngineDown (score: 3400)
> MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1
> 69::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Global metadata: {'maintenance': False}
> MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1
> 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host
ovirt2.nwfiber.com (id 2): {'conf_on_shared_storage': True,
> 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4348
> (Wed May 30 00:16:46 2018)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4349
> (Wed May 30 00:16:47 2018)\nconf_on_shared_storage=
> True\nmaintenance=False\nstate=EngineDown\nstopped=False\n', 'hostname':
> 'ovirt2.nwfiber.com', 'alive': True, 'host-id': 2,
'engine-status':
> {'reason': 'bad vm status', 'health': 'bad',
'vm': 'down_unexpected',
> 'detail': 'Down'}, 'score': 3400, 'stopped': False,
'maintenance': False,
> 'crc32': '5ee0a557', 'local_conf_timestamp': 4349,
'host-ts': 4348}
> MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1
> 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host
ovirt3.nwfiber.com (id 3): {'conf_on_shared_storage': True,
> 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=100036
> (Wed May 30 00:16:58 2018)\nhost-id=3\nscore=3400\nvm_conf_refresh_time=100037
> (Wed May 30 00:16:58 2018)\nconf_on_shared_storage=
> True\nmaintenance=False\nstate=EngineStart\nstopped=False\n',
> 'hostname': 'ovirt3.nwfiber.com', 'alive': True,
'host-id': 3,
> 'engine-status': {'reason': 'bad vm status',
'health': 'bad', 'vm':
> 'down_unexpected', 'detail': 'Down'}, 'score': 3400,
'stopped': False,
> 'maintenance': False, 'crc32': 'e2fadf31',
'local_conf_timestamp': 100037,
> 'host-ts': 100036}
> MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1
> 77::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Local (id 1): {'engine-health': {'reason': 'vm not running on
this host',
> 'health': 'bad', 'vm': 'down', 'detail':
'unknown'}, 'bridge': True,
> 'mem-free': 30186.0, 'maintenance': False, 'cpu-load':
0.0308, 'gateway':
> 1.0, 'storage-domain': True}
> MainThread::INFO::2018-05-30 00:22:17,714::states::508::ovi
> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
> down and local host has best score (3400), attempting to start engine VM
> MainThread::INFO::2018-05-30 00:22:18,309::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineDown-EngineStart) sent? sent
> MainThread::INFO::2018-05-30 00:22:18,624::ovf_store::151::
> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-
> 809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
>
> ----------
> ovirt2
> ----------
> MainThread::INFO::2018-05-30 00:08:34,534::hosted_engine::4
> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Current state EngineUnexpectedlyDown (score: 0)
> MainThread::INFO::2018-05-30 00:08:34,534::hosted_engine::4
> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Best remote host
ovirt3.nwfiber.com (id: 3, score: 3400)
> MainThread::INFO::2018-05-30 00:08:34,545::states::678::ovi
> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
> down, local host does not have best score
> MainThread::INFO::2018-05-30 00:12:40,627::states::693::ovi
> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0
> due to unexpected vm shutdown at Wed May 30 00:04:28 2018
> MainThread::INFO::2018-05-30 00:12:40,628::hosted_engine::4
> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Current state EngineUnexpectedlyDown (score: 0)
> MainThread::INFO::2018-05-30 00:12:40,628::hosted_engine::4
> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Best remote host
ovirt1.nwfiber.com (id: 1, score: 3400)
> MainThread::INFO::2018-05-30 00:12:40,643::states::678::ovi
> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
> down, local host does not have best score
> MainThread::INFO::2018-05-30 00:12:40,827::ovf_store::151::
> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-
> 809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
> MainThread::INFO::2018-05-30 00:16:46,635::states::693::ovi
> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0
> due to unexpected vm shutdown at Wed May 30 00:04:28 2018
> MainThread::INFO::2018-05-30 00:16:46,636::hosted_engine::4
> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Current state EngineUnexpectedlyDown (score: 0)
> MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1
> 69::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Global metadata: {'maintenance': False}
> MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1
> 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host
ovirt1.nwfiber.com (id 1): {'conf_on_shared_storage': True,
> 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=7614
> (Wed May 30 00:09:54 2018)\nhost-id=1\nscore=3400\nvm_conf_refresh_time=7614
> (Wed May 30 00:09:54 2018)\nconf_on_shared_storage=
> True\nmaintenance=False\nstate=EngineStarting\nstopped=False\n',
> 'hostname': 'ovirt1.nwfiber.com', 'alive': True,
'host-id': 1,
> 'engine-status': {'reason': 'vm not running on this host',
'health': 'bad',
> 'vm': 'down', 'detail': 'unknown'}, 'score':
3400, 'stopped': False,
> 'maintenance': False, 'crc32': '1fa13e84',
'local_conf_timestamp': 7614,
> 'host-ts': 7614}
> MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1
> 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Host
ovirt3.nwfiber.com (id 3): {'conf_on_shared_storage': True,
> 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=99541
> (Wed May 30 00:08:43 2018)\nhost-id=3\nscore=3400\nvm_conf_refresh_time=99542
> (Wed May 30 00:08:43 2018)\nconf_on_shared_storage=
> True\nmaintenance=False\nstate=EngineForceStop\nstopped=False\n',
> 'hostname': 'ovirt3.nwfiber.com', 'alive': True,
'host-id': 3,
> 'engine-status': {'reason': 'bad vm status',
'health': 'bad', 'vm': 'down',
> 'detail': 'Down'}, 'score': 3400, 'stopped': False,
'maintenance': False,
> 'crc32': '56f87e2c', 'local_conf_timestamp': 99542,
'host-ts': 99541}
> MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1
> 77::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Local (id 2): {'engine-health': {'reason': 'bad vm status',
'health':
> 'bad', 'vm': 'down_unexpected', 'detail':
'Down'}, 'bridge': True,
> 'mem-free': 30891.0, 'maintenance': False, 'cpu-load':
0.0186, 'gateway':
> 1.0, 'storage-domain': True}
> MainThread::INFO::2018-05-30 00:16:46,919::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineUnexpectedlyDown-EngineDown)
> sent? sent
> MainThread::INFO::2018-05-30 00:20:52,918::hosted_engine::4
> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Current state EngineDown (score: 3400)
> MainThread::INFO::2018-05-30 00:20:52,918::hosted_engine::4
> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Best remote host
ovirt1.nwfiber.com (id: 1, score: 3400)
> MainThread::INFO::2018-05-30 00:20:52,930::states::508::ovi
> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
> down and local host has best score (3400), attempting to start engine VM
> MainThread::INFO::2018-05-30 00:20:53,203::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineDown-EngineStart) sent? sent
> MainThread::INFO::2018-05-30 00:20:53,393::ovf_store::151::
> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-
> 809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
>
> --------
> ovirt3
> -------
> MainThread::INFO::2018-05-30 00:21:07,343::hosted_engine::9
> 29::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
> Starting vm using `/usr/sbin/hosted-engine --vm-start`
> MainThread::INFO::2018-05-30 00:21:08,107::hosted_engine::9
> 35::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
> stdout:
> MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9
> 36::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
> stderr: Command VM.getStats with args {'vmID':
> '50392390-4f89-435c-bf3b-8254b58a4ef7'} failed:
> (code=1, message=Virtual machine does not exist: {'vmId':
> u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
>
> MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9
> 48::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
> Engine VM started on localhost
> MainThread::INFO::2018-05-30 00:21:08,413::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineStart-EngineStarting) sent? sent
> [root@ovirt3 images]# tail -n 20 /var/log/ovirt-hosted-engine-ha/agent.log
> MainThread::INFO::2018-05-30 00:12:51,792::hosted_engine::1012::
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
> stdout:
> MainThread::INFO::2018-05-30 00:12:51,792::hosted_engine::1013::
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
> stderr:
> MainThread::ERROR::2018-05-30 00:12:51,793::hosted_engine::1021::
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
> Engine VM stopped on localhost
> MainThread::INFO::2018-05-30 00:12:52,157::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineForceStop-EngineDown) sent? sent
> MainThread::INFO::2018-05-30 00:16:58,237::hosted_engine::4
> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Current state EngineDown (score: 3400)
> MainThread::INFO::2018-05-30 00:16:58,256::states::508::ovi
> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
> down and local host has best score (3400), attempting to start engine VM
> MainThread::INFO::2018-05-30 00:16:58,595::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineDown-EngineStart) sent? sent
> MainThread::INFO::2018-05-30 00:16:58,816::ovf_store::151::
> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-
> 809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
> MainThread::INFO::2018-05-30 00:21:05,082::hosted_engine::4
> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Current state EngineStart (score: 3400)
> MainThread::INFO::2018-05-30 00:21:05,083::hosted_engine::4
> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
> Best remote host
ovirt1.nwfiber.com (id: 1, score: 3400)
> MainThread::INFO::2018-05-30 00:21:05,125::hosted_engine::9
> 68::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
> Ensuring VDSM state is clear for engine VM
> MainThread::INFO::2018-05-30 00:21:05,141::hosted_engine::9
> 88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
> Cleaning state for non-running VM
> MainThread::INFO::2018-05-30 00:21:07,343::hosted_engine::9
> 80::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
> Vdsm state for VM clean
> MainThread::INFO::2018-05-30 00:21:07,343::hosted_engine::9
> 29::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
> Starting vm using `/usr/sbin/hosted-engine --vm-start`
> MainThread::INFO::2018-05-30 00:21:08,107::hosted_engine::9
> 35::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
> stdout:
> MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9
> 36::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
> stderr: Command VM.getStats with args {'vmID':
> '50392390-4f89-435c-bf3b-8254b58a4ef7'} failed:
> (code=1, message=Virtual machine does not exist: {'vmId':
> u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
>
> MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9
> 48::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
> Engine VM started on localhost
> MainThread::INFO::2018-05-30 00:21:08,413::brokerlink::68::
> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
> notification of state_transition (EngineStart-EngineStarting) sent? sent
>
> ----------
> note that virsh does not show engine started on this machine....
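>
> (For reference, this is how I'm checking - virsh needs the read-only flag
> since vdsm owns the libvirt connection, and the vdsm-client call is my
> best guess at the equivalent query on 4.2:)
>
>   virsh -r list --all            # read-only libvirt view of defined/running VMs
>   vdsm-client Host getVMList     # ask vdsm directly which VMs it knows about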
>
> On Wed, May 30, 2018 at 12:16 AM, Jim Kusznir <jim(a)palousetech.com>
> wrote:
>
>> So, on ovirt2, I ran hosted-engine --vm-status; it took over 2 minutes to
>> return, but two of the 3 machines had current (non-stale) data:
>>
>> [root@ovirt2 ~]# hosted-engine --vm-status
>>
>>
>> --== Host 1 status ==--
>>
>> conf_on_shared_storage : True
>> Status up-to-date : False
>> Hostname :
ovirt1.nwfiber.com
>> Host ID : 1
>> Engine status : unknown stale-data
>> Score : 3400
>> stopped : False
>> Local maintenance : False
>> crc32 : f562054b
>> local_conf_timestamp : 7368
>> Host timestamp : 7367
>> Extra metadata (valid at timestamp):
>> metadata_parse_version=1
>> metadata_feature_version=1
>> timestamp=7367 (Wed May 30 00:05:47 2018)
>> host-id=1
>> score=3400
>> vm_conf_refresh_time=7368 (Wed May 30 00:05:48 2018)
>> conf_on_shared_storage=True
>> maintenance=False
>> state=EngineStart
>> stopped=False
>>
>>
>> --== Host 2 status ==--
>>
>> conf_on_shared_storage : True
>> Status up-to-date : True
>> Hostname :
ovirt2.nwfiber.com
>> Host ID : 2
>> Engine status : {"reason": "bad vm
status",
>> "health": "bad", "vm": "down_unexpected",
"detail": "Down"}
>> Score : 0
>> stopped : False
>> Local maintenance : False
>> crc32 : 6822f86a
>> local_conf_timestamp : 3610
>> Host timestamp : 3610
>> Extra metadata (valid at timestamp):
>> metadata_parse_version=1
>> metadata_feature_version=1
>> timestamp=3610 (Wed May 30 00:04:28 2018)
>> host-id=2
>> score=0
>> vm_conf_refresh_time=3610 (Wed May 30 00:04:28 2018)
>> conf_on_shared_storage=True
>> maintenance=False
>> state=EngineUnexpectedlyDown
>> stopped=False
>> timeout=Wed Dec 31 17:09:30 1969
>>
>>
>> --== Host 3 status ==--
>>
>> conf_on_shared_storage : True
>> Status up-to-date : True
>> Hostname :
ovirt3.nwfiber.com
>> Host ID : 3
>> Engine status : {"reason": "vm not running on
this
>> host", "health": "bad", "vm":
"down", "detail": "unknown"}
>> Score : 3400
>> stopped : False
>> Local maintenance : False
>> crc32 : 006512a3
>> local_conf_timestamp : 99294
>> Host timestamp : 99294
>> Extra metadata (valid at timestamp):
>> metadata_parse_version=1
>> metadata_feature_version=1
>> timestamp=99294 (Wed May 30 00:04:36 2018)
>> host-id=3
>> score=3400
>> vm_conf_refresh_time=99294 (Wed May 30 00:04:36 2018)
>> conf_on_shared_storage=True
>> maintenance=False
>> state=EngineStarting
>> stopped=False
>> ---------------
>>
>> I saw that the HA score was 0 on ovirt2 (where I had been focusing my
>> efforts), while ovirt3 had a good HA score, so I tried to launch the
>> engine there:
>>
>> [root@ovirt3 ff3d0185-d120-40ee-9792-ebbd5caaca8b]# hosted-engine
>> --vm-start
>> Command VM.getStats with args {'vmID':
'50392390-4f89-435c-bf3b-8254b58a4ef7'}
>> failed:
>> (code=1, message=Virtual machine does not exist: {'vmId':
>> u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
>>
>>
>> ovirt1 was returning the "ha-agent or storage not online" message, but
>> now it's hanging (probably for 2 minutes), so it may actually return
>> status this time. I figure if it can't return status, there's no point in
>> trying to start the VM there.
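>>
>> (What I've been checking on each host before attempting a start - just my
>> own quick sanity checks, not an official procedure:)
>>
>>   systemctl status ovirt-ha-agent ovirt-ha-broker   # are the HA services running?
>>   hosted-engine --connect-storage                   # (re)connect the hosted-engine storage domain
>>   hosted-engine --vm-status                         # should return promptly once storage is healthy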
>>
>>
>>
>> On Wed, May 30, 2018 at 12:08 AM, Jim Kusznir <jim(a)palousetech.com>
>> wrote:
>>
>>> The agent.log was sent minutes ago in another message. Here's the
>>> gluster client log for the engine volume:
>>>
>>> [2018-05-30 06:31:12.343709] I [MSGID: 108026]
>>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>>> 0-engine-replicate-0: performing metadata selfheal on
>>> 3a430e50-48cb-463c-8657-aab5458ac155
>>> [2018-05-30 06:31:12.348210] W [MSGID: 108015]
>>> [afr-self-heal-entry.c:56:afr_selfheal_entry_delete]
>>> 0-engine-replicate-0: expunging file
117c5234-e6d6-4f03-809b-a4ef6a6fac4d/hosted-engine.metadata
>>> (38a277cd-113d-4ca8-8f48-fadc944d3272) on engine-client-1
>>> [2018-05-30 06:31:12.360602] I [MSGID: 108026]
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>> Completed metadata selfheal on 3a430e50-48cb-463c-8657-aab5458ac155.
>>> sources=[0] 2 sinks=1
>>> [2018-05-30 06:31:12.379169] I [MSGID: 108026]
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>> Completed entry selfheal on 117c5234-e6d6-4f03-809b-a4ef6a6fac4d.
>>> sources=[0] 2 sinks=1
>>> [2018-05-30 06:32:54.756729] I [MSGID: 108026]
>>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>>> 0-engine-replicate-0: performing metadata selfheal on
>>> 2ecf0f33-1385-416e-8619-129bdb62b61f
>>> [2018-05-30 06:32:54.763175] I [MSGID: 108026]
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>> Completed metadata selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>> sources=[0] 2 sinks=1
>>> [2018-05-30 06:37:06.581538] I [MSGID: 108026]
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>> Completed data selfheal on 06c557a9-b550-4e25-b190-804e50cea65b.
>>> sources=[0] 2 sinks=1
>>> [2018-05-30 06:39:55.739302] I [MSGID: 108026]
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>> sources=[0] 2 sinks=1
>>> [2018-05-30 06:44:04.911825] I [MSGID: 108026]
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>> sources=[0] 2 sinks=1
>>> [2018-05-30 06:58:05.123854] I [MSGID: 108026]
>>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>>> 0-engine-replicate-0: performing metadata selfheal on
>>> 03ca7a59-e8ee-4a70-bc68-47e972aeb801
>>> [2018-05-30 06:58:05.131929] I [MSGID: 108026]
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>> Completed metadata selfheal on 03ca7a59-e8ee-4a70-bc68-47e972aeb801.
>>> sources=[0] 2 sinks=1
>>> [2018-05-30 07:01:42.583991] I [MSGID: 108026]
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>> sources=[0] 2 sinks=1
>>> [2018-05-30 07:05:48.777982] I [MSGID: 108026]
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>> sources=[0] 2 sinks=1
>>>
>>> --------
>>> broker log:
>>> --------
>>> Thread-1::INFO::2018-05-30 00:07:13,241::ping::60::ping.Ping::(action)
>>> Successfully pinged 192.168.8.1
>>> Listener::INFO::2018-05-30 00:07:14,344::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:07:19,356::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Thread-5::INFO::2018-05-30 00:07:22,081::engine_health::2
>>> 02::engine_health.EngineHealth::(_result_from_stats) VM not running on
>>> this host, status Down
>>> Listener::INFO::2018-05-30 00:07:24,369::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:07:29,382::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:07:34,393::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:07:39,406::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:07:44,419::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:07:49,431::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:07:54,444::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:07:59,452::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Thread-2::INFO::2018-05-30 00:08:03,736::mgmt_bridge::62:
>>> :mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
>>> Listener::INFO::2018-05-30 00:08:04,460::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:08:09,468::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Thread-4::INFO::2018-05-30 00:08:11,461::cpu_load_no_engi
>>> ne::116::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) VM stats
>>> do not contain cpu usage. VM might be down.
>>> Thread-4::INFO::2018-05-30 00:08:11,461::cpu_load_no_engi
>>> ne::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System
>>> load total=0.0169, engine=0.0000, non-engine=0.0169
>>> Listener::INFO::2018-05-30 00:08:14,481::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Listener::INFO::2018-05-30 00:08:19,490::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>> Thread-5::INFO::2018-05-30 00:08:22,175::engine_health::2
>>> 02::engine_health.EngineHealth::(_result_from_stats) VM not running on
>>> this host, status Down
>>> Thread-1::INFO::2018-05-30 00:08:22,436::ping::60::ping.Ping::(action)
>>> Successfully pinged 192.168.8.1
>>> Listener::INFO::2018-05-30 00:08:24,503::storage_broker::304::
>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>
>>>
>>> On Wed, May 30, 2018 at 12:03 AM, Sahina Bose <sabose(a)redhat.com>
>>> wrote:
>>>
>>>> [Adding gluster-users back]
>>>>
>>>> Nothing amiss with volume info and status.
>>>> Can you check the agent.log and broker.log - they will be under
>>>> /var/log/ovirt-hosted-engine-ha/
>>>>
>>>> Also the gluster client logs - under
>>>> /var/log/glusterfs/rhev-data-center-mnt-glusterSD<volume>.log
>>>>
>>>>
>>>>
>>>> On Wed, May 30, 2018 at 12:08 PM, Jim Kusznir
<jim(a)palousetech.com>
>>>> wrote:
>>>>
>>>>> I believe the gluster data store for the engine is up and working
>>>>> correctly. The rest are not mounted, as the engine hasn't managed to
>>>>> start correctly yet. I did copy a junk file into the data store, checked
>>>>> that it was there on another host, then deleted it and saw that it
>>>>> disappeared on the first host; it passed that test. Here's the info and
>>>>> status. (I have NOT performed the steps that Krutika and Ravishankar
>>>>> suggested yet, as I don't have my data volumes working again yet.)
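>>>>>
>>>>> (The junk-file test was nothing more elaborate than the sketch below,
>>>>> run against the mounted engine storage domain - the mount path shown is
>>>>> a placeholder for whatever /rhev/data-center/mnt/glusterSD/... path
>>>>> vdsm uses on these hosts:)
>>>>>
>>>>>   # on host A: drop a marker file onto the mounted gluster volume
>>>>>   echo "replication test $(date)" > /rhev/data-center/mnt/glusterSD/<server>:_engine/junkfile
>>>>>   # on host B: confirm it replicated, then remove it
>>>>>   cat /rhev/data-center/mnt/glusterSD/<server>:_engine/junkfile
>>>>>   rm /rhev/data-center/mnt/glusterSD/<server>:_engine/junkfile
>>>>>   # back on host A: the file should now be gone
>>>>>   ls /rhev/data-center/mnt/glusterSD/<server>:_engine/junkfile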
>>>>>
>>>>> [root@ovirt2 images]# gluster volume info
>>>>>
>>>>> Volume Name: data
>>>>> Type: Replicate
>>>>> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
>>>>> Status: Started
>>>>> Snapshot Count: 0
>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
>>>>> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
>>>>> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
>>>>> Options Reconfigured:
>>>>> diagnostics.count-fop-hits: on
>>>>> diagnostics.latency-measurement: on
>>>>> changelog.changelog: on
>>>>> geo-replication.ignore-pid-check: on
>>>>> geo-replication.indexing: on
>>>>> server.allow-insecure: on
>>>>> performance.readdir-ahead: on
>>>>> performance.quick-read: off
>>>>> performance.read-ahead: off
>>>>> performance.io-cache: off
>>>>> performance.stat-prefetch: off
>>>>> cluster.eager-lock: enable
>>>>> network.remote-dio: enable
>>>>> cluster.quorum-type: auto
>>>>> cluster.server-quorum-type: server
>>>>> storage.owner-uid: 36
>>>>> storage.owner-gid: 36
>>>>> features.shard: on
>>>>> features.shard-block-size: 512MB
>>>>> performance.low-prio-threads: 32
>>>>> cluster.data-self-heal-algorithm: full
>>>>> cluster.locking-scheme: granular
>>>>> cluster.shd-wait-qlength: 10000
>>>>> cluster.shd-max-threads: 8
>>>>> network.ping-timeout: 30
>>>>> user.cifs: off
>>>>> nfs.disable: on
>>>>> performance.strict-o-direct: on
>>>>>
>>>>> Volume Name: data-hdd
>>>>> Type: Replicate
>>>>> Volume ID: d342a3ab-16f3-49f0-bbcf-f788be8ac5f1
>>>>> Status: Started
>>>>> Snapshot Count: 0
>>>>> Number of Bricks: 1 x 3 = 3
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: 172.172.1.11:/gluster/brick3/data-hdd
>>>>> Brick2: 172.172.1.12:/gluster/brick3/data-hdd
>>>>> Brick3: 172.172.1.13:/gluster/brick3/data-hdd
>>>>> Options Reconfigured:
>>>>> diagnostics.count-fop-hits: on
>>>>> diagnostics.latency-measurement: on
>>>>> network.ping-timeout: 30
>>>>> server.allow-insecure: on
>>>>> storage.owner-gid: 36
>>>>> storage.owner-uid: 36
>>>>> user.cifs: off
>>>>> features.shard: on
>>>>> cluster.shd-wait-qlength: 10000
>>>>> cluster.shd-max-threads: 8
>>>>> cluster.locking-scheme: granular
>>>>> cluster.data-self-heal-algorithm: full
>>>>> cluster.server-quorum-type: server
>>>>> cluster.quorum-type: auto
>>>>> cluster.eager-lock: enable
>>>>> network.remote-dio: enable
>>>>> performance.low-prio-threads: 32
>>>>> performance.stat-prefetch: off
>>>>> performance.io-cache: off
>>>>> performance.read-ahead: off
>>>>> performance.quick-read: off
>>>>> changelog.changelog: on
>>>>> geo-replication.ignore-pid-check: on
>>>>> geo-replication.indexing: on
>>>>> transport.address-family: inet
>>>>> performance.readdir-ahead: on
>>>>>
>>>>> Volume Name: engine
>>>>> Type: Replicate
>>>>> Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de
>>>>> Status: Started
>>>>> Snapshot Count: 0
>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine
>>>>> Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine
>>>>> Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter)
>>>>> Options Reconfigured:
>>>>> changelog.changelog: on
>>>>> geo-replication.ignore-pid-check: on
>>>>> geo-replication.indexing: on
>>>>> performance.readdir-ahead: on
>>>>> performance.quick-read: off
>>>>> performance.read-ahead: off
>>>>> performance.io-cache: off
>>>>> performance.stat-prefetch: off
>>>>> cluster.eager-lock: enable
>>>>> network.remote-dio: off
>>>>> cluster.quorum-type: auto
>>>>> cluster.server-quorum-type: server
>>>>> storage.owner-uid: 36
>>>>> storage.owner-gid: 36
>>>>> features.shard: on
>>>>> features.shard-block-size: 512MB
>>>>> performance.low-prio-threads: 32
>>>>> cluster.data-self-heal-algorithm: full
>>>>> cluster.locking-scheme: granular
>>>>> cluster.shd-wait-qlength: 10000
>>>>> cluster.shd-max-threads: 6
>>>>> network.ping-timeout: 30
>>>>> user.cifs: off
>>>>> nfs.disable: on
>>>>> performance.strict-o-direct: on
>>>>>
>>>>> Volume Name: iso
>>>>> Type: Replicate
>>>>> Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92
>>>>> Status: Started
>>>>> Snapshot Count: 0
>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso
>>>>> Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso
>>>>> Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter)
>>>>> Options Reconfigured:
>>>>> performance.strict-o-direct: on
>>>>> nfs.disable: on
>>>>> user.cifs: off
>>>>> network.ping-timeout: 30
>>>>> cluster.shd-max-threads: 6
>>>>> cluster.shd-wait-qlength: 10000
>>>>> cluster.locking-scheme: granular
>>>>> cluster.data-self-heal-algorithm: full
>>>>> performance.low-prio-threads: 32
>>>>> features.shard-block-size: 512MB
>>>>> features.shard: on
>>>>> storage.owner-gid: 36
>>>>> storage.owner-uid: 36
>>>>> cluster.server-quorum-type: server
>>>>> cluster.quorum-type: auto
>>>>> network.remote-dio: off
>>>>> cluster.eager-lock: enable
>>>>> performance.stat-prefetch: off
>>>>> performance.io-cache: off
>>>>> performance.read-ahead: off
>>>>> performance.quick-read: off
>>>>> performance.readdir-ahead: on
>>>>>
>>>>>
>>>>> [root@ovirt2 images]# gluster volume status
>>>>> Status of volume: data
>>>>> Gluster process                                  TCP Port  RDMA Port  Online  Pid
>>>>> ---------------------------------------------------------------------------------
>>>>> Brick ovirt1.nwfiber.com:/gluster/brick2/data    49152     0          Y       3226
>>>>> Brick ovirt2.nwfiber.com:/gluster/brick2/data    49152     0          Y       2967
>>>>> Brick ovirt3.nwfiber.com:/gluster/brick2/data    49152     0          Y       2554
>>>>> Self-heal Daemon on localhost                    N/A       N/A        Y       4818
>>>>> Self-heal Daemon on ovirt3.nwfiber.com           N/A       N/A        Y       17853
>>>>> Self-heal Daemon on ovirt1.nwfiber.com           N/A       N/A        Y       4771
>>>>>
>>>>> Task Status of Volume data
>>>>> ---------------------------------------------------------------------------------
>>>>> There are no active volume tasks
>>>>>
>>>>> Status of volume: data-hdd
>>>>> Gluster process                                  TCP Port  RDMA Port  Online  Pid
>>>>> ---------------------------------------------------------------------------------
>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd      49153     0          Y       3232
>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd      49153     0          Y       2976
>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd      N/A       N/A        N       N/A
>>>>> NFS Server on localhost                          N/A       N/A        N       N/A
>>>>> Self-heal Daemon on localhost                    N/A       N/A        Y       4818
>>>>> NFS Server on ovirt3.nwfiber.com                 N/A       N/A        N       N/A
>>>>> Self-heal Daemon on ovirt3.nwfiber.com           N/A       N/A        Y       17853
>>>>> NFS Server on ovirt1.nwfiber.com                 N/A       N/A        N       N/A
>>>>> Self-heal Daemon on ovirt1.nwfiber.com           N/A       N/A        Y       4771
>>>>>
>>>>> Task Status of Volume data-hdd
>>>>> ---------------------------------------------------------------------------------
>>>>> There are no active volume tasks
>>>>>
>>>>> Status of volume: engine
>>>>> Gluster process                                  TCP Port  RDMA Port  Online  Pid
>>>>> ---------------------------------------------------------------------------------
>>>>> Brick ovirt1.nwfiber.com:/gluster/brick1/engine  49154     0          Y       3239
>>>>> Brick ovirt2.nwfiber.com:/gluster/brick1/engine  49154     0          Y       2982
>>>>> Brick ovirt3.nwfiber.com:/gluster/brick1/engine  49154     0          Y       2578
>>>>> Self-heal Daemon on localhost                    N/A       N/A        Y       4818
>>>>> Self-heal Daemon on ovirt3.nwfiber.com           N/A       N/A        Y       17853
>>>>> Self-heal Daemon on ovirt1.nwfiber.com           N/A       N/A        Y       4771
>>>>>
>>>>> Task Status of Volume engine
>>>>> ---------------------------------------------------------------------------------
>>>>> There are no active volume tasks
>>>>>
>>>>> Status of volume: iso
>>>>> Gluster process                                  TCP Port  RDMA Port  Online  Pid
>>>>> ---------------------------------------------------------------------------------
>>>>> Brick ovirt1.nwfiber.com:/gluster/brick4/iso     49155     0          Y       3247
>>>>> Brick ovirt2.nwfiber.com:/gluster/brick4/iso     49155     0          Y       2990
>>>>> Brick ovirt3.nwfiber.com:/gluster/brick4/iso     49155     0          Y       2580
>>>>> Self-heal Daemon on localhost                    N/A       N/A        Y       4818
>>>>> Self-heal Daemon on ovirt3.nwfiber.com           N/A       N/A        Y       17853
>>>>> Self-heal Daemon on ovirt1.nwfiber.com           N/A       N/A        Y       4771
>>>>>
>>>>> Task Status of Volume iso
>>>>> ---------------------------------------------------------------------------------
>>>>> There are no active volume tasks
>>>>>
>>>>>
>>>>> On Tue, May 29, 2018 at 11:30 PM, Sahina Bose
<sabose(a)redhat.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 30, 2018 at 10:42 AM, Jim Kusznir
<jim(a)palousetech.com>
>>>>>> wrote:
>>>>>>
>>>>>>> hosted-engine --deploy failed (it would not come up on my existing
>>>>>>> gluster storage). However, I realized no changes were written to my
>>>>>>> existing storage, so I went back to trying to get my old engine
>>>>>>> running.
>>>>>>>
>>>>>>> hosted-engine --vm-status is now taking a very long time (5+ minutes)
>>>>>>> to return, and it returns stale information everywhere. I thought
>>>>>>> perhaps the lockspace was corrupt, so I tried to clean that and the
>>>>>>> metadata, but both are failing (--clean-metadata has hung and I can't
>>>>>>> even ctrl-c out of it).
>>>>>>>
>>>>>>> How can I reinitialize all of the lockspace/metadata safely? There is
>>>>>>> no engine or VMs running currently....
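>>>>>>>
>>>>>>> (For reference, these are the commands I've been attempting, in case
>>>>>>> I'm holding them wrong - both straight from hosted-engine's help
>>>>>>> output, and I assume they should be run with the HA agent/broker
>>>>>>> stopped:)
>>>>>>>
>>>>>>>   systemctl stop ovirt-ha-agent ovirt-ha-broker
>>>>>>>   hosted-engine --clean-metadata --force-clean    # this is the one that hangs
>>>>>>>   hosted-engine --reinitialize-lockspace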
>>>>>>>
>>>>>>
>>>>>> I think the first thing to make sure of is that your storage is up
>>>>>> and running. Can you mount the gluster volumes and access the contents
>>>>>> there?
>>>>>> Please provide the gluster volume info and gluster volume status output
>>>>>> for the volumes that you're using.
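>>>>>>
>>>>>> (For example, something like this from one of the hosts - adjust the
>>>>>> server and volume names to match your setup:)
>>>>>>
>>>>>>   mkdir -p /mnt/gluster-check
>>>>>>   mount -t glusterfs ovirt1.nwfiber.com:/engine /mnt/gluster-check
>>>>>>   ls -l /mnt/gluster-check      # should list the storage domain contents
>>>>>>   umount /mnt/gluster-check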
>>>>>>
>>>>>>
>>>>>>
>>>>>>> --Jim
>>>>>>>
>>>>>>> On Tue, May 29, 2018 at 9:33 PM, Jim Kusznir
<jim(a)palousetech.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Well, things went from bad to very, very bad....
>>>>>>>>
>>>>>>>> It appears that during one of the 2-minute lockups, the fencing
>>>>>>>> agents decided that another node in the cluster was down. As a
>>>>>>>> result, 2 of the 3 nodes were simultaneously reset by the fencing
>>>>>>>> agent. After the nodes came back up, the engine would not start, and
>>>>>>>> all running VMs (including VMs on the 3rd node that was not rebooted)
>>>>>>>> crashed.
>>>>>>>>
>>>>>>>> I've now been working for about 3 hours trying to get the engine to
>>>>>>>> come up, and I don't know why it won't start. hosted-engine
>>>>>>>> --vm-start says it's starting, but it never does (virsh doesn't show
>>>>>>>> any VMs running). I'm currently running --deploy, as I had run out of
>>>>>>>> other options. I hope this will let me re-import all my existing VMs
>>>>>>>> and start them back up once everything is running again.
>>>>>>>>
>>>>>>>> I do have an unverified geo-rep backup; I don't know if it is a good
>>>>>>>> backup (there were several prior messages to this list about it, but
>>>>>>>> I didn't get replies to my questions). It was behaving in a way I
>>>>>>>> believe to be strange, and the data directories are larger than their
>>>>>>>> source.
>>>>>>>>
>>>>>>>> I'll see if my --deploy works, and if not, I'll be back with another
>>>>>>>> message/help request.
>>>>>>>>
>>>>>>>> When the dust settles and I'm at least minimally functional again, I
>>>>>>>> really want to understand why all these technologies designed to
>>>>>>>> offer redundancy conspired to reduce uptime and create failures where
>>>>>>>> there weren't any otherwise. I thought that with hosted engine, 3
>>>>>>>> oVirt servers, and glusterfs with a minimum of replica 2+arbiter or
>>>>>>>> replica 3, I should have had strong resilience against server or disk
>>>>>>>> failure, and should have been able to prevent or recover from data
>>>>>>>> corruption. Instead, all of the above happened (once I get my cluster
>>>>>>>> back up, I still have to try to recover my webserver VM, which won't
>>>>>>>> boot due to XFS corrupt-journal issues created during the gluster
>>>>>>>> crashes). I think a lot of these issues were rooted in the upgrade
>>>>>>>> from 4.1 to 4.2.
>>>>>>>>
>>>>>>>> --Jim
>>>>>>>>
>>>>>>>> On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir
<jim(a)palousetech.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I also finally found the following in my system log
on one server:
>>>>>>>>>
>>>>>>>>> [10679.524491] INFO: task glusterclogro:14933 blocked
for more
>>>>>>>>> than 120 seconds.
>>>>>>>>> [10679.525826] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>> disables this message.
>>>>>>>>> [10679.527144] glusterclogro D ffff97209832bf40
0 14933
>>>>>>>>> 1 0x00000080
>>>>>>>>> [10679.527150] Call Trace:
>>>>>>>>> [10679.527161] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>> [10679.527218] [<ffffffffc060e388>]
>>>>>>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>> [10679.527225] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>>>> [10679.527254] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>>>> [xfs]
>>>>>>>>> [10679.527260] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>>>> [10679.527268] [<ffffffffb992076f>] ?
>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [10679.527271] [<ffffffffb944f3d0>]
SyS_fsync+0x10/0x20
>>>>>>>>> [10679.527275] [<ffffffffb992082f>]
>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>> [10679.527279] [<ffffffffb992077b>] ?
>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [10679.527283] INFO: task glusterposixfsy:14941
blocked for more
>>>>>>>>> than 120 seconds.
>>>>>>>>> [10679.528608] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>> disables this message.
>>>>>>>>> [10679.529956] glusterposixfsy D ffff972495f84f10
0 14941
>>>>>>>>> 1 0x00000080
>>>>>>>>> [10679.529961] Call Trace:
>>>>>>>>> [10679.529966] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>> [10679.530003] [<ffffffffc060e388>]
>>>>>>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>> [10679.530008] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>>>> [10679.530038] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>>>> [xfs]
>>>>>>>>> [10679.530042] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>>>> [10679.530046] [<ffffffffb992076f>] ?
>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [10679.530050] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>>>>> [10679.530054] [<ffffffffb992082f>]
>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>> [10679.530058] [<ffffffffb992077b>] ?
>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [10679.530062] INFO: task glusteriotwr13:15486
blocked for more
>>>>>>>>> than 120 seconds.
>>>>>>>>> [10679.531805] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>> disables this message.
>>>>>>>>> [10679.533732] glusteriotwr13 D ffff9720a83f0000
0 15486
>>>>>>>>> 1 0x00000080
>>>>>>>>> [10679.533738] Call Trace:
>>>>>>>>> [10679.533747] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>> [10679.533799] [<ffffffffc060e388>]
>>>>>>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>> [10679.533806] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>>>> [10679.533846] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>>>> [xfs]
>>>>>>>>> [10679.533852] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>>>> [10679.533858] [<ffffffffb992076f>] ?
>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [10679.533863] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>>>>> [10679.533868] [<ffffffffb992082f>]
>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>> [10679.533873] [<ffffffffb992077b>] ?
>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [10919.512757] INFO: task glusterclogro:14933 blocked
for more
>>>>>>>>> than 120 seconds.
>>>>>>>>> [10919.514714] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>> disables this message.
>>>>>>>>> [10919.516663] glusterclogro D ffff97209832bf40
0 14933
>>>>>>>>> 1 0x00000080
>>>>>>>>> [10919.516677] Call Trace:
>>>>>>>>> [10919.516690] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>> [10919.516696] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>>>>> [10919.516703] [<ffffffffb951cc04>] ?
blk_finish_plug+0x14/0x40
>>>>>>>>> [10919.516768] [<ffffffffc05e9224>] ?
>>>>>>>>> _xfs_buf_ioapply+0x334/0x460 [xfs]
>>>>>>>>> [10919.516774] [<ffffffffb991432d>]
>>>>>>>>> wait_for_completion+0xfd/0x140
>>>>>>>>> [10919.516782] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>>>> [10919.516821] [<ffffffffc05eb0a3>] ?
_xfs_buf_read+0x23/0x40
>>>>>>>>> [xfs]
>>>>>>>>> [10919.516859] [<ffffffffc05eafa9>]
>>>>>>>>> xfs_buf_submit_wait+0xf9/0x1d0 [xfs]
>>>>>>>>> [10919.516902] [<ffffffffc061b279>] ?
>>>>>>>>> xfs_trans_read_buf_map+0x199/0x400 [xfs]
>>>>>>>>> [10919.516940] [<ffffffffc05eb0a3>]
_xfs_buf_read+0x23/0x40 [xfs]
>>>>>>>>> [10919.516977] [<ffffffffc05eb1b9>]
xfs_buf_read_map+0xf9/0x160
>>>>>>>>> [xfs]
>>>>>>>>> [10919.517022] [<ffffffffc061b279>]
>>>>>>>>> xfs_trans_read_buf_map+0x199/0x400 [xfs]
>>>>>>>>> [10919.517057] [<ffffffffc05c8d04>]
xfs_da_read_buf+0xd4/0x100
>>>>>>>>> [xfs]
>>>>>>>>> [10919.517091] [<ffffffffc05c8d53>]
xfs_da3_node_read+0x23/0xd0
>>>>>>>>> [xfs]
>>>>>>>>> [10919.517126] [<ffffffffc05c9fee>]
>>>>>>>>> xfs_da3_node_lookup_int+0x6e/0x2f0 [xfs]
>>>>>>>>> [10919.517160] [<ffffffffc05d5a1d>]
>>>>>>>>> xfs_dir2_node_lookup+0x4d/0x170 [xfs]
>>>>>>>>> [10919.517194] [<ffffffffc05ccf5d>]
xfs_dir_lookup+0x1bd/0x1e0
>>>>>>>>> [xfs]
>>>>>>>>> [10919.517233] [<ffffffffc05fd8d9>]
xfs_lookup+0x69/0x140 [xfs]
>>>>>>>>> [10919.517271] [<ffffffffc05fa018>]
xfs_vn_lookup+0x78/0xc0 [xfs]
>>>>>>>>> [10919.517278] [<ffffffffb9425cf3>]
lookup_real+0x23/0x60
>>>>>>>>> [10919.517283] [<ffffffffb9426702>]
__lookup_hash+0x42/0x60
>>>>>>>>> [10919.517288] [<ffffffffb942d519>]
SYSC_renameat2+0x3a9/0x5a0
>>>>>>>>> [10919.517296] [<ffffffffb94d3753>] ?
>>>>>>>>> selinux_file_free_security+0x23/0x30
>>>>>>>>> [10919.517304] [<ffffffffb992077b>] ?
>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [10919.517309] [<ffffffffb992076f>] ?
>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [10919.517313] [<ffffffffb992077b>] ?
>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [10919.517318] [<ffffffffb992076f>] ?
>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [10919.517323] [<ffffffffb942e58e>]
SyS_renameat2+0xe/0x10
>>>>>>>>> [10919.517328] [<ffffffffb942e5ce>]
SyS_rename+0x1e/0x20
>>>>>>>>> [10919.517333] [<ffffffffb992082f>]
>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>> [10919.517339] [<ffffffffb992077b>] ?
>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [11159.496095] INFO: task glusteriotwr9:15482 blocked
for more
>>>>>>>>> than 120 seconds.
>>>>>>>>> [11159.497546] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>> disables this message.
>>>>>>>>> [11159.498978] glusteriotwr9 D ffff971fa0fa1fa0
0 15482
>>>>>>>>> 1 0x00000080
>>>>>>>>> [11159.498984] Call Trace:
>>>>>>>>> [11159.498995] [<ffffffffb9911f00>] ?
bit_wait+0x50/0x50
>>>>>>>>> [11159.498999] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>> [11159.499003] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>>>>> [11159.499056] [<ffffffffc05dd9b7>] ?
>>>>>>>>> xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs]
>>>>>>>>> [11159.499082] [<ffffffffc05dd43e>] ?
>>>>>>>>> xfs_iext_bno_to_irec+0x8e/0xd0 [xfs]
>>>>>>>>> [11159.499090] [<ffffffffb92f7a12>] ?
ktime_get_ts64+0x52/0xf0
>>>>>>>>> [11159.499093] [<ffffffffb9911f00>] ?
bit_wait+0x50/0x50
>>>>>>>>> [11159.499097] [<ffffffffb991348d>]
>>>>>>>>> io_schedule_timeout+0xad/0x130
>>>>>>>>> [11159.499101] [<ffffffffb9913528>]
io_schedule+0x18/0x20
>>>>>>>>> [11159.499104] [<ffffffffb9911f11>]
bit_wait_io+0x11/0x50
>>>>>>>>> [11159.499107] [<ffffffffb9911ac1>]
__wait_on_bit_lock+0x61/0xc0
>>>>>>>>> [11159.499113] [<ffffffffb9393634>]
__lock_page+0x74/0x90
>>>>>>>>> [11159.499118] [<ffffffffb92bc210>] ?
wake_bit_function+0x40/0x40
>>>>>>>>> [11159.499121] [<ffffffffb9394154>]
__find_lock_page+0x54/0x70
>>>>>>>>> [11159.499125] [<ffffffffb9394e85>]
>>>>>>>>> grab_cache_page_write_begin+0x55/0xc0
>>>>>>>>> [11159.499130] [<ffffffffb9484b76>]
iomap_write_begin+0x66/0x100
>>>>>>>>> [11159.499135] [<ffffffffb9484edf>]
iomap_write_actor+0xcf/0x1d0
>>>>>>>>> [11159.499140] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>>>>>> [11159.499144] [<ffffffffb94854e7>]
iomap_apply+0xb7/0x150
>>>>>>>>> [11159.499149] [<ffffffffb9485621>]
>>>>>>>>> iomap_file_buffered_write+0xa1/0xe0
>>>>>>>>> [11159.499153] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>>>>>> [11159.499182] [<ffffffffc05f025d>]
>>>>>>>>> xfs_file_buffered_aio_write+0x12d/0x2c0 [xfs]
>>>>>>>>> [11159.499213] [<ffffffffc05f057d>]
>>>>>>>>> xfs_file_aio_write+0x18d/0x1b0 [xfs]
>>>>>>>>> [11159.499217] [<ffffffffb941a533>]
do_sync_write+0x93/0xe0
>>>>>>>>> [11159.499222] [<ffffffffb941b010>]
vfs_write+0xc0/0x1f0
>>>>>>>>> [11159.499225] [<ffffffffb941c002>]
SyS_pwrite64+0x92/0xc0
>>>>>>>>> [11159.499230] [<ffffffffb992076f>] ?
>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [11159.499234] [<ffffffffb992082f>]
>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>> [11159.499238] [<ffffffffb992077b>] ?
>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [11279.488720] INFO: task xfsaild/dm-10:1134 blocked
for more
>>>>>>>>> than 120 seconds.
>>>>>>>>> [11279.490197] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>> disables this message.
>>>>>>>>> [11279.491665] xfsaild/dm-10 D ffff9720a8660fd0
0 1134
>>>>>>>>> 2 0x00000000
>>>>>>>>> [11279.491671] Call Trace:
>>>>>>>>> [11279.491682] [<ffffffffb92a3a2e>] ?
>>>>>>>>> try_to_del_timer_sync+0x5e/0x90
>>>>>>>>> [11279.491688] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>> [11279.491744] [<ffffffffc060de36>]
_xfs_log_force+0x1c6/0x2c0
>>>>>>>>> [xfs]
>>>>>>>>> [11279.491750] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>>>> [11279.491783] [<ffffffffc0619fec>] ?
xfsaild+0x16c/0x6f0 [xfs]
>>>>>>>>> [11279.491817] [<ffffffffc060df5c>]
xfs_log_force+0x2c/0x70 [xfs]
>>>>>>>>> [11279.491849] [<ffffffffc0619e80>] ?
>>>>>>>>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>>>>>>>>> [11279.491880] [<ffffffffc0619fec>]
xfsaild+0x16c/0x6f0 [xfs]
>>>>>>>>> [11279.491913] [<ffffffffc0619e80>] ?
>>>>>>>>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>>>>>>>>> [11279.491919] [<ffffffffb92bb161>]
kthread+0xd1/0xe0
>>>>>>>>> [11279.491926] [<ffffffffb92bb090>] ?
>>>>>>>>> insert_kthread_work+0x40/0x40
>>>>>>>>> [11279.491932] [<ffffffffb9920677>]
>>>>>>>>> ret_from_fork_nospec_begin+0x21/0x21
>>>>>>>>> [11279.491936] [<ffffffffb92bb090>] ?
>>>>>>>>> insert_kthread_work+0x40/0x40
>>>>>>>>> [11279.491976] INFO: task glusterclogfsyn:14934
blocked for more
>>>>>>>>> than 120 seconds.
>>>>>>>>> [11279.493466] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>> disables this message.
>>>>>>>>> [11279.494952] glusterclogfsyn D ffff97209832af70
0 14934
>>>>>>>>> 1 0x00000080
>>>>>>>>> [11279.494957] Call Trace:
>>>>>>>>> [11279.494979] [<ffffffffc0309839>] ?
>>>>>>>>> __split_and_process_bio+0x2e9/0x520 [dm_mod]
>>>>>>>>> [11279.494983] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>>>>> [11279.494987] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0
>>>>>>>>> [11279.494997] [<ffffffffc0309d98>] ? dm_make_request+0x128/0x1a0 [dm_mod]
>>>>>>>>> [11279.495001] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130
>>>>>>>>> [11279.495005] [<ffffffffb99145ad>] wait_for_completion_io+0xfd/0x140
>>>>>>>>> [11279.495010] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>>>>>> [11279.495016] [<ffffffffb951e574>] blkdev_issue_flush+0xb4/0x110
>>>>>>>>> [11279.495049] [<ffffffffc06064b9>] xfs_blkdev_issue_flush+0x19/0x20 [xfs]
>>>>>>>>> [11279.495079] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 [xfs]
>>>>>>>>> [11279.495086] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>>>>> [11279.495090] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [11279.495094] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>>>>>> [11279.495098] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21
>>>>>>>>> [11279.495102] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [11279.495105] INFO: task glusterposixfsy:14941 blocked for more than 120 seconds.
>>>>>>>>> [11279.496606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>> [11279.498114] glusterposixfsy D ffff972495f84f10     0 14941      1 0x00000080
>>>>>>>>> [11279.498118] Call Trace:
>>>>>>>>> [11279.498134] [<ffffffffc0309839>] ? __split_and_process_bio+0x2e9/0x520 [dm_mod]
>>>>>>>>> [11279.498138] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>>>>> [11279.498142] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0
>>>>>>>>> [11279.498152] [<ffffffffc0309d98>] ? dm_make_request+0x128/0x1a0 [dm_mod]
>>>>>>>>> [11279.498156] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130
>>>>>>>>> [11279.498160] [<ffffffffb99145ad>] wait_for_completion_io+0xfd/0x140
>>>>>>>>> [11279.498165] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>>>>>> [11279.498169] [<ffffffffb951e574>] blkdev_issue_flush+0xb4/0x110
>>>>>>>>> [11279.498202] [<ffffffffc06064b9>] xfs_blkdev_issue_flush+0x19/0x20 [xfs]
>>>>>>>>> [11279.498231] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 [xfs]
>>>>>>>>> [11279.498238] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>>>>> [11279.498242] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [11279.498246] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20
>>>>>>>>> [11279.498250] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21
>>>>>>>>> [11279.498254] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [11279.498257] INFO: task glusteriotwr1:14950 blocked for more than 120 seconds.
>>>>>>>>> [11279.499789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>> [11279.501343] glusteriotwr1 D ffff97208b6daf70     0 14950      1 0x00000080
>>>>>>>>> [11279.501348] Call Trace:
>>>>>>>>> [11279.501353] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>>>>> [11279.501390] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>> [11279.501396] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>>>>>> [11279.501428] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs]
>>>>>>>>> [11279.501432] [<ffffffffb944ef3f>] generic_write_sync+0x4f/0x70
>>>>>>>>> [11279.501461] [<ffffffffc05f0545>] xfs_file_aio_write+0x155/0x1b0 [xfs]
>>>>>>>>> [11279.501466] [<ffffffffb941a533>] do_sync_write+0x93/0xe0
>>>>>>>>> [11279.501471] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0
>>>>>>>>> [11279.501475] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0
>>>>>>>>> [11279.501479] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [11279.501483] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21
>>>>>>>>> [11279.501489] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [11279.501493] INFO: task glusteriotwr4:14953 blocked for more than 120 seconds.
>>>>>>>>> [11279.503047] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>> [11279.504630] glusteriotwr4 D ffff972499f2bf40     0 14953      1 0x00000080
>>>>>>>>> [11279.504635] Call Trace:
>>>>>>>>> [11279.504640] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>>>>> [11279.504676] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>> [11279.504681] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>>>>>> [11279.504710] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs]
>>>>>>>>> [11279.504714] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>>>>> [11279.504718] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/0x160
>>>>>>>>> [11279.504722] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>>>>>> [11279.504725] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21
>>>>>>>>> [11279.504730] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/0x160
>>>>>>>>> [12127.466494] perf: interrupt took too long (8263 > 8150), lowering kernel.perf_event_max_sample_rate to 24000
>>>>>>>>>
>>>>>>>>> --------------------
>>>>>>>>> I think this is the cause of the massive oVirt performance issues,
>>>>>>>>> irrespective of gluster volume. At the time this happened, I was also
>>>>>>>>> ssh'ed into the host running some rpm query commands. I had just run
>>>>>>>>> rpm -qa | grep glusterfs (to verify which version was actually
>>>>>>>>> installed), and that command took almost 2 minutes to return! Normally
>>>>>>>>> it takes less than 2 seconds, and that is all purely local SSD I/O,
>>>>>>>>> too....
>>>>>>>>>
>>>>>>>>> I'm no expert, but it's my understanding that any time software causes
>>>>>>>>> these kinds of issues, it's a serious bug in that software, even if
>>>>>>>>> it's only mis-handled exceptions. Is this correct?
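(If it helps to narrow down whether the local disks themselves are stalling or it is
purely a gluster-level problem, watching per-device utilization and looking for
processes stuck in uninterruptible sleep during one of these episodes should show it.
This is only a sketch; iostat comes from the sysstat package and device names will
differ per host:

    # per-device latency and %util, refreshed every second
    iostat -x 1 5

    # processes stuck in D state (uninterruptible sleep, usually waiting on I/O)
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

    # recent hung-task reports from the kernel
    dmesg | grep -i "blocked for more than"

If the D-state list keeps showing gluster and VM processes while iostat shows the
backing device pegged near 100% util with very high await, the bottleneck is below
gluster; if the disks look idle during a stall, the problem is higher up the stack.)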
>>>>>>>>>
>>>>>>>>> --Jim
>>>>>>>>>
>>>>>>>>> On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir
<jim(a)palousetech.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> I think this is the profile information for one of the volumes that
>>>>>>>>>> lives on the SSDs and is fully operational with no down/problem disks:
>>>>>>>>>>
>>>>>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data info
>>>>>>>>>> Brick: ovirt2.nwfiber.com:/gluster/brick2/data
>>>>>>>>>> ----------------------------------------------
>>>>>>>>>> Cumulative Stats:
>>>>>>>>>>    Block Size:      256b+      512b+     1024b+
>>>>>>>>>>  No. of Reads:        983       2696       1059
>>>>>>>>>> No. of Writes:          0       1113        302
>>>>>>>>>>
>>>>>>>>>>    Block Size:     2048b+     4096b+     8192b+
>>>>>>>>>>  No. of Reads:        852      88608      53526
>>>>>>>>>> No. of Writes:        522     812340      76257
>>>>>>>>>>
>>>>>>>>>>    Block Size:    16384b+    32768b+    65536b+
>>>>>>>>>>  No. of Reads:      54351     241901      15024
>>>>>>>>>> No. of Writes:      21636       8656       8976
>>>>>>>>>>
>>>>>>>>>>    Block Size:   131072b+
>>>>>>>>>>  No. of Reads:     524156
>>>>>>>>>> No. of Writes:     296071
>>>>>>>>>>
>>>>>>>>>> %-latency  Avg-latency   Min-Latency   Max-Latency   No. of calls  Fop
>>>>>>>>>> ---------  -----------   -----------   -----------   ------------  ----
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          4189  RELEASE
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          1257  RELEASEDIR
>>>>>>>>>>      0.00     46.19 us      12.00 us      187.00 us            69  FLUSH
>>>>>>>>>>      0.00    147.00 us      78.00 us      367.00 us            86  REMOVEXATTR
>>>>>>>>>>      0.00    223.46 us      24.00 us     1166.00 us           149  READDIR
>>>>>>>>>>      0.00    565.34 us      76.00 us     3639.00 us            88  FTRUNCATE
>>>>>>>>>>      0.00    263.28 us      20.00 us    28385.00 us           228  LK
>>>>>>>>>>      0.00     98.84 us       2.00 us      880.00 us          1198  OPENDIR
>>>>>>>>>>      0.00     91.59 us      26.00 us    10371.00 us          3853  STATFS
>>>>>>>>>>      0.00    494.14 us      17.00 us   193439.00 us          1171  GETXATTR
>>>>>>>>>>      0.00    299.42 us      35.00 us     9799.00 us          2044  READDIRP
>>>>>>>>>>      0.00   1965.31 us     110.00 us   382258.00 us           321  XATTROP
>>>>>>>>>>      0.01    113.40 us      24.00 us    61061.00 us          8134  STAT
>>>>>>>>>>      0.01    755.38 us      57.00 us   607603.00 us          3196  DISCARD
>>>>>>>>>>      0.05   2690.09 us      58.00 us  2704761.00 us          3206  OPEN
>>>>>>>>>>      0.10 119978.25 us      97.00 us  9406684.00 us           154  SETATTR
>>>>>>>>>>      0.18    101.73 us      28.00 us   700477.00 us        313379  FSTAT
>>>>>>>>>>      0.23   1059.84 us      25.00 us  2716124.00 us         38255  LOOKUP
>>>>>>>>>>      0.47   1024.11 us      54.00 us  6197164.00 us         81455  FXATTROP
>>>>>>>>>>      1.72   2984.00 us      15.00 us 37098954.00 us        103020  FINODELK
>>>>>>>>>>      5.92  44315.32 us      51.00 us 24731536.00 us         23957  FSYNC
>>>>>>>>>>     13.27   2399.78 us      25.00 us 22089540.00 us        991005  READ
>>>>>>>>>>     37.00   5980.43 us      52.00 us 22099889.00 us       1108976  WRITE
>>>>>>>>>>     41.04   5452.75 us      13.00 us 22102452.00 us       1349053  INODELK
>>>>>>>>>>
>>>>>>>>>> Duration: 10026 seconds
>>>>>>>>>> Data Read: 80046027759 bytes
>>>>>>>>>> Data Written: 44496632320 bytes
>>>>>>>>>>
>>>>>>>>>> Interval 1 Stats:
>>>>>>>>>>    Block Size:      256b+      512b+     1024b+
>>>>>>>>>>  No. of Reads:        983       2696       1059
>>>>>>>>>> No. of Writes:          0        838        185
>>>>>>>>>>
>>>>>>>>>>    Block Size:     2048b+     4096b+     8192b+
>>>>>>>>>>  No. of Reads:        852      85856      51575
>>>>>>>>>> No. of Writes:        382     705802      57812
>>>>>>>>>>
>>>>>>>>>>    Block Size:    16384b+    32768b+    65536b+
>>>>>>>>>>  No. of Reads:      52673     232093      14984
>>>>>>>>>> No. of Writes:      13499       4908       4242
>>>>>>>>>>
>>>>>>>>>>    Block Size:   131072b+
>>>>>>>>>>  No. of Reads:     460040
>>>>>>>>>> No. of Writes:       6411
>>>>>>>>>>
>>>>>>>>>> %-latency  Avg-latency   Min-Latency   Max-Latency   No. of calls  Fop
>>>>>>>>>> ---------  -----------   -----------   -----------   ------------  ----
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          2093  RELEASE
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          1093  RELEASEDIR
>>>>>>>>>>      0.00     53.38 us      26.00 us      111.00 us            16  FLUSH
>>>>>>>>>>      0.00    145.14 us      78.00 us      367.00 us            71  REMOVEXATTR
>>>>>>>>>>      0.00    190.96 us     114.00 us      298.00 us            71  SETATTR
>>>>>>>>>>      0.00    213.38 us      24.00 us     1145.00 us            90  READDIR
>>>>>>>>>>      0.00    263.28 us      20.00 us    28385.00 us           228  LK
>>>>>>>>>>      0.00    101.76 us       2.00 us      880.00 us          1093  OPENDIR
>>>>>>>>>>      0.01     93.60 us      27.00 us    10371.00 us          3090  STATFS
>>>>>>>>>>      0.02    537.47 us      17.00 us   193439.00 us          1038  GETXATTR
>>>>>>>>>>      0.03    297.44 us      35.00 us     9799.00 us          1990  READDIRP
>>>>>>>>>>      0.03   2357.28 us     110.00 us   382258.00 us           253  XATTROP
>>>>>>>>>>      0.04    385.93 us      58.00 us    47593.00 us          2091  OPEN
>>>>>>>>>>      0.04    114.86 us      24.00 us    61061.00 us          7715  STAT
>>>>>>>>>>      0.06    444.59 us      57.00 us   333240.00 us          3053  DISCARD
>>>>>>>>>>      0.42    316.24 us      25.00 us   290728.00 us         29823  LOOKUP
>>>>>>>>>>      0.73    257.92 us      54.00 us   344812.00 us         63296  FXATTROP
>>>>>>>>>>      1.37     98.30 us      28.00 us    67621.00 us        313172  FSTAT
>>>>>>>>>>      1.58   2124.69 us      51.00 us   849200.00 us         16717  FSYNC
>>>>>>>>>>      5.73    162.46 us      52.00 us   748492.00 us        794079  WRITE
>>>>>>>>>>      7.19   2065.17 us      16.00 us 37098954.00 us         78381  FINODELK
>>>>>>>>>>     36.44    886.32 us      25.00 us  2216436.00 us        925421  READ
>>>>>>>>>>     46.30   1178.04 us      13.00 us  1700704.00 us        884635  INODELK
>>>>>>>>>>
>>>>>>>>>> Duration: 7485 seconds
>>>>>>>>>> Data Read: 71250527215 bytes
>>>>>>>>>> Data Written: 5119903744 bytes
>>>>>>>>>>
>>>>>>>>>> Brick: ovirt3.nwfiber.com:/gluster/brick2/data
>>>>>>>>>> ----------------------------------------------
>>>>>>>>>> Cumulative Stats:
>>>>>>>>>> Block Size: 1b+
>>>>>>>>>> No. of Reads: 0
>>>>>>>>>> No. of Writes: 3264419
>>>>>>>>>> %-latency  Avg-latency   Min-Latency   Max-Latency   No. of calls  Fop
>>>>>>>>>> ---------  -----------   -----------   -----------   ------------  ----
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us            90  FORGET
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          9462  RELEASE
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          4254  RELEASEDIR
>>>>>>>>>>      0.00     50.52 us      13.00 us      190.00 us            71  FLUSH
>>>>>>>>>>      0.00    186.97 us      87.00 us      713.00 us            86  REMOVEXATTR
>>>>>>>>>>      0.00     79.32 us      33.00 us      189.00 us           228  LK
>>>>>>>>>>      0.00    220.98 us     129.00 us      513.00 us            86  SETATTR
>>>>>>>>>>      0.01    259.30 us      26.00 us     2632.00 us           137  READDIR
>>>>>>>>>>      0.02    322.76 us     145.00 us     2125.00 us           321  XATTROP
>>>>>>>>>>      0.03    109.55 us       2.00 us     1258.00 us          1193  OPENDIR
>>>>>>>>>>      0.05     70.21 us      21.00 us      431.00 us          3196  DISCARD
>>>>>>>>>>      0.05    169.26 us      21.00 us     2315.00 us          1545  GETXATTR
>>>>>>>>>>      0.12    176.85 us      63.00 us     2844.00 us          3206  OPEN
>>>>>>>>>>      0.61    303.49 us      90.00 us     3085.00 us          9633  FSTAT
>>>>>>>>>>      2.44    305.66 us      28.00 us     3716.00 us         38230  LOOKUP
>>>>>>>>>>      4.52    266.22 us      55.00 us    53424.00 us         81455  FXATTROP
>>>>>>>>>>      6.96   1397.99 us      51.00 us    64822.00 us         23889  FSYNC
>>>>>>>>>>     16.48     84.74 us      25.00 us     6917.00 us        932592  WRITE
>>>>>>>>>>     30.16    106.90 us      13.00 us  3920189.00 us       1353046  INODELK
>>>>>>>>>>     38.55   1794.52 us      14.00 us 16210553.00 us        103039  FINODELK
>>>>>>>>>>
>>>>>>>>>> Duration: 66562 seconds
>>>>>>>>>> Data Read: 0 bytes
>>>>>>>>>> Data Written: 3264419 bytes
>>>>>>>>>>
>>>>>>>>>> Interval 1 Stats:
>>>>>>>>>> Block Size: 1b+
>>>>>>>>>> No. of Reads: 0
>>>>>>>>>> No. of Writes: 794080
>>>>>>>>>> %-latency  Avg-latency   Min-Latency   Max-Latency   No. of calls  Fop
>>>>>>>>>> ---------  -----------   -----------   -----------   ------------  ----
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          2093  RELEASE
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          1093  RELEASEDIR
>>>>>>>>>>      0.00     70.31 us      26.00 us      125.00 us            16  FLUSH
>>>>>>>>>>      0.00    193.10 us     103.00 us      713.00 us            71  REMOVEXATTR
>>>>>>>>>>      0.01    227.32 us     133.00 us      513.00 us            71  SETATTR
>>>>>>>>>>      0.01     79.32 us      33.00 us      189.00 us           228  LK
>>>>>>>>>>      0.01    259.83 us      35.00 us     1138.00 us            89  READDIR
>>>>>>>>>>      0.03    318.26 us     145.00 us     2047.00 us           253  XATTROP
>>>>>>>>>>      0.04    112.67 us       3.00 us     1258.00 us          1093  OPENDIR
>>>>>>>>>>      0.06    167.98 us      23.00 us     1951.00 us          1014  GETXATTR
>>>>>>>>>>      0.08     70.97 us      22.00 us      431.00 us          3053  DISCARD
>>>>>>>>>>      0.13    183.78 us      66.00 us     2844.00 us          2091  OPEN
>>>>>>>>>>      1.01    303.82 us      90.00 us     3085.00 us          9610  FSTAT
>>>>>>>>>>      3.27    316.59 us      30.00 us     3716.00 us         29820  LOOKUP
>>>>>>>>>>      5.83    265.79 us      59.00 us    53424.00 us         63296  FXATTROP
>>>>>>>>>>      7.95   1373.89 us      51.00 us    64822.00 us         16717  FSYNC
>>>>>>>>>>     23.17    851.99 us      14.00 us 16210553.00 us         78555  FINODELK
>>>>>>>>>>     24.04     87.44 us      27.00 us     6917.00 us        794081  WRITE
>>>>>>>>>>     34.36    111.91 us      14.00 us   984871.00 us        886790  INODELK
>>>>>>>>>>
>>>>>>>>>> Duration: 7485 seconds
>>>>>>>>>> Data Read: 0 bytes
>>>>>>>>>> Data Written: 794080 bytes
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----------------------
>>>>>>>>>> Here is the data from the volume that is backed by the SSHDs and has
>>>>>>>>>> one failed disk:
>>>>>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data-hdd info
>>>>>>>>>> Brick: 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>>>>> --------------------------------------------
>>>>>>>>>> Cumulative Stats:
>>>>>>>>>>    Block Size:      256b+      512b+     1024b+
>>>>>>>>>>  No. of Reads:       1702         86         16
>>>>>>>>>> No. of Writes:          0        767         71
>>>>>>>>>>
>>>>>>>>>>    Block Size:     2048b+     4096b+     8192b+
>>>>>>>>>>  No. of Reads:         19      51841       2049
>>>>>>>>>> No. of Writes:         76      60668      35727
>>>>>>>>>>
>>>>>>>>>>    Block Size:    16384b+    32768b+    65536b+
>>>>>>>>>>  No. of Reads:       1744        639       1088
>>>>>>>>>> No. of Writes:       8524       2410       1285
>>>>>>>>>>
>>>>>>>>>>    Block Size:   131072b+
>>>>>>>>>>  No. of Reads:     771999
>>>>>>>>>> No. of Writes:      29584
>>>>>>>>>>
>>>>>>>>>> %-latency  Avg-latency   Min-Latency   Max-Latency   No. of calls  Fop
>>>>>>>>>> ---------  -----------   -----------   -----------   ------------  ----
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          2902  RELEASE
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          1517  RELEASEDIR
>>>>>>>>>>      0.00    197.00 us     197.00 us      197.00 us             1  FTRUNCATE
>>>>>>>>>>      0.00     70.24 us      16.00 us      758.00 us            51  FLUSH
>>>>>>>>>>      0.00    143.93 us      82.00 us      305.00 us            57  REMOVEXATTR
>>>>>>>>>>      0.00    178.63 us     105.00 us      712.00 us            60  SETATTR
>>>>>>>>>>      0.00     67.30 us      19.00 us      572.00 us           555  LK
>>>>>>>>>>      0.00    322.80 us      23.00 us     4673.00 us           138  READDIR
>>>>>>>>>>      0.00    336.56 us     106.00 us    11994.00 us           237  XATTROP
>>>>>>>>>>      0.00     84.70 us      28.00 us     1071.00 us          3469  STATFS
>>>>>>>>>>      0.01    387.75 us       2.00 us   146017.00 us          1467  OPENDIR
>>>>>>>>>>      0.01    148.59 us      21.00 us    64374.00 us          4454  STAT
>>>>>>>>>>      0.02    783.02 us      16.00 us    93502.00 us          1902  GETXATTR
>>>>>>>>>>      0.03   1516.10 us      17.00 us   210690.00 us          1364  ENTRYLK
>>>>>>>>>>      0.03   2555.47 us     300.00 us   674454.00 us          1064  READDIRP
>>>>>>>>>>      0.07     85.74 us      19.00 us    68340.00 us         62849  FSTAT
>>>>>>>>>>      0.07   1978.12 us      59.00 us   202596.00 us          2729  OPEN
>>>>>>>>>>      0.22    708.57 us      15.00 us   394799.00 us         25447  LOOKUP
>>>>>>>>>>      5.94   2331.74 us      15.00 us  1099530.00 us        207534  FINODELK
>>>>>>>>>>      7.31   8311.75 us      58.00 us  1800216.00 us         71668  FXATTROP
>>>>>>>>>>     12.49   7735.19 us      51.00 us  3595513.00 us        131642  WRITE
>>>>>>>>>>     17.70    957.08 us      16.00 us 13700466.00 us       1508160  INODELK
>>>>>>>>>>     24.55   2546.43 us      26.00 us  5077347.00 us        786060  READ
>>>>>>>>>>     31.56  49699.15 us      47.00 us  3746331.00 us         51777  FSYNC
>>>>>>>>>>
>>>>>>>>>> Duration: 10101 seconds
>>>>>>>>>> Data Read: 101562897361 bytes
>>>>>>>>>> Data Written: 4834450432 bytes
>>>>>>>>>>
>>>>>>>>>> Interval 0 Stats:
>>>>>>>>>>    Block Size:      256b+      512b+     1024b+
>>>>>>>>>>  No. of Reads:       1702         86         16
>>>>>>>>>> No. of Writes:          0        767         71
>>>>>>>>>>
>>>>>>>>>>    Block Size:     2048b+     4096b+     8192b+
>>>>>>>>>>  No. of Reads:         19      51841       2049
>>>>>>>>>> No. of Writes:         76      60668      35727
>>>>>>>>>>
>>>>>>>>>>    Block Size:    16384b+    32768b+    65536b+
>>>>>>>>>>  No. of Reads:       1744        639       1088
>>>>>>>>>> No. of Writes:       8524       2410       1285
>>>>>>>>>>
>>>>>>>>>>    Block Size:   131072b+
>>>>>>>>>>  No. of Reads:     771999
>>>>>>>>>> No. of Writes:      29584
>>>>>>>>>>
>>>>>>>>>> %-latency  Avg-latency   Min-Latency   Max-Latency   No. of calls  Fop
>>>>>>>>>> ---------  -----------   -----------   -----------   ------------  ----
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          2902  RELEASE
>>>>>>>>>>      0.00      0.00 us       0.00 us        0.00 us          1517  RELEASEDIR
>>>>>>>>>>      0.00    197.00 us     197.00 us      197.00 us             1  FTRUNCATE
>>>>>>>>>>      0.00     70.24 us      16.00 us      758.00 us            51  FLUSH
>>>>>>>>>>      0.00    143.93 us      82.00 us      305.00 us            57  REMOVEXATTR
>>>>>>>>>>      0.00    178.63 us     105.00 us      712.00 us            60  SETATTR
>>>>>>>>>>      0.00     67.30 us      19.00 us      572.00 us           555  LK
>>>>>>>>>>      0.00    322.80 us      23.00 us     4673.00 us           138  READDIR
>>>>>>>>>>      0.00    336.56 us     106.00 us    11994.00 us           237  XATTROP
>>>>>>>>>>      0.00     84.70 us      28.00 us     1071.00 us          3469  STATFS
>>>>>>>>>>      0.01    387.75 us       2.00 us   146017.00 us          1467  OPENDIR
>>>>>>>>>>      0.01    148.59 us      21.00 us    64374.00 us          4454  STAT
>>>>>>>>>>      0.02    783.02 us      16.00 us    93502.00 us          1902  GETXATTR
>>>>>>>>>>      0.03   1516.10 us      17.00 us   210690.00 us          1364  ENTRYLK
>>>>>>>>>>      0.03   2555.47 us     300.00 us   674454.00 us          1064  READDIRP
>>>>>>>>>>      0.07     85.73 us      19.00 us    68340.00 us         62849  FSTAT
>>>>>>>>>>      0.07   1978.12 us      59.00 us   202596.00 us          2729  OPEN
>>>>>>>>>>      0.22    708.57 us      15.00 us   394799.00 us         25447  LOOKUP
>>>>>>>>>>      5.94   2334.57 us      15.00 us  1099530.00 us        207534  FINODELK
>>>>>>>>>>      7.31   8311.49 us      58.00 us  1800216.00 us         71668  FXATTROP
>>>>>>>>>>     12.49   7735.32 us      51.00 us  3595513.00 us        131642  WRITE
>>>>>>>>>>     17.71    957.08 us      16.00 us 13700466.00 us       1508160  INODELK
>>>>>>>>>>     24.56   2546.42 us      26.00 us  5077347.00 us        786060  READ
>>>>>>>>>>     31.54  49651.63 us      47.00 us  3746331.00 us         51777  FSYNC
>>>>>>>>>>
>>>>>>>>>> Duration: 10101 seconds
>>>>>>>>>> Data Read: 101562897361 bytes
>>>>>>>>>> Data Written: 4834450432 bytes
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir
<
>>>>>>>>>> jim(a)palousetech.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you for your response.
>>>>>>>>>>>
>>>>>>>>>>> I have 4 gluster volumes. Three are replica 2 + arbiter: the replica
>>>>>>>>>>> bricks are on ovirt1 and ovirt2, with the arbiter on ovirt3. The 4th
>>>>>>>>>>> volume is replica 3, with a brick on all three ovirt machines.
>>>>>>>>>>>
>>>>>>>>>>> The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD
>>>>>>>>>>> (the same in all three machines). On ovirt3, the SSHD has reported
>>>>>>>>>>> hard IO failures, and that brick is offline. However, the other two
>>>>>>>>>>> replicas are fully operational (although they still show contents in
>>>>>>>>>>> the heal info command that won't go away, but that may be the case
>>>>>>>>>>> until I replace the failed disk).
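(For when the replacement drive arrives: my understanding is that a dead brick can be
swapped out with gluster's replace-brick once a new filesystem is mounted for it. This
is only a sketch; the volume name is data-hdd, the old brick path is copied from the
heal info output (172.172.1.13 assumed to be ovirt3), and the new brick path is just a
placeholder, so check it against the gluster docs before running anything:

    # after partitioning/formatting/mounting the new disk on ovirt3:
    gluster volume replace-brick data-hdd \
        172.172.1.13:/gluster/brick3/data-hdd \
        172.172.1.13:/gluster/brick3new/data-hdd \
        commit force

    # self-heal then repopulates the new brick; progress can be watched with:
    gluster volume heal data-hdd info

The replacement brick path has to differ from the old one, hence the placeholder
directory name.)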
>>>>>>>>>>>
>>>>>>>>>>> What is bothering me is that ALL 4 gluster volumes are showing
>>>>>>>>>>> horrible performance issues. At this point, with the bad disk
>>>>>>>>>>> completely offlined, I would expect gluster to perform at normal
>>>>>>>>>>> speed, but that is definitely not the case.
>>>>>>>>>>>
>>>>>>>>>>> I've also noticed that the performance hits seem to come in waves:
>>>>>>>>>>> things work acceptably (but slowly) for a while, then suddenly it's
>>>>>>>>>>> as if all disk IO on all volumes (including the non-gluster local OS
>>>>>>>>>>> disk volumes for the hosts) pauses for about 30 seconds, then IO
>>>>>>>>>>> resumes again. During those times, I start getting "VM not
>>>>>>>>>>> responding" and "host not responding" notices, and the applications
>>>>>>>>>>> have major issues.
>>>>>>>>>>>
>>>>>>>>>>> I've shut down most of my VMs and am down to just my essential core
>>>>>>>>>>> VMs (I shed about 75% of my VMs). I'm still experiencing the same
>>>>>>>>>>> issues.
>>>>>>>>>>>
>>>>>>>>>>> Am I correct in believing that once the failed disk is brought
>>>>>>>>>>> offline, performance should return to normal?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 29, 2018 at 1:27 PM, Alex K <
>>>>>>>>>>> rightkicktech(a)gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I would check disk status and the accessibility of the mount points
>>>>>>>>>>>> where your gluster volumes reside.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 29, 2018, 22:28 Jim Kusznir
<jim(a)palousetech.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On one ovirt server, I'm now seeing these messages:
>>>>>>>>>>>>> [56474.239725] blk_update_request: 63 callbacks suppressed
>>>>>>>>>>>>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>>>>>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>>>>>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>>>>>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
>>>>>>>>>>>>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
>>>>>>>>>>>>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
>>>>>>>>>>>>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>>>>>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>>>>>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>>>>>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
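(To pin down which device dm-2 actually is -- it is a device-mapper node, most likely
one of the LVM logical volumes -- something along these lines should show the mapping
and the physical disk underneath; the column list is just an example:

    ls -l /dev/mapper/ | grep dm-2            # which LVM/multipath name points at dm-2
    lsblk -o NAME,KNAME,TYPE,SIZE,MOUNTPOINT  # full block-device tree
    dmsetup ls --tree                         # device-mapper dependency tree

If dm-2 turns out to sit on the failing SSHD, these I/O errors would line up with the
dead brick rather than indicating a second bad disk.)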
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 29, 2018 at 11:59 AM, Jim
Kusznir <
>>>>>>>>>>>>> jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>>>>>>>>>>>>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>>>>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>>>>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>>>>> (appears a lot).
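(That error usually just means ovsdb-server isn't running on that host, so the
/var/run/openvswitch/db.sock socket never gets created. A quick sanity check, assuming
the stock service name on the 4.2 node:

    systemctl status openvswitch
    ls -l /var/run/openvswitch/db.sock
    journalctl -u openvswitch --since "1 hour ago"

If Open vSwitch networking isn't actually in use on this cluster, the message is
probably noise from the 4.2 host tooling rather than related to the storage problem.)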
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also found, in the ssh session on that host, some sysv warnings
>>>>>>>>>>>>>> about the backing disk for one of the gluster volumes (straight
>>>>>>>>>>>>>> replica 3). The glusterfs process for that disk on that machine
>>>>>>>>>>>>>> went offline. It's my understanding that the volume should
>>>>>>>>>>>>>> continue to work with the other two machines while I attempt to
>>>>>>>>>>>>>> replace that disk, right? Attempted writes (touching an empty
>>>>>>>>>>>>>> file) can take 15 seconds; repeating it later is much faster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Gluster generates a bunch of different log files; I don't know
>>>>>>>>>>>>>> which ones you want, or from which machine(s).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> How do I do "volume profiling"?
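(For the logs: as far as I can tell, the client/mount logs live under
/var/log/glusterfs/ and are named after the mount point, the brick logs are under
/var/log/glusterfs/bricks/, and glustershd.log covers the self-heal daemon. Volume
profiling appears to be enabled per volume roughly like this -- a sketch using
data-hdd as the example volume:

    gluster volume profile data-hdd start   # begin collecting per-FOP latency stats
    gluster volume profile data-hdd info    # dump cumulative and interval stats
    gluster volume profile data-hdd stop    # stop collecting when finished

The info output is what produces the %-latency / Fop tables quoted elsewhere in this
thread.)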
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 29, 2018 at 11:53 AM,
Sahina Bose <
>>>>>>>>>>>>>> sabose(a)redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you see errors reported in the mount logs for the volume? If
>>>>>>>>>>>>>>> so, could you attach the logs? Any issues with your underlying
>>>>>>>>>>>>>>> disks? Can you also attach the output of volume profiling?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, May 30, 2018 at 12:13
AM, Jim Kusznir <
>>>>>>>>>>>>>>> jim(a)palousetech.com>
wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ok, things have gotten MUCH worse this morning. I'm getting
>>>>>>>>>>>>>>>> random errors from VMs; right now about a third of my VMs have
>>>>>>>>>>>>>>>> been paused due to storage issues, and most of the remaining VMs
>>>>>>>>>>>>>>>> are not performing well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> At this point, I am in full EMERGENCY mode, as my production
>>>>>>>>>>>>>>>> services are now impacted, and I'm getting calls coming in with
>>>>>>>>>>>>>>>> problems...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd greatly appreciate help...VMs are running VERY slowly (when
>>>>>>>>>>>>>>>> they run), and they are steadily getting worse. I don't know
>>>>>>>>>>>>>>>> why. I was seeing CPU peaks (to 100%) on several VMs, in perfect
>>>>>>>>>>>>>>>> sync, for a few minutes at a time (while the VM became
>>>>>>>>>>>>>>>> unresponsive, and any Linux VMs I was logged into were giving me
>>>>>>>>>>>>>>>> the CPU stuck messages from my original post). Is all this
>>>>>>>>>>>>>>>> storage related?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I also have two different gluster volumes for VM storage, and
>>>>>>>>>>>>>>>> only one had the issues, but now VMs on both are being affected
>>>>>>>>>>>>>>>> at the same time and in the same way.
>>>>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 28, 2018 at
10:50 PM, Sahina Bose <
>>>>>>>>>>>>>>>> sabose(a)redhat.com>
wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Adding gluster-users
to look at the heal issue]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, May 29, 2018
at 9:17 AM, Jim Kusznir <
>>>>>>>>>>>>>>>>>
jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've been having some cluster and gluster performance issues
>>>>>>>>>>>>>>>>>> lately. I also found that my cluster was out of date and was
>>>>>>>>>>>>>>>>>> trying to apply updates (hoping to fix some of these), and
>>>>>>>>>>>>>>>>>> discovered the ovirt 4.1 repos were taken completely offline.
>>>>>>>>>>>>>>>>>> So, I was forced to begin an upgrade to 4.2. According to the
>>>>>>>>>>>>>>>>>> docs I found/read, I needed only to add the new repo, do a yum
>>>>>>>>>>>>>>>>>> update, and reboot, and my hosts would be good (I did the yum
>>>>>>>>>>>>>>>>>> update, and ran engine-setup on my hosted engine). Things
>>>>>>>>>>>>>>>>>> seemed to work relatively well, except for a gluster sync
>>>>>>>>>>>>>>>>>> issue that showed up.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My cluster is a 3-node hyperconverged cluster. I upgraded the
>>>>>>>>>>>>>>>>>> hosted engine first, then engine 3. When engine 3 came back
>>>>>>>>>>>>>>>>>> up, for some reason one of my gluster volumes would not sync.
>>>>>>>>>>>>>>>>>> Here's sample output:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>>>>>>>>>>>>>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ---------
>>>>>>>>>>>>>>>>>> It's been in this state for a couple of days now, and
>>>>>>>>>>>>>>>>>> bandwidth monitoring shows no appreciable data moving. I've
>>>>>>>>>>>>>>>>>> tried repeatedly commanding a full heal from all three nodes
>>>>>>>>>>>>>>>>>> in the cluster. It's always the same files that need healing.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When running gluster volume heal data-hdd statistics, I
>>>>>>>>>>>>>>>>>> sometimes see different information, but always some number of
>>>>>>>>>>>>>>>>>> "heal failed" entries. It shows 0 for split brain.
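(For completeness, the heal-related commands in question look roughly like this; just
a sketch against the data-hdd volume:

    gluster volume heal data-hdd full                    # queue a full self-heal
    gluster volume heal data-hdd info                    # entries still pending heal
    gluster volume heal data-hdd statistics heal-count   # pending count per brick
    gluster volume heal data-hdd info split-brain        # should show 0 entries
    gluster volume status data-hdd                       # confirm bricks and self-heal daemons are online

If one of the bricks is down, entries pending heal to it will keep being listed until
that brick, or its replacement, is back online.)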
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not quite sure what to do. I suspect it may be due to
>>>>>>>>>>>>>>>>>> nodes 1 and 2 still being on the older ovirt/gluster release,
>>>>>>>>>>>>>>>>>> but I'm afraid to upgrade and reboot them until I have a good
>>>>>>>>>>>>>>>>>> gluster sync (I don't want to create a split-brain issue).
>>>>>>>>>>>>>>>>>> How do I proceed with this?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Second issue: I've been experiencing VERY POOR performance on
>>>>>>>>>>>>>>>>>> most of my VMs, to the tune that logging into a Windows 10 VM
>>>>>>>>>>>>>>>>>> via remote desktop can take 5 minutes, and launching
>>>>>>>>>>>>>>>>>> QuickBooks inside said VM can easily take 10 minutes. On some
>>>>>>>>>>>>>>>>>> Linux VMs, I get random messages like this:
>>>>>>>>>>>>>>>>>> Message from syslogd@unifi at May 28 20:39:23 ...
>>>>>>>>>>>>>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mongod:14766]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (the process and
PID are often different)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not quite sure what to do about this either. My initial
>>>>>>>>>>>>>>>>>> thought was to upgrade everything to current and see if it's
>>>>>>>>>>>>>>>>>> still there, but I cannot move forward with that until my
>>>>>>>>>>>>>>>>>> gluster is healed...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>>>>>>