Ok, I finally fixed my oVirt engine and it's trying to run...but it turns
out that it's seeing 60-90% iowait, and it's about to be killed off by the
HA agent because it can't get fully online in time. I did make the
recommended changes about locks, and they don't seem to have helped things
any. The 3rd server (the one with the bad brick) is powered off at this
point, so I have two servers, each with replicas of all volumes, and the
3rd server isn't even responding to pings...
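For anyone following along, a quick way to see where the iowait is coming
from (a generic sketch using standard tools, nothing specific to this setup):

    # per-device utilization and await, refreshed every 2 seconds
    iostat -xz 2
    # per-process I/O, to see whether the gluster/qemu processes are the heavy hitters
    iotop -o
    # the "wa" figure in top's CPU line is the iowait percentage
    top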
--Jim
On Wed, May 30, 2018 at 1:53 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
Ok, I made it through this one finally. It turns out there's a little log
file, "sanlock.log", that had the magic info in it: the ...dom_md/ids file
got corrupted in the gluster crash. I had to recover that from a backup and
then run hosted-engine --reinitialize-lockspace --force, and then it
started working again. Now, for some reason, my hosted engine starts up but
has no networking, and hosted-engine --console doesn't work either...still
looking into this now.
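For the archives, the recovery boiled down to roughly the following. The
ids path is illustrative (the real file lives under the engine storage
domain's dom_md directory), and stopping the HA services first is my
assumption of the safe order rather than something I verified:

    # stop the HA agent/broker while touching the lockspace (assumed safe order)
    systemctl stop ovirt-ha-agent ovirt-ha-broker
    # restore the corrupted ids file from a known-good backup (illustrative path)
    cp /backup/dom_md/ids <engine-storage-domain>/dom_md/ids
    # rebuild the sanlock lockspace
    hosted-engine --reinitialize-lockspace --force
    systemctl start ovirt-ha-broker ovirt-ha-agent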
On Wed, May 30, 2018 at 1:29 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
> So, right now I think this message from vdsm.log captures my blocking
> problem:
>
> 2018-05-30 01:27:22,326-0700 INFO (jsonrpc/5) [vdsm.api] FINISH repoStats return={u'c0acdefb-7d16-48ec-9d76-659b8fe33e2a': {'code': 0, 'actual': True, 'version': 4, 'acquired': False, 'delay': '0.00142076', 'lastCheck': '1.1', 'valid': True}} from=::1,60684, task_id=7da641b6-9a98-4db7-aafb-04ff84228cf5 (api:52)
> 2018-05-30 01:27:22,326-0700 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getStorageRepoStats succeeded in 0.00 seconds (__init__:573)
> 2018-05-30 01:27:23,411-0700 ERROR (vm/50392390) [virt.vm] (vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') The vm start process failed (vm:943)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
>     self._run()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2872, in _run
>     dom.createWithFlags(flags)
>   File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
>     ret = f(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
>     return func(inst, *args, **kwargs)
>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
>     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
> libvirtError: Failed to acquire lock: No space left on device
> 2018-05-30 01:27:23,411-0700 INFO (vm/50392390) [virt.vm] (vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') Changed state to Down: Failed to acquire lock: No space left on device (code=1) (vm:1683)
>
> The strange thing is that I can't figure out which device has no space
> left. The only hits on Google talk about deleting the __DIRECT_IO_TEST__
> file in the storage repo, which I tried, but it did not help. Perhaps one
> of the files in the engine volume stores the locks, and that file is full
> or corrupt?
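>
> For anyone hitting this later: as far as I can tell, this particular "No
> space left on device" comes from sanlock's lockspace rather than from an
> actually full filesystem. A rough sketch of checks (illustrative, not the
> exact commands from my session):
>
>     # confirm no filesystem is actually out of space or inodes
>     df -h ; df -i
>     # see what sanlock thinks of its lockspaces and leases
>     sanlock client status
>     # sanlock's own log usually names the lockspace it is unhappy about
>     tail -n 50 /var/log/sanlock.log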
>
> --Jim
>
> On Wed, May 30, 2018 at 12:25 AM, Jim Kusznir <jim(a)palousetech.com>
> wrote:
>
>> I tried to capture a consistent slice of agent.log from across the
>> cluster. Several systems claim they are starting the engine, but virsh
>> never shows any sign of it. They also complain that a UUID is not found,
>> one that I'm pretty sure is not affiliated with my engine.
>>
>> ---------
>> ovirt1:
>> ---------
>> MainThread::INFO::2018-05-30 00:18:11,628::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineForceStop-EngineDown) sent? sent
>> [root@ovirt1 vdsm]# tail -n 20 ../ovirt-hosted-engine-ha/agent.log
>> MainThread::INFO::2018-05-30 00:14:02,154::hosted_engine::4
>> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Current state EngineStarting (score: 3400)
>> MainThread::INFO::2018-05-30 00:14:02,173::states::776::ovi
>> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Another
>> host already took over..
>> MainThread::INFO::2018-05-30 00:14:02,174::state_decorators
>> ::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>> Timeout cleared while transitioning <class
'ovirt_hosted_engine_ha.agent.states.EngineStarting'>
>> -> <class
'ovirt_hosted_engine_ha.agent.states.EngineForceStop'>
>> MainThread::INFO::2018-05-30 00:14:02,533::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineStarting-EngineForceStop) sent?
>> sent
>> MainThread::INFO::2018-05-30 00:14:02,843::ovf_store::151::
>> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
>> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7
>> 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
>> MainThread::INFO::2018-05-30 00:18:09,236::hosted_engine::4
>> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Current state EngineForceStop (score: 3400)
>> MainThread::INFO::2018-05-30 00:18:09,236::hosted_engine::4
>> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Best remote host
ovirt3.nwfiber.com (id: 3, score: 3400)
>> MainThread::INFO::2018-05-30 00:18:09,253::hosted_engine::1007::
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
>> Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff`
>> MainThread::INFO::2018-05-30 00:18:11,159::hosted_engine::1012::
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
>> stdout:
>> MainThread::INFO::2018-05-30 00:18:11,160::hosted_engine::1013::
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
>> stderr:
>> MainThread::ERROR::2018-05-30 00:18:11,160::hosted_engine::1021::
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
>> Engine VM stopped on localhost
>> MainThread::INFO::2018-05-30 00:18:11,628::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineForceStop-EngineDown) sent? sent
>> MainThread::INFO::2018-05-30 00:22:17,692::hosted_engine::4
>> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Current state EngineDown (score: 3400)
>> MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1
>> 69::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> Global metadata: {'maintenance': False}
>> MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1
>> 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> Host
ovirt2.nwfiber.com (id 2): {'conf_on_shared_storage': True,
>> 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4348
>> (Wed May 30 00:16:46 2018)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4349
>> (Wed May 30 00:16:47 2018)\nconf_on_shared_storage=
>> True\nmaintenance=False\nstate=EngineDown\nstopped=False\n',
>> 'hostname': 'ovirt2.nwfiber.com', 'alive': True,
'host-id': 2,
>> 'engine-status': {'reason': 'bad vm status',
'health': 'bad', 'vm':
>> 'down_unexpected', 'detail': 'Down'}, 'score':
3400, 'stopped': False,
>> 'maintenance': False, 'crc32': '5ee0a557',
'local_conf_timestamp': 4349,
>> 'host-ts': 4348}
>> MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1
>> 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> Host
ovirt3.nwfiber.com (id 3): {'conf_on_shared_storage': True,
>> 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=100036
>> (Wed May 30 00:16:58 2018)\nhost-id=3\nscore=3400\nvm_conf_refresh_time=100037
>> (Wed May 30 00:16:58 2018)\nconf_on_shared_storage=
>> True\nmaintenance=False\nstate=EngineStart\nstopped=False\n',
>> 'hostname': 'ovirt3.nwfiber.com', 'alive': True,
'host-id': 3,
>> 'engine-status': {'reason': 'bad vm status',
'health': 'bad', 'vm':
>> 'down_unexpected', 'detail': 'Down'}, 'score':
3400, 'stopped': False,
>> 'maintenance': False, 'crc32': 'e2fadf31',
'local_conf_timestamp': 100037,
>> 'host-ts': 100036}
>> MainThread::INFO::2018-05-30 00:22:17,713::state_machine::1
>> 77::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> Local (id 1): {'engine-health': {'reason': 'vm not running on
this host',
>> 'health': 'bad', 'vm': 'down', 'detail':
'unknown'}, 'bridge': True,
>> 'mem-free': 30186.0, 'maintenance': False, 'cpu-load':
0.0308, 'gateway':
>> 1.0, 'storage-domain': True}
>> MainThread::INFO::2018-05-30 00:22:17,714::states::508::ovi
>> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
>> down and local host has best score (3400), attempting to start engine VM
>> MainThread::INFO::2018-05-30 00:22:18,309::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineDown-EngineStart) sent? sent
>> MainThread::INFO::2018-05-30 00:22:18,624::ovf_store::151::
>> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
>> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7
>> 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
>>
>> ----------
>> ovirt2
>> ----------
>> MainThread::INFO::2018-05-30 00:08:34,534::hosted_engine::4
>> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Current state EngineUnexpectedlyDown (score: 0)
>> MainThread::INFO::2018-05-30 00:08:34,534::hosted_engine::4
>> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Best remote host
ovirt3.nwfiber.com (id: 3, score: 3400)
>> MainThread::INFO::2018-05-30 00:08:34,545::states::678::ovi
>> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
>> down, local host does not have best score
>> MainThread::INFO::2018-05-30 00:12:40,627::states::693::ovi
>> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is
>> 0 due to unexpected vm shutdown at Wed May 30 00:04:28 2018
>> MainThread::INFO::2018-05-30 00:12:40,628::hosted_engine::4
>> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Current state EngineUnexpectedlyDown (score: 0)
>> MainThread::INFO::2018-05-30 00:12:40,628::hosted_engine::4
>> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Best remote host
ovirt1.nwfiber.com (id: 1, score: 3400)
>> MainThread::INFO::2018-05-30 00:12:40,643::states::678::ovi
>> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
>> down, local host does not have best score
>> MainThread::INFO::2018-05-30 00:12:40,827::ovf_store::151::
>> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
>> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7
>> 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
>> MainThread::INFO::2018-05-30 00:16:46,635::states::693::ovi
>> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is
>> 0 due to unexpected vm shutdown at Wed May 30 00:04:28 2018
>> MainThread::INFO::2018-05-30 00:16:46,636::hosted_engine::4
>> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Current state EngineUnexpectedlyDown (score: 0)
>> MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1
>> 69::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> Global metadata: {'maintenance': False}
>> MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1
>> 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> Host
ovirt1.nwfiber.com (id 1): {'conf_on_shared_storage': True,
>> 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=7614
>> (Wed May 30 00:09:54 2018)\nhost-id=1\nscore=3400\nvm_conf_refresh_time=7614
>> (Wed May 30 00:09:54 2018)\nconf_on_shared_storage=
>> True\nmaintenance=False\nstate=EngineStarting\nstopped=False\n',
>> 'hostname': 'ovirt1.nwfiber.com', 'alive': True,
'host-id': 1,
>> 'engine-status': {'reason': 'vm not running on this
host', 'health': 'bad',
>> 'vm': 'down', 'detail': 'unknown'},
'score': 3400, 'stopped': False,
>> 'maintenance': False, 'crc32': '1fa13e84',
'local_conf_timestamp': 7614,
>> 'host-ts': 7614}
>> MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1
>> 74::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> Host
ovirt3.nwfiber.com (id 3): {'conf_on_shared_storage': True,
>> 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=99541
>> (Wed May 30 00:08:43 2018)\nhost-id=3\nscore=3400\nvm_conf_refresh_time=99542
>> (Wed May 30 00:08:43 2018)\nconf_on_shared_storage=
>> True\nmaintenance=False\nstate=EngineForceStop\nstopped=False\n',
>> 'hostname': 'ovirt3.nwfiber.com', 'alive': True,
'host-id': 3,
>> 'engine-status': {'reason': 'bad vm status',
'health': 'bad', 'vm': 'down',
>> 'detail': 'Down'}, 'score': 3400, 'stopped':
False, 'maintenance': False,
>> 'crc32': '56f87e2c', 'local_conf_timestamp': 99542,
'host-ts': 99541}
>> MainThread::INFO::2018-05-30 00:16:46,648::state_machine::1
>> 77::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
>> Local (id 2): {'engine-health': {'reason': 'bad vm
status', 'health':
>> 'bad', 'vm': 'down_unexpected', 'detail':
'Down'}, 'bridge': True,
>> 'mem-free': 30891.0, 'maintenance': False, 'cpu-load':
0.0186, 'gateway':
>> 1.0, 'storage-domain': True}
>> MainThread::INFO::2018-05-30 00:16:46,919::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineUnexpectedlyDown-EngineDown)
>> sent? sent
>> MainThread::INFO::2018-05-30 00:20:52,918::hosted_engine::4
>> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Current state EngineDown (score: 3400)
>> MainThread::INFO::2018-05-30 00:20:52,918::hosted_engine::4
>> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Best remote host
ovirt1.nwfiber.com (id: 1, score: 3400)
>> MainThread::INFO::2018-05-30 00:20:52,930::states::508::ovi
>> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
>> down and local host has best score (3400), attempting to start engine VM
>> MainThread::INFO::2018-05-30 00:20:53,203::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineDown-EngineStart) sent? sent
>> MainThread::INFO::2018-05-30 00:20:53,393::ovf_store::151::
>> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
>> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7
>> 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
>>
>> --------
>> ovirt3
>> -------
>> MainThread::INFO::2018-05-30 00:21:07,343::hosted_engine::9
>> 29::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> Starting vm using `/usr/sbin/hosted-engine --vm-start`
>> MainThread::INFO::2018-05-30 00:21:08,107::hosted_engine::9
>> 35::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> stdout:
>> MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9
>> 36::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> stderr: Command VM.getStats with args {'vmID':
>> '50392390-4f89-435c-bf3b-8254b58a4ef7'} failed:
>> (code=1, message=Virtual machine does not exist: {'vmId':
>> u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
>>
>> MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9
>> 48::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> Engine VM started on localhost
>> MainThread::INFO::2018-05-30 00:21:08,413::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineStart-EngineStarting) sent? sent
>> [root@ovirt3 images]# tail -n 20 /var/log/ovirt-hosted-engine-ha/agent.log
>> MainThread::INFO::2018-05-30 00:12:51,792::hosted_engine::1012::
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
>> stdout:
>> MainThread::INFO::2018-05-30 00:12:51,792::hosted_engine::1013::
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
>> stderr:
>> MainThread::ERROR::2018-05-30 00:12:51,793::hosted_engine::1021::
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
>> Engine VM stopped on localhost
>> MainThread::INFO::2018-05-30 00:12:52,157::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineForceStop-EngineDown) sent? sent
>> MainThread::INFO::2018-05-30 00:16:58,237::hosted_engine::4
>> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Current state EngineDown (score: 3400)
>> MainThread::INFO::2018-05-30 00:16:58,256::states::508::ovi
>> rt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine
>> down and local host has best score (3400), attempting to start engine VM
>> MainThread::INFO::2018-05-30 00:16:58,595::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineDown-EngineStart) sent? sent
>> MainThread::INFO::2018-05-30 00:16:58,816::ovf_store::151::
>> ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
>> OVF_STORE volume path: /var/run/vdsm/storage/c0acdefb
>> -7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7
>> 887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
>> MainThread::INFO::2018-05-30 00:21:05,082::hosted_engine::4
>> 91::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Current state EngineStart (score: 3400)
>> MainThread::INFO::2018-05-30 00:21:05,083::hosted_engine::4
>> 99::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>> Best remote host
ovirt1.nwfiber.com (id: 1, score: 3400)
>> MainThread::INFO::2018-05-30 00:21:05,125::hosted_engine::9
>> 68::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
>> Ensuring VDSM state is clear for engine VM
>> MainThread::INFO::2018-05-30 00:21:05,141::hosted_engine::9
>> 88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
>> Cleaning state for non-running VM
>> MainThread::INFO::2018-05-30 00:21:07,343::hosted_engine::9
>> 80::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
>> Vdsm state for VM clean
>> MainThread::INFO::2018-05-30 00:21:07,343::hosted_engine::9
>> 29::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> Starting vm using `/usr/sbin/hosted-engine --vm-start`
>> MainThread::INFO::2018-05-30 00:21:08,107::hosted_engine::9
>> 35::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> stdout:
>> MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9
>> 36::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> stderr: Command VM.getStats with args {'vmID':
>> '50392390-4f89-435c-bf3b-8254b58a4ef7'} failed:
>> (code=1, message=Virtual machine does not exist: {'vmId':
>> u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
>>
>> MainThread::INFO::2018-05-30 00:21:08,108::hosted_engine::9
>> 48::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> Engine VM started on localhost
>> MainThread::INFO::2018-05-30 00:21:08,413::brokerlink::68::
>> ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was
>> notification of state_transition (EngineStart-EngineStarting) sent? sent
>>
>> ----------
>> Note that virsh does not show the engine started on this machine...
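>>
>> (The check I mean is a read-only libvirt listing, something like:
>>
>>     virsh -r list --all
>>
>> since, as I understand it, a full read-write virsh session on oVirt
>> hosts requires SASL credentials.)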
>>
>> On Wed, May 30, 2018 at 12:16 AM, Jim Kusznir <jim(a)palousetech.com>
>> wrote:
>>
>>> So, on ovirt2, I ran hosted-engine --vm-status and it took over 2
>>> minutes to return, but two of the 3 machines had current (non-stale)
>>> data:
>>>
>>> [root@ovirt2 ~]# hosted-engine --vm-status
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> conf_on_shared_storage : True
>>> Status up-to-date : False
>>> Hostname : ovirt1.nwfiber.com
>>> Host ID : 1
>>> Engine status : unknown stale-data
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : f562054b
>>> local_conf_timestamp : 7368
>>> Host timestamp : 7367
>>> Extra metadata (valid at timestamp):
>>> metadata_parse_version=1
>>> metadata_feature_version=1
>>> timestamp=7367 (Wed May 30 00:05:47 2018)
>>> host-id=1
>>> score=3400
>>> vm_conf_refresh_time=7368 (Wed May 30 00:05:48 2018)
>>> conf_on_shared_storage=True
>>> maintenance=False
>>> state=EngineStart
>>> stopped=False
>>>
>>>
>>> --== Host 2 status ==--
>>>
>>> conf_on_shared_storage : True
>>> Status up-to-date : True
>>> Hostname : ovirt2.nwfiber.com
>>> Host ID : 2
>>> Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
>>> Score : 0
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : 6822f86a
>>> local_conf_timestamp : 3610
>>> Host timestamp : 3610
>>> Extra metadata (valid at timestamp):
>>> metadata_parse_version=1
>>> metadata_feature_version=1
>>> timestamp=3610 (Wed May 30 00:04:28 2018)
>>> host-id=2
>>> score=0
>>> vm_conf_refresh_time=3610 (Wed May 30 00:04:28 2018)
>>> conf_on_shared_storage=True
>>> maintenance=False
>>> state=EngineUnexpectedlyDown
>>> stopped=False
>>> timeout=Wed Dec 31 17:09:30 1969
>>>
>>>
>>> --== Host 3 status ==--
>>>
>>> conf_on_shared_storage : True
>>> Status up-to-date : True
>>> Hostname : ovirt3.nwfiber.com
>>> Host ID : 3
>>> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : 006512a3
>>> local_conf_timestamp : 99294
>>> Host timestamp : 99294
>>> Extra metadata (valid at timestamp):
>>> metadata_parse_version=1
>>> metadata_feature_version=1
>>> timestamp=99294 (Wed May 30 00:04:36 2018)
>>> host-id=3
>>> score=3400
>>> vm_conf_refresh_time=99294 (Wed May 30 00:04:36 2018)
>>> conf_on_shared_storage=True
>>> maintenance=False
>>> state=EngineStarting
>>> stopped=False
>>> ---------------
>>>
>>> I saw that the HA score was 0 on ovirt2 (where I was focusing my
>>> efforts) and that ovirt3 had a good HA score, so I tried to launch it there:
>>>
>>> [root@ovirt3 ff3d0185-d120-40ee-9792-ebbd5caaca8b]# hosted-engine --vm-start
>>> Command VM.getStats with args {'vmID': '50392390-4f89-435c-bf3b-8254b58a4ef7'} failed:
>>> (code=1, message=Virtual machine does not exist: {'vmId': u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
>>>
>>>
>>> ovirt1 was returning the "ha-agent or storage not online" message, but
>>> now it's hanging (probably for the full 2 minutes), so it may actually
>>> return status this time. I figure if it can't return status, then there's
>>> no point in trying to start the VM there.
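>>>
>>> A quick sanity check on that host, in case the HA agent or broker
>>> themselves are wedged, would be something like:
>>>
>>>     systemctl status ovirt-ha-agent ovirt-ha-broker
>>>     tail -n 50 /var/log/ovirt-hosted-engine-ha/agent.log
>>>     tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log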
>>>
>>>
>>>
>>> On Wed, May 30, 2018 at 12:08 AM, Jim Kusznir <jim(a)palousetech.com>
>>> wrote:
>>>
>>>> The agent.log was sent a few minutes ago in another message. Here's
>>>> the gluster client log for the engine volume:
>>>>
>>>> [2018-05-30 06:31:12.343709] I [MSGID: 108026]
>>>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>>>> 0-engine-replicate-0: performing metadata selfheal on
>>>> 3a430e50-48cb-463c-8657-aab5458ac155
>>>> [2018-05-30 06:31:12.348210] W [MSGID: 108015]
>>>> [afr-self-heal-entry.c:56:afr_selfheal_entry_delete]
>>>> 0-engine-replicate-0: expunging file
117c5234-e6d6-4f03-809b-a4ef6a6fac4d/hosted-engine.metadata
>>>> (38a277cd-113d-4ca8-8f48-fadc944d3272) on engine-client-1
>>>> [2018-05-30 06:31:12.360602] I [MSGID: 108026]
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>>> Completed metadata selfheal on 3a430e50-48cb-463c-8657-aab5458ac155.
>>>> sources=[0] 2 sinks=1
>>>> [2018-05-30 06:31:12.379169] I [MSGID: 108026]
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>>> Completed entry selfheal on 117c5234-e6d6-4f03-809b-a4ef6a6fac4d.
>>>> sources=[0] 2 sinks=1
>>>> [2018-05-30 06:32:54.756729] I [MSGID: 108026]
>>>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>>>> 0-engine-replicate-0: performing metadata selfheal on
>>>> 2ecf0f33-1385-416e-8619-129bdb62b61f
>>>> [2018-05-30 06:32:54.763175] I [MSGID: 108026]
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>>> Completed metadata selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>>> sources=[0] 2 sinks=1
>>>> [2018-05-30 06:37:06.581538] I [MSGID: 108026]
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>>> Completed data selfheal on 06c557a9-b550-4e25-b190-804e50cea65b.
>>>> sources=[0] 2 sinks=1
>>>> [2018-05-30 06:39:55.739302] I [MSGID: 108026]
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>>> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>>> sources=[0] 2 sinks=1
>>>> [2018-05-30 06:44:04.911825] I [MSGID: 108026]
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>>> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>>> sources=[0] 2 sinks=1
>>>> [2018-05-30 06:58:05.123854] I [MSGID: 108026]
>>>> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
>>>> 0-engine-replicate-0: performing metadata selfheal on
>>>> 03ca7a59-e8ee-4a70-bc68-47e972aeb801
>>>> [2018-05-30 06:58:05.131929] I [MSGID: 108026]
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>>> Completed metadata selfheal on 03ca7a59-e8ee-4a70-bc68-47e972aeb801.
>>>> sources=[0] 2 sinks=1
>>>> [2018-05-30 07:01:42.583991] I [MSGID: 108026]
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>>> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>>> sources=[0] 2 sinks=1
>>>> [2018-05-30 07:05:48.777982] I [MSGID: 108026]
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
>>>> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
>>>> sources=[0] 2 sinks=1
>>>>
>>>> --------
>>>> broker log:
>>>> --------
>>>> Thread-1::INFO::2018-05-30 00:07:13,241::ping::60::ping.Ping::(action)
>>>> Successfully pinged 192.168.8.1
>>>> Listener::INFO::2018-05-30 00:07:14,344::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:07:19,356::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Thread-5::INFO::2018-05-30 00:07:22,081::engine_health::2
>>>> 02::engine_health.EngineHealth::(_result_from_stats) VM not running
>>>> on this host, status Down
>>>> Listener::INFO::2018-05-30 00:07:24,369::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:07:29,382::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:07:34,393::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:07:39,406::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:07:44,419::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:07:49,431::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:07:54,444::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:07:59,452::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Thread-2::INFO::2018-05-30 00:08:03,736::mgmt_bridge::62:
>>>> :mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
>>>> Listener::INFO::2018-05-30 00:08:04,460::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:08:09,468::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Thread-4::INFO::2018-05-30 00:08:11,461::cpu_load_no_engi
>>>> ne::116::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) VM
>>>> stats do not contain cpu usage. VM might be down.
>>>> Thread-4::INFO::2018-05-30 00:08:11,461::cpu_load_no_engi
>>>> ne::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System
>>>> load total=0.0169, engine=0.0000, non-engine=0.0169
>>>> Listener::INFO::2018-05-30 00:08:14,481::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Listener::INFO::2018-05-30 00:08:19,490::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>> Thread-5::INFO::2018-05-30 00:08:22,175::engine_health::2
>>>> 02::engine_health.EngineHealth::(_result_from_stats) VM not running
>>>> on this host, status Down
>>>> Thread-1::INFO::2018-05-30 00:08:22,436::ping::60::ping.Ping::(action)
>>>> Successfully pinged 192.168.8.1
>>>> Listener::INFO::2018-05-30 00:08:24,503::storage_broker::304::
>>>> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
>>>> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
>>>>
>>>>
>>>> On Wed, May 30, 2018 at 12:03 AM, Sahina Bose <sabose(a)redhat.com>
>>>> wrote:
>>>>
>>>>> [Adding gluster-users back]
>>>>>
>>>>> Nothing amiss with volume info and status.
>>>>> Can you check the agent.log and broker.log - they will be under
>>>>> /var/log/ovirt-hosted-engine-ha/
>>>>>
>>>>> Also the gluster client logs - under
>>>>> /var/log/glusterfs/rhev-data-center-mnt-glusterSD<volume>.log
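>>>>>
>>>>> For example, something like (with <volume> filled in for your engine
>>>>> volume):
>>>>>
>>>>>     tail -n 200 /var/log/ovirt-hosted-engine-ha/agent.log
>>>>>     tail -n 200 /var/log/ovirt-hosted-engine-ha/broker.log
>>>>>     tail -n 200 /var/log/glusterfs/rhev-data-center-mnt-glusterSD<volume>.log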
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 30, 2018 at 12:08 PM, Jim Kusznir
<jim(a)palousetech.com>
>>>>> wrote:
>>>>>
>>>>>> I believe the gluster data store for the engine is up and working
>>>>>> correctly. The rest are not mounted, as the engine hasn't managed to
>>>>>> start correctly yet. I did perform the copy-a-junk-file test against
>>>>>> the data store: copy a file in, check that it's there on another host,
>>>>>> then delete it and see that it disappeared on the first host; it passed
>>>>>> that test. Here's the info and status. (I have NOT performed the steps
>>>>>> that Krutika and Ravishankar suggested yet, as I don't have my data
>>>>>> volumes working again yet.)
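>>>>>>
>>>>>> For clarity, the junk-file test was roughly this (paths and filenames
>>>>>> are illustrative):
>>>>>>
>>>>>>     # on the first host, write a junk file into the gluster mount
>>>>>>     echo test > <gluster-mountpoint>/junk.txt
>>>>>>     # on a second host, confirm it arrived, then delete it
>>>>>>     ls -l <gluster-mountpoint>/junk.txt && rm <gluster-mountpoint>/junk.txt
>>>>>>     # back on the first host, confirm it is gone
>>>>>>     ls <gluster-mountpoint>/junk.txt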
>>>>>>
>>>>>> [root@ovirt2 images]# gluster volume info
>>>>>>
>>>>>> Volume Name: data
>>>>>> Type: Replicate
>>>>>> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
>>>>>> Status: Started
>>>>>> Snapshot Count: 0
>>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
>>>>>> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
>>>>>> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
>>>>>> Options Reconfigured:
>>>>>> diagnostics.count-fop-hits: on
>>>>>> diagnostics.latency-measurement: on
>>>>>> changelog.changelog: on
>>>>>> geo-replication.ignore-pid-check: on
>>>>>> geo-replication.indexing: on
>>>>>> server.allow-insecure: on
>>>>>> performance.readdir-ahead: on
>>>>>> performance.quick-read: off
>>>>>> performance.read-ahead: off
>>>>>> performance.io-cache: off
>>>>>> performance.stat-prefetch: off
>>>>>> cluster.eager-lock: enable
>>>>>> network.remote-dio: enable
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.server-quorum-type: server
>>>>>> storage.owner-uid: 36
>>>>>> storage.owner-gid: 36
>>>>>> features.shard: on
>>>>>> features.shard-block-size: 512MB
>>>>>> performance.low-prio-threads: 32
>>>>>> cluster.data-self-heal-algorithm: full
>>>>>> cluster.locking-scheme: granular
>>>>>> cluster.shd-wait-qlength: 10000
>>>>>> cluster.shd-max-threads: 8
>>>>>> network.ping-timeout: 30
>>>>>> user.cifs: off
>>>>>> nfs.disable: on
>>>>>> performance.strict-o-direct: on
>>>>>>
>>>>>> Volume Name: data-hdd
>>>>>> Type: Replicate
>>>>>> Volume ID: d342a3ab-16f3-49f0-bbcf-f788be8ac5f1
>>>>>> Status: Started
>>>>>> Snapshot Count: 0
>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: 172.172.1.11:/gluster/brick3/data-hdd
>>>>>> Brick2: 172.172.1.12:/gluster/brick3/data-hdd
>>>>>> Brick3: 172.172.1.13:/gluster/brick3/data-hdd
>>>>>> Options Reconfigured:
>>>>>> diagnostics.count-fop-hits: on
>>>>>> diagnostics.latency-measurement: on
>>>>>> network.ping-timeout: 30
>>>>>> server.allow-insecure: on
>>>>>> storage.owner-gid: 36
>>>>>> storage.owner-uid: 36
>>>>>> user.cifs: off
>>>>>> features.shard: on
>>>>>> cluster.shd-wait-qlength: 10000
>>>>>> cluster.shd-max-threads: 8
>>>>>> cluster.locking-scheme: granular
>>>>>> cluster.data-self-heal-algorithm: full
>>>>>> cluster.server-quorum-type: server
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.eager-lock: enable
>>>>>> network.remote-dio: enable
>>>>>> performance.low-prio-threads: 32
>>>>>> performance.stat-prefetch: off
>>>>>> performance.io-cache: off
>>>>>> performance.read-ahead: off
>>>>>> performance.quick-read: off
>>>>>> changelog.changelog: on
>>>>>> geo-replication.ignore-pid-check: on
>>>>>> geo-replication.indexing: on
>>>>>> transport.address-family: inet
>>>>>> performance.readdir-ahead: on
>>>>>>
>>>>>> Volume Name: engine
>>>>>> Type: Replicate
>>>>>> Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de
>>>>>> Status: Started
>>>>>> Snapshot Count: 0
>>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine
>>>>>> Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine
>>>>>> Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter)
>>>>>> Options Reconfigured:
>>>>>> changelog.changelog: on
>>>>>> geo-replication.ignore-pid-check: on
>>>>>> geo-replication.indexing: on
>>>>>> performance.readdir-ahead: on
>>>>>> performance.quick-read: off
>>>>>> performance.read-ahead: off
>>>>>> performance.io-cache: off
>>>>>> performance.stat-prefetch: off
>>>>>> cluster.eager-lock: enable
>>>>>> network.remote-dio: off
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.server-quorum-type: server
>>>>>> storage.owner-uid: 36
>>>>>> storage.owner-gid: 36
>>>>>> features.shard: on
>>>>>> features.shard-block-size: 512MB
>>>>>> performance.low-prio-threads: 32
>>>>>> cluster.data-self-heal-algorithm: full
>>>>>> cluster.locking-scheme: granular
>>>>>> cluster.shd-wait-qlength: 10000
>>>>>> cluster.shd-max-threads: 6
>>>>>> network.ping-timeout: 30
>>>>>> user.cifs: off
>>>>>> nfs.disable: on
>>>>>> performance.strict-o-direct: on
>>>>>>
>>>>>> Volume Name: iso
>>>>>> Type: Replicate
>>>>>> Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92
>>>>>> Status: Started
>>>>>> Snapshot Count: 0
>>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso
>>>>>> Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso
>>>>>> Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter)
>>>>>> Options Reconfigured:
>>>>>> performance.strict-o-direct: on
>>>>>> nfs.disable: on
>>>>>> user.cifs: off
>>>>>> network.ping-timeout: 30
>>>>>> cluster.shd-max-threads: 6
>>>>>> cluster.shd-wait-qlength: 10000
>>>>>> cluster.locking-scheme: granular
>>>>>> cluster.data-self-heal-algorithm: full
>>>>>> performance.low-prio-threads: 32
>>>>>> features.shard-block-size: 512MB
>>>>>> features.shard: on
>>>>>> storage.owner-gid: 36
>>>>>> storage.owner-uid: 36
>>>>>> cluster.server-quorum-type: server
>>>>>> cluster.quorum-type: auto
>>>>>> network.remote-dio: off
>>>>>> cluster.eager-lock: enable
>>>>>> performance.stat-prefetch: off
>>>>>> performance.io-cache: off
>>>>>> performance.read-ahead: off
>>>>>> performance.quick-read: off
>>>>>> performance.readdir-ahead: on
>>>>>>
>>>>>>
>>>>>> [root@ovirt2 images]# gluster volume status
>>>>>> Status of volume: data
>>>>>> Gluster process TCP Port RDMA Port
>>>>>> Online Pid
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> Brick ovirt1.nwfiber.com:/gluster/brick2/da
>>>>>> ta 49152 0
Y
>>>>>> 3226
>>>>>> Brick ovirt2.nwfiber.com:/gluster/brick2/da
>>>>>> ta 49152 0
Y
>>>>>> 2967
>>>>>> Brick ovirt3.nwfiber.com:/gluster/brick2/da
>>>>>> ta 49152 0
Y
>>>>>> 2554
>>>>>> Self-heal Daemon on localhost N/A N/A
Y
>>>>>> 4818
>>>>>> Self-heal Daemon on
ovirt3.nwfiber.com N/A N/A
Y
>>>>>> 17853
>>>>>> Self-heal Daemon on
ovirt1.nwfiber.com N/A N/A
Y
>>>>>> 4771
>>>>>>
>>>>>> Task Status of Volume data
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> There are no active volume tasks
>>>>>>
>>>>>> Status of volume: data-hdd
>>>>>> Gluster process TCP Port RDMA Port
>>>>>> Online Pid
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd 49153 0
Y
>>>>>> 3232
>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd 49153 0
Y
>>>>>> 2976
>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd N/A N/A
N
>>>>>> N/A
>>>>>> NFS Server on localhost N/A N/A
N
>>>>>> N/A
>>>>>> Self-heal Daemon on localhost N/A N/A
Y
>>>>>> 4818
>>>>>> NFS Server on
ovirt3.nwfiber.com N/A N/A
N
>>>>>> N/A
>>>>>> Self-heal Daemon on
ovirt3.nwfiber.com N/A N/A
Y
>>>>>> 17853
>>>>>> NFS Server on
ovirt1.nwfiber.com N/A N/A
N
>>>>>> N/A
>>>>>> Self-heal Daemon on
ovirt1.nwfiber.com N/A N/A
Y
>>>>>> 4771
>>>>>>
>>>>>> Task Status of Volume data-hdd
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> There are no active volume tasks
>>>>>>
>>>>>> Status of volume: engine
>>>>>> Gluster process TCP Port RDMA Port
>>>>>> Online Pid
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> Brick ovirt1.nwfiber.com:/gluster/brick1/en
>>>>>> gine 49154 0
Y
>>>>>> 3239
>>>>>> Brick ovirt2.nwfiber.com:/gluster/brick1/en
>>>>>> gine 49154 0
Y
>>>>>> 2982
>>>>>> Brick ovirt3.nwfiber.com:/gluster/brick1/en
>>>>>> gine 49154 0
Y
>>>>>> 2578
>>>>>> Self-heal Daemon on localhost N/A N/A
Y
>>>>>> 4818
>>>>>> Self-heal Daemon on
ovirt3.nwfiber.com N/A N/A
Y
>>>>>> 17853
>>>>>> Self-heal Daemon on
ovirt1.nwfiber.com N/A N/A
Y
>>>>>> 4771
>>>>>>
>>>>>> Task Status of Volume engine
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> There are no active volume tasks
>>>>>>
>>>>>> Status of volume: iso
>>>>>> Gluster process TCP Port RDMA Port
>>>>>> Online Pid
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> Brick ovirt1.nwfiber.com:/gluster/brick4/is
>>>>>> o 49155 0
Y
>>>>>> 3247
>>>>>> Brick ovirt2.nwfiber.com:/gluster/brick4/is
>>>>>> o 49155 0
Y
>>>>>> 2990
>>>>>> Brick ovirt3.nwfiber.com:/gluster/brick4/is
>>>>>> o 49155 0
Y
>>>>>> 2580
>>>>>> Self-heal Daemon on localhost N/A N/A
Y
>>>>>> 4818
>>>>>> Self-heal Daemon on
ovirt3.nwfiber.com N/A N/A
Y
>>>>>> 17853
>>>>>> Self-heal Daemon on
ovirt1.nwfiber.com N/A N/A
Y
>>>>>> 4771
>>>>>>
>>>>>> Task Status of Volume iso
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> There are no active volume tasks
>>>>>>
>>>>>>
>>>>>> On Tue, May 29, 2018 at 11:30 PM, Sahina Bose
<sabose(a)redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 30, 2018 at 10:42 AM, Jim Kusznir
<jim(a)palousetech.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hosted-engine --deploy failed (it would not come up on my existing
>>>>>>>> gluster storage). However, I realized no changes were written to my
>>>>>>>> existing storage, so I went back to trying to get my old engine
>>>>>>>> running.
>>>>>>>>
>>>>>>>> hosted-engine --vm-status is now taking a very long time (5+ minutes)
>>>>>>>> to return, and it returns stale information everywhere. I thought
>>>>>>>> perhaps the lockspace is corrupt, so I tried to clean that and the
>>>>>>>> metadata, but both are failing (--clean-metadata has hung and I can't
>>>>>>>> even ctrl-c out of it).
>>>>>>>>
>>>>>>>> How can I reinitialize all the lockspace/metadata safely? There is no
>>>>>>>> engine or VMs running currently...
>>>>>>>>
>>>>>>>
>>>>>>> I think the first thing to make sure of is that your storage is up
>>>>>>> and running. Can you mount the gluster volumes and access the
>>>>>>> contents there?
>>>>>>> Please provide the gluster volume info and gluster volume status
>>>>>>> output for the volumes that you're using.
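>>>>>>>
>>>>>>> For example, something along these lines (the server name is a
>>>>>>> placeholder):
>>>>>>>
>>>>>>>     mount -t glusterfs <server>:/engine /mnt/enginetest
>>>>>>>     ls -l /mnt/enginetest
>>>>>>>     gluster volume info
>>>>>>>     gluster volume status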
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> --Jim
>>>>>>>>
>>>>>>>> On Tue, May 29, 2018 at 9:33 PM, Jim Kusznir
<jim(a)palousetech.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Well, things went from bad to very, very bad...
>>>>>>>>>
>>>>>>>>> It appears that during one of the 2-minute lockups, the fencing
>>>>>>>>> agents decided that another node in the cluster was down. As a
>>>>>>>>> result, 2 of the 3 nodes were simultaneously reset by a
>>>>>>>>> fencing-agent reboot. After the nodes came back up, the engine
>>>>>>>>> would not start. All running VMs (including VMs on the 3rd node,
>>>>>>>>> which was not rebooted) crashed.
>>>>>>>>>
>>>>>>>>> I've now been working for about 3 hours trying to get the engine
>>>>>>>>> to come up, and I don't know why it won't start. hosted-engine
>>>>>>>>> --vm-start says it's starting, but it doesn't start (virsh doesn't
>>>>>>>>> show any VMs running). I'm currently running --deploy, as I've run
>>>>>>>>> out of other options. I hope this will let me re-import all my
>>>>>>>>> existing VMs and start them back up once everything comes back up.
>>>>>>>>>
>>>>>>>>> I do have an unverified geo-rep backup, but I don't know if it is
>>>>>>>>> a good backup (there were several prior messages to this list, but
>>>>>>>>> I didn't get replies to my questions; it was running in what I
>>>>>>>>> believe to be a "strange" state, and the data directories are
>>>>>>>>> larger than their source).
>>>>>>>>>
>>>>>>>>> I'll see if my --deploy works, and if not, I'll be back with
>>>>>>>>> another message/help request.
>>>>>>>>>
>>>>>>>>> When the dust settles and I'm at least minimally functional again,
>>>>>>>>> I really want to understand why all these technologies designed to
>>>>>>>>> offer redundancy conspired to reduce uptime and create failures
>>>>>>>>> where there otherwise weren't any. I thought that a hosted engine,
>>>>>>>>> 3 oVirt servers, and glusterfs with at minimum replica 2+arbiter
>>>>>>>>> or replica 3 should have offered strong resilience against server
>>>>>>>>> or disk failure, and should have prevented or recovered from data
>>>>>>>>> corruption. Instead, all of the above happened (and once I get my
>>>>>>>>> cluster back up, I still have to try to recover my webserver VM,
>>>>>>>>> which won't boot due to XFS journal corruption created during the
>>>>>>>>> gluster crashes). I think a lot of these issues were rooted in the
>>>>>>>>> upgrade from 4.1 to 4.2.
>>>>>>>>>
>>>>>>>>> --Jim
>>>>>>>>>
>>>>>>>>> On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir
<jim(a)palousetech.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> I also finally found the following in my system
log on one
>>>>>>>>>> server:
>>>>>>>>>>
>>>>>>>>>> [10679.524491] INFO: task glusterclogro:14933
blocked for more
>>>>>>>>>> than 120 seconds.
>>>>>>>>>> [10679.525826] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>>> disables this message.
>>>>>>>>>> [10679.527144] glusterclogro D ffff97209832bf40
0 14933
>>>>>>>>>> 1 0x00000080
>>>>>>>>>> [10679.527150] Call Trace:
>>>>>>>>>> [10679.527161] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>>> [10679.527218] [<ffffffffc060e388>]
>>>>>>>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>>> [10679.527225] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>>>>> [10679.527254] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>>>>> [xfs]
>>>>>>>>>> [10679.527260] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>>>>> [10679.527268] [<ffffffffb992076f>] ?
>>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [10679.527271] [<ffffffffb944f3d0>]
SyS_fsync+0x10/0x20
>>>>>>>>>> [10679.527275] [<ffffffffb992082f>]
>>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>>> [10679.527279] [<ffffffffb992077b>] ?
>>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [10679.527283] INFO: task glusterposixfsy:14941
blocked for more
>>>>>>>>>> than 120 seconds.
>>>>>>>>>> [10679.528608] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>>> disables this message.
>>>>>>>>>> [10679.529956] glusterposixfsy D ffff972495f84f10
0 14941
>>>>>>>>>> 1 0x00000080
>>>>>>>>>> [10679.529961] Call Trace:
>>>>>>>>>> [10679.529966] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>>> [10679.530003] [<ffffffffc060e388>]
>>>>>>>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>>> [10679.530008] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>>>>> [10679.530038] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>>>>> [xfs]
>>>>>>>>>> [10679.530042] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>>>>> [10679.530046] [<ffffffffb992076f>] ?
>>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [10679.530050] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>>>>>> [10679.530054] [<ffffffffb992082f>]
>>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>>> [10679.530058] [<ffffffffb992077b>] ?
>>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [10679.530062] INFO: task glusteriotwr13:15486
blocked for more
>>>>>>>>>> than 120 seconds.
>>>>>>>>>> [10679.531805] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>>> disables this message.
>>>>>>>>>> [10679.533732] glusteriotwr13 D ffff9720a83f0000
0 15486
>>>>>>>>>> 1 0x00000080
>>>>>>>>>> [10679.533738] Call Trace:
>>>>>>>>>> [10679.533747] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>>> [10679.533799] [<ffffffffc060e388>]
>>>>>>>>>> _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>>> [10679.533806] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>>>>> [10679.533846] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>>>>> [xfs]
>>>>>>>>>> [10679.533852] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>>>>> [10679.533858] [<ffffffffb992076f>] ?
>>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [10679.533863] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>>>>>> [10679.533868] [<ffffffffb992082f>]
>>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>>> [10679.533873] [<ffffffffb992077b>] ?
>>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [10919.512757] INFO: task glusterclogro:14933
blocked for more
>>>>>>>>>> than 120 seconds.
>>>>>>>>>> [10919.514714] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>>> disables this message.
>>>>>>>>>> [10919.516663] glusterclogro D ffff97209832bf40
0 14933
>>>>>>>>>> 1 0x00000080
>>>>>>>>>> [10919.516677] Call Trace:
>>>>>>>>>> [10919.516690] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>>> [10919.516696] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>>>>>> [10919.516703] [<ffffffffb951cc04>] ?
blk_finish_plug+0x14/0x40
>>>>>>>>>> [10919.516768] [<ffffffffc05e9224>] ?
>>>>>>>>>> _xfs_buf_ioapply+0x334/0x460 [xfs]
>>>>>>>>>> [10919.516774] [<ffffffffb991432d>]
>>>>>>>>>> wait_for_completion+0xfd/0x140
>>>>>>>>>> [10919.516782] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>>>>> [10919.516821] [<ffffffffc05eb0a3>] ?
_xfs_buf_read+0x23/0x40
>>>>>>>>>> [xfs]
>>>>>>>>>> [10919.516859] [<ffffffffc05eafa9>]
>>>>>>>>>> xfs_buf_submit_wait+0xf9/0x1d0 [xfs]
>>>>>>>>>> [10919.516902] [<ffffffffc061b279>] ?
>>>>>>>>>> xfs_trans_read_buf_map+0x199/0x400 [xfs]
>>>>>>>>>> [10919.516940] [<ffffffffc05eb0a3>]
_xfs_buf_read+0x23/0x40
>>>>>>>>>> [xfs]
>>>>>>>>>> [10919.516977] [<ffffffffc05eb1b9>]
xfs_buf_read_map+0xf9/0x160
>>>>>>>>>> [xfs]
>>>>>>>>>> [10919.517022] [<ffffffffc061b279>]
>>>>>>>>>> xfs_trans_read_buf_map+0x199/0x400 [xfs]
>>>>>>>>>> [10919.517057] [<ffffffffc05c8d04>]
xfs_da_read_buf+0xd4/0x100
>>>>>>>>>> [xfs]
>>>>>>>>>> [10919.517091] [<ffffffffc05c8d53>]
xfs_da3_node_read+0x23/0xd0
>>>>>>>>>> [xfs]
>>>>>>>>>> [10919.517126] [<ffffffffc05c9fee>]
>>>>>>>>>> xfs_da3_node_lookup_int+0x6e/0x2f0 [xfs]
>>>>>>>>>> [10919.517160] [<ffffffffc05d5a1d>]
>>>>>>>>>> xfs_dir2_node_lookup+0x4d/0x170 [xfs]
>>>>>>>>>> [10919.517194] [<ffffffffc05ccf5d>]
xfs_dir_lookup+0x1bd/0x1e0
>>>>>>>>>> [xfs]
>>>>>>>>>> [10919.517233] [<ffffffffc05fd8d9>]
xfs_lookup+0x69/0x140 [xfs]
>>>>>>>>>> [10919.517271] [<ffffffffc05fa018>]
xfs_vn_lookup+0x78/0xc0
>>>>>>>>>> [xfs]
>>>>>>>>>> [10919.517278] [<ffffffffb9425cf3>]
lookup_real+0x23/0x60
>>>>>>>>>> [10919.517283] [<ffffffffb9426702>]
__lookup_hash+0x42/0x60
>>>>>>>>>> [10919.517288] [<ffffffffb942d519>]
SYSC_renameat2+0x3a9/0x5a0
>>>>>>>>>> [10919.517296] [<ffffffffb94d3753>] ?
>>>>>>>>>> selinux_file_free_security+0x23/0x30
>>>>>>>>>> [10919.517304] [<ffffffffb992077b>] ?
>>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [10919.517309] [<ffffffffb992076f>] ?
>>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [10919.517313] [<ffffffffb992077b>] ?
>>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [10919.517318] [<ffffffffb992076f>] ?
>>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [10919.517323] [<ffffffffb942e58e>]
SyS_renameat2+0xe/0x10
>>>>>>>>>> [10919.517328] [<ffffffffb942e5ce>]
SyS_rename+0x1e/0x20
>>>>>>>>>> [10919.517333] [<ffffffffb992082f>]
>>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>>> [10919.517339] [<ffffffffb992077b>] ?
>>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [11159.496095] INFO: task glusteriotwr9:15482
blocked for more
>>>>>>>>>> than 120 seconds.
>>>>>>>>>> [11159.497546] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>>> disables this message.
>>>>>>>>>> [11159.498978] glusteriotwr9 D ffff971fa0fa1fa0
0 15482
>>>>>>>>>> 1 0x00000080
>>>>>>>>>> [11159.498984] Call Trace:
>>>>>>>>>> [11159.498995] [<ffffffffb9911f00>] ?
bit_wait+0x50/0x50
>>>>>>>>>> [11159.498999] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>>>>> [11159.499003] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>>>>>> [11159.499056] [<ffffffffc05dd9b7>] ?
>>>>>>>>>> xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs]
>>>>>>>>>> [11159.499082] [<ffffffffc05dd43e>] ?
>>>>>>>>>> xfs_iext_bno_to_irec+0x8e/0xd0 [xfs]
>>>>>>>>>> [11159.499090] [<ffffffffb92f7a12>] ?
ktime_get_ts64+0x52/0xf0
>>>>>>>>>> [11159.499093] [<ffffffffb9911f00>] ?
bit_wait+0x50/0x50
>>>>>>>>>> [11159.499097] [<ffffffffb991348d>]
>>>>>>>>>> io_schedule_timeout+0xad/0x130
>>>>>>>>>> [11159.499101] [<ffffffffb9913528>]
io_schedule+0x18/0x20
>>>>>>>>>> [11159.499104] [<ffffffffb9911f11>]
bit_wait_io+0x11/0x50
>>>>>>>>>> [11159.499107] [<ffffffffb9911ac1>]
__wait_on_bit_lock+0x61/0xc0
>>>>>>>>>> [11159.499113] [<ffffffffb9393634>]
__lock_page+0x74/0x90
>>>>>>>>>> [11159.499118] [<ffffffffb92bc210>] ?
>>>>>>>>>> wake_bit_function+0x40/0x40
>>>>>>>>>> [11159.499121] [<ffffffffb9394154>]
__find_lock_page+0x54/0x70
>>>>>>>>>> [11159.499125] [<ffffffffb9394e85>]
>>>>>>>>>> grab_cache_page_write_begin+0x55/0xc0
>>>>>>>>>> [11159.499130] [<ffffffffb9484b76>]
iomap_write_begin+0x66/0x100
>>>>>>>>>> [11159.499135] [<ffffffffb9484edf>]
iomap_write_actor+0xcf/0x1d0
>>>>>>>>>> [11159.499140] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>>>>>>> [11159.499144] [<ffffffffb94854e7>]
iomap_apply+0xb7/0x150
>>>>>>>>>> [11159.499149] [<ffffffffb9485621>]
>>>>>>>>>> iomap_file_buffered_write+0xa1/0xe0
>>>>>>>>>> [11159.499153] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>>>>>>> [11159.499182] [<ffffffffc05f025d>]
>>>>>>>>>> xfs_file_buffered_aio_write+0x12d/0x2c0 [xfs]
>>>>>>>>>> [11159.499213] [<ffffffffc05f057d>]
>>>>>>>>>> xfs_file_aio_write+0x18d/0x1b0 [xfs]
>>>>>>>>>> [11159.499217] [<ffffffffb941a533>]
do_sync_write+0x93/0xe0
>>>>>>>>>> [11159.499222] [<ffffffffb941b010>]
vfs_write+0xc0/0x1f0
>>>>>>>>>> [11159.499225] [<ffffffffb941c002>]
SyS_pwrite64+0x92/0xc0
>>>>>>>>>> [11159.499230] [<ffffffffb992076f>] ?
>>>>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [11159.499234] [<ffffffffb992082f>]
>>>>>>>>>> system_call_fastpath+0x1c/0x21
>>>>>>>>>> [11159.499238] [<ffffffffb992077b>] ?
>>>>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [11279.488720] INFO: task xfsaild/dm-10:1134
blocked for more
>>>>>>>>>> than 120 seconds.
>>>>>>>>>> [11279.490197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>>> [11279.491665] xfsaild/dm-10   D ffff9720a8660fd0     0  1134      2 0x00000000
>>>>>>>>>> [11279.491671] Call Trace:
>>>>>>>>>> [11279.491682] [<ffffffffb92a3a2e>] ? try_to_del_timer_sync+0x5e/0x90
>>>>>>>>>> [11279.491688] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>>>>>> [11279.491744] [<ffffffffc060de36>] _xfs_log_force+0x1c6/0x2c0 [xfs]
>>>>>>>>>> [11279.491750] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>>>>>>> [11279.491783] [<ffffffffc0619fec>] ? xfsaild+0x16c/0x6f0 [xfs]
>>>>>>>>>> [11279.491817] [<ffffffffc060df5c>] xfs_log_force+0x2c/0x70 [xfs]
>>>>>>>>>> [11279.491849] [<ffffffffc0619e80>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>>>>>>>>>> [11279.491880] [<ffffffffc0619fec>] xfsaild+0x16c/0x6f0 [xfs]
>>>>>>>>>> [11279.491913] [<ffffffffc0619e80>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>>>>>>>>>> [11279.491919] [<ffffffffb92bb161>] kthread+0xd1/0xe0
>>>>>>>>>> [11279.491926] [<ffffffffb92bb090>] ? insert_kthread_work+0x40/0x40
>>>>>>>>>> [11279.491932] [<ffffffffb9920677>] ret_from_fork_nospec_begin+0x21/0x21
>>>>>>>>>> [11279.491936] [<ffffffffb92bb090>] ? insert_kthread_work+0x40/0x40
>>>>>>>>>> [11279.491976] INFO: task glusterclogfsyn:14934 blocked for more than 120 seconds.
>>>>>>>>>> [11279.493466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>>> [11279.494952] glusterclogfsyn D ffff97209832af70     0 14934      1 0x00000080
>>>>>>>>>> [11279.494957] Call Trace:
>>>>>>>>>> [11279.494979] [<ffffffffc0309839>] ? __split_and_process_bio+0x2e9/0x520 [dm_mod]
>>>>>>>>>> [11279.494983] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>>>>>> [11279.494987] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0
>>>>>>>>>> [11279.494997] [<ffffffffc0309d98>] ? dm_make_request+0x128/0x1a0 [dm_mod]
>>>>>>>>>> [11279.495001] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130
>>>>>>>>>> [11279.495005] [<ffffffffb99145ad>] wait_for_completion_io+0xfd/0x140
>>>>>>>>>> [11279.495010] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>>>>>>> [11279.495016] [<ffffffffb951e574>] blkdev_issue_flush+0xb4/0x110
>>>>>>>>>> [11279.495049] [<ffffffffc06064b9>] xfs_blkdev_issue_flush+0x19/0x20 [xfs]
>>>>>>>>>> [11279.495079] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 [xfs]
>>>>>>>>>> [11279.495086] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>>>>>> [11279.495090] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [11279.495094] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>>>>>>> [11279.495098] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21
>>>>>>>>>> [11279.495102] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [11279.495105] INFO: task glusterposixfsy:14941 blocked for more than 120 seconds.
>>>>>>>>>> [11279.496606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>>> [11279.498114] glusterposixfsy D ffff972495f84f10     0 14941      1 0x00000080
>>>>>>>>>> [11279.498118] Call Trace:
>>>>>>>>>> [11279.498134] [<ffffffffc0309839>] ? __split_and_process_bio+0x2e9/0x520 [dm_mod]
>>>>>>>>>> [11279.498138] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>>>>>> [11279.498142] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0
>>>>>>>>>> [11279.498152] [<ffffffffc0309d98>] ? dm_make_request+0x128/0x1a0 [dm_mod]
>>>>>>>>>> [11279.498156] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130
>>>>>>>>>> [11279.498160] [<ffffffffb99145ad>] wait_for_completion_io+0xfd/0x140
>>>>>>>>>> [11279.498165] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>>>>>>> [11279.498169] [<ffffffffb951e574>] blkdev_issue_flush+0xb4/0x110
>>>>>>>>>> [11279.498202] [<ffffffffc06064b9>] xfs_blkdev_issue_flush+0x19/0x20 [xfs]
>>>>>>>>>> [11279.498231] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 [xfs]
>>>>>>>>>> [11279.498238] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>>>>>> [11279.498242] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [11279.498246] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20
>>>>>>>>>> [11279.498250] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21
>>>>>>>>>> [11279.498254] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [11279.498257] INFO: task glusteriotwr1:14950 blocked for more than 120 seconds.
>>>>>>>>>> [11279.499789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>>> [11279.501343] glusteriotwr1   D ffff97208b6daf70     0 14950      1 0x00000080
>>>>>>>>>> [11279.501348] Call Trace:
>>>>>>>>>> [11279.501353] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>>>>>> [11279.501390] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>>> [11279.501396] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>>>>>>> [11279.501428] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs]
>>>>>>>>>> [11279.501432] [<ffffffffb944ef3f>] generic_write_sync+0x4f/0x70
>>>>>>>>>> [11279.501461] [<ffffffffc05f0545>] xfs_file_aio_write+0x155/0x1b0 [xfs]
>>>>>>>>>> [11279.501466] [<ffffffffb941a533>] do_sync_write+0x93/0xe0
>>>>>>>>>> [11279.501471] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0
>>>>>>>>>> [11279.501475] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0
>>>>>>>>>> [11279.501479] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [11279.501483] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21
>>>>>>>>>> [11279.501489] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [11279.501493] INFO: task glusteriotwr4:14953 blocked for more than 120 seconds.
>>>>>>>>>> [11279.503047] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>>> [11279.504630] glusteriotwr4   D ffff972499f2bf40     0 14953      1 0x00000080
>>>>>>>>>> [11279.504635] Call Trace:
>>>>>>>>>> [11279.504640] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>>>>>>> [11279.504676] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>>>>>>>>>> [11279.504681] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>>>>>>> [11279.504710] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs]
>>>>>>>>>> [11279.504714] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>>>>>>> [11279.504718] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/0x160
>>>>>>>>>> [11279.504722] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>>>>>>> [11279.504725] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21
>>>>>>>>>> [11279.504730] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/0x160
>>>>>>>>>> [12127.466494] perf: interrupt took too long (8263 > 8150), lowering kernel.perf_event_max_sample_rate to 24000
>>>>>>>>>>
>>>>>>>>>> --------------------
>>>>>>>>>> I think this is the cause of the massive oVirt performance issues, regardless of which gluster volume is involved. At the time this happened, I was also ssh'ed into the host running some rpm query commands. I had just run rpm -qa | grep glusterfs (to verify what version was actually installed), and that command took almost 2 minutes to return! Normally it takes less than 2 seconds, and that is all pure local SSD IO, too.
>>>>>>>>>>
>>>>>>>>>> I'm no expert, but it's my understanding that any time software causes these kinds of issues, it's a serious bug in that software, even if it comes down to mis-handled exceptions. Is this correct?
>>>>>>>>>>
>>>>>>>>>> --Jim
>>>>>>>>>>
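>>>>>>>>>> (For anyone chasing the same symptom: a minimal way to see what is
>>>>>>>>>> actually stuck, assuming the sysstat package is installed and sysrq is
>>>>>>>>>> enabled, is:
>>>>>>>>>>
>>>>>>>>>>   # tasks currently in uninterruptible (D) sleep, plus the kernel
>>>>>>>>>>   # function they are waiting in
>>>>>>>>>>   ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
>>>>>>>>>>
>>>>>>>>>>   # per-device utilization and await, refreshed every 2 seconds
>>>>>>>>>>   iostat -xm 2
>>>>>>>>>>
>>>>>>>>>>   # dump all blocked tasks to dmesg, the same data the hung-task
>>>>>>>>>>   # watchdog prints
>>>>>>>>>>   echo w > /proc/sysrq-trigger
>>>>>>>>>>
>>>>>>>>>> That helps narrow down whether the stalls are on the gluster bricks,
>>>>>>>>>> the engine storage mount, or the local OS disk.)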
>>>>>>>>>> On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think this is the profile information for one of the volumes that lives on the SSDs and is fully operational with no down/problem disks:
>>>>>>>>>>>
>>>>>>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data info
>>>>>>>>>>> Brick: ovirt2.nwfiber.com:/gluster/brick2/data
>>>>>>>>>>> ----------------------------------------------
>>>>>>>>>>> Cumulative Stats:
>>>>>>>>>>>    Block Size:     256b+     512b+    1024b+
>>>>>>>>>>>  No. of Reads:       983      2696      1059
>>>>>>>>>>> No. of Writes:         0      1113       302
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:    2048b+    4096b+    8192b+
>>>>>>>>>>>  No. of Reads:       852     88608     53526
>>>>>>>>>>> No. of Writes:       522    812340     76257
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:   16384b+   32768b+   65536b+
>>>>>>>>>>>  No. of Reads:     54351    241901     15024
>>>>>>>>>>> No. of Writes:     21636      8656      8976
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:  131072b+
>>>>>>>>>>>  No. of Reads:    524156
>>>>>>>>>>> No. of Writes:    296071
>>>>>>>>>>>  %-latency  Avg-latency   Min-Latency   Max-Latency     No. of calls  Fop
>>>>>>>>>>>  ---------  -----------   -----------   -----------     ------------  ----
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             4189  RELEASE
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             1257  RELEASEDIR
>>>>>>>>>>>       0.00     46.19 us      12.00 us     187.00 us               69  FLUSH
>>>>>>>>>>>       0.00    147.00 us      78.00 us     367.00 us               86  REMOVEXATTR
>>>>>>>>>>>       0.00    223.46 us      24.00 us    1166.00 us              149  READDIR
>>>>>>>>>>>       0.00    565.34 us      76.00 us    3639.00 us               88  FTRUNCATE
>>>>>>>>>>>       0.00    263.28 us      20.00 us   28385.00 us              228  LK
>>>>>>>>>>>       0.00     98.84 us       2.00 us     880.00 us             1198  OPENDIR
>>>>>>>>>>>       0.00     91.59 us      26.00 us   10371.00 us             3853  STATFS
>>>>>>>>>>>       0.00    494.14 us      17.00 us  193439.00 us             1171  GETXATTR
>>>>>>>>>>>       0.00    299.42 us      35.00 us    9799.00 us             2044  READDIRP
>>>>>>>>>>>       0.00   1965.31 us     110.00 us  382258.00 us              321  XATTROP
>>>>>>>>>>>       0.01    113.40 us      24.00 us   61061.00 us             8134  STAT
>>>>>>>>>>>       0.01    755.38 us      57.00 us  607603.00 us             3196  DISCARD
>>>>>>>>>>>       0.05   2690.09 us      58.00 us  2704761.00 us            3206  OPEN
>>>>>>>>>>>       0.10 119978.25 us      97.00 us  9406684.00 us             154  SETATTR
>>>>>>>>>>>       0.18    101.73 us      28.00 us   700477.00 us          313379  FSTAT
>>>>>>>>>>>       0.23   1059.84 us      25.00 us  2716124.00 us           38255  LOOKUP
>>>>>>>>>>>       0.47   1024.11 us      54.00 us  6197164.00 us           81455  FXATTROP
>>>>>>>>>>>       1.72   2984.00 us      15.00 us  37098954.00 us          103020  FINODELK
>>>>>>>>>>>       5.92  44315.32 us      51.00 us  24731536.00 us           23957  FSYNC
>>>>>>>>>>>      13.27   2399.78 us      25.00 us  22089540.00 us          991005  READ
>>>>>>>>>>>      37.00   5980.43 us      52.00 us  22099889.00 us         1108976  WRITE
>>>>>>>>>>>      41.04   5452.75 us      13.00 us  22102452.00 us         1349053  INODELK
>>>>>>>>>>>
>>>>>>>>>>>     Duration: 10026 seconds
>>>>>>>>>>>    Data Read: 80046027759 bytes
>>>>>>>>>>> Data Written: 44496632320 bytes
>>>>>>>>>>>
>>>>>>>>>>> Interval 1 Stats:
>>>>>>>>>>>    Block Size:     256b+     512b+    1024b+
>>>>>>>>>>>  No. of Reads:       983      2696      1059
>>>>>>>>>>> No. of Writes:         0       838       185
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:    2048b+    4096b+    8192b+
>>>>>>>>>>>  No. of Reads:       852     85856     51575
>>>>>>>>>>> No. of Writes:       382    705802     57812
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:   16384b+   32768b+   65536b+
>>>>>>>>>>>  No. of Reads:     52673    232093     14984
>>>>>>>>>>> No. of Writes:     13499      4908      4242
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:  131072b+
>>>>>>>>>>>  No. of Reads:    460040
>>>>>>>>>>> No. of Writes:      6411
>>>>>>>>>>>  %-latency  Avg-latency   Min-Latency   Max-Latency     No. of calls  Fop
>>>>>>>>>>>  ---------  -----------   -----------   -----------     ------------  ----
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             2093  RELEASE
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             1093  RELEASEDIR
>>>>>>>>>>>       0.00     53.38 us      26.00 us     111.00 us               16  FLUSH
>>>>>>>>>>>       0.00    145.14 us      78.00 us     367.00 us               71  REMOVEXATTR
>>>>>>>>>>>       0.00    190.96 us     114.00 us     298.00 us               71  SETATTR
>>>>>>>>>>>       0.00    213.38 us      24.00 us    1145.00 us               90  READDIR
>>>>>>>>>>>       0.00    263.28 us      20.00 us   28385.00 us              228  LK
>>>>>>>>>>>       0.00    101.76 us       2.00 us     880.00 us             1093  OPENDIR
>>>>>>>>>>>       0.01     93.60 us      27.00 us   10371.00 us             3090  STATFS
>>>>>>>>>>>       0.02    537.47 us      17.00 us  193439.00 us             1038  GETXATTR
>>>>>>>>>>>       0.03    297.44 us      35.00 us    9799.00 us             1990  READDIRP
>>>>>>>>>>>       0.03   2357.28 us     110.00 us  382258.00 us              253  XATTROP
>>>>>>>>>>>       0.04    385.93 us      58.00 us   47593.00 us             2091  OPEN
>>>>>>>>>>>       0.04    114.86 us      24.00 us   61061.00 us             7715  STAT
>>>>>>>>>>>       0.06    444.59 us      57.00 us  333240.00 us             3053  DISCARD
>>>>>>>>>>>       0.42    316.24 us      25.00 us  290728.00 us            29823  LOOKUP
>>>>>>>>>>>       0.73    257.92 us      54.00 us  344812.00 us            63296  FXATTROP
>>>>>>>>>>>       1.37     98.30 us      28.00 us   67621.00 us           313172  FSTAT
>>>>>>>>>>>       1.58   2124.69 us      51.00 us  849200.00 us            16717  FSYNC
>>>>>>>>>>>       5.73    162.46 us      52.00 us  748492.00 us           794079  WRITE
>>>>>>>>>>>       7.19   2065.17 us      16.00 us  37098954.00 us           78381  FINODELK
>>>>>>>>>>>      36.44    886.32 us      25.00 us  2216436.00 us           925421  READ
>>>>>>>>>>>      46.30   1178.04 us      13.00 us  1700704.00 us           884635  INODELK
>>>>>>>>>>>
>>>>>>>>>>>     Duration: 7485 seconds
>>>>>>>>>>>    Data Read: 71250527215 bytes
>>>>>>>>>>> Data Written: 5119903744 bytes
>>>>>>>>>>>
>>>>>>>>>>> Brick: ovirt3.nwfiber.com:/gluster/brick2/data
>>>>>>>>>>> ----------------------------------------------
>>>>>>>>>>> Cumulative Stats:
>>>>>>>>>>>    Block Size:       1b+
>>>>>>>>>>>  No. of Reads:         0
>>>>>>>>>>> No. of Writes:   3264419
>>>>>>>>>>>  %-latency  Avg-latency   Min-Latency   Max-Latency     No. of calls  Fop
>>>>>>>>>>>  ---------  -----------   -----------   -----------     ------------  ----
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us               90  FORGET
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             9462  RELEASE
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             4254  RELEASEDIR
>>>>>>>>>>>       0.00     50.52 us      13.00 us     190.00 us               71  FLUSH
>>>>>>>>>>>       0.00    186.97 us      87.00 us     713.00 us               86  REMOVEXATTR
>>>>>>>>>>>       0.00     79.32 us      33.00 us     189.00 us              228  LK
>>>>>>>>>>>       0.00    220.98 us     129.00 us     513.00 us               86  SETATTR
>>>>>>>>>>>       0.01    259.30 us      26.00 us    2632.00 us              137  READDIR
>>>>>>>>>>>       0.02    322.76 us     145.00 us    2125.00 us              321  XATTROP
>>>>>>>>>>>       0.03    109.55 us       2.00 us    1258.00 us             1193  OPENDIR
>>>>>>>>>>>       0.05     70.21 us      21.00 us     431.00 us             3196  DISCARD
>>>>>>>>>>>       0.05    169.26 us      21.00 us    2315.00 us             1545  GETXATTR
>>>>>>>>>>>       0.12    176.85 us      63.00 us    2844.00 us             3206  OPEN
>>>>>>>>>>>       0.61    303.49 us      90.00 us    3085.00 us             9633  FSTAT
>>>>>>>>>>>       2.44    305.66 us      28.00 us    3716.00 us            38230  LOOKUP
>>>>>>>>>>>       4.52    266.22 us      55.00 us   53424.00 us            81455  FXATTROP
>>>>>>>>>>>       6.96   1397.99 us      51.00 us   64822.00 us            23889  FSYNC
>>>>>>>>>>>      16.48     84.74 us      25.00 us    6917.00 us           932592  WRITE
>>>>>>>>>>>      30.16    106.90 us      13.00 us  3920189.00 us          1353046  INODELK
>>>>>>>>>>>      38.55   1794.52 us      14.00 us  16210553.00 us          103039  FINODELK
>>>>>>>>>>>
>>>>>>>>>>>     Duration: 66562 seconds
>>>>>>>>>>>    Data Read: 0 bytes
>>>>>>>>>>> Data Written: 3264419 bytes
>>>>>>>>>>>
>>>>>>>>>>> Interval 1 Stats:
>>>>>>>>>>>    Block Size:       1b+
>>>>>>>>>>>  No. of Reads:         0
>>>>>>>>>>> No. of Writes:    794080
>>>>>>>>>>>  %-latency  Avg-latency   Min-Latency   Max-Latency     No. of calls  Fop
>>>>>>>>>>>  ---------  -----------   -----------   -----------     ------------  ----
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             2093  RELEASE
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             1093  RELEASEDIR
>>>>>>>>>>>       0.00     70.31 us      26.00 us     125.00 us               16  FLUSH
>>>>>>>>>>>       0.00    193.10 us     103.00 us     713.00 us               71  REMOVEXATTR
>>>>>>>>>>>       0.01    227.32 us     133.00 us     513.00 us               71  SETATTR
>>>>>>>>>>>       0.01     79.32 us      33.00 us     189.00 us              228  LK
>>>>>>>>>>>       0.01    259.83 us      35.00 us    1138.00 us               89  READDIR
>>>>>>>>>>>       0.03    318.26 us     145.00 us    2047.00 us              253  XATTROP
>>>>>>>>>>>       0.04    112.67 us       3.00 us    1258.00 us             1093  OPENDIR
>>>>>>>>>>>       0.06    167.98 us      23.00 us    1951.00 us             1014  GETXATTR
>>>>>>>>>>>       0.08     70.97 us      22.00 us     431.00 us             3053  DISCARD
>>>>>>>>>>>       0.13    183.78 us      66.00 us    2844.00 us             2091  OPEN
>>>>>>>>>>>       1.01    303.82 us      90.00 us    3085.00 us             9610  FSTAT
>>>>>>>>>>>       3.27    316.59 us      30.00 us    3716.00 us            29820  LOOKUP
>>>>>>>>>>>       5.83    265.79 us      59.00 us   53424.00 us            63296  FXATTROP
>>>>>>>>>>>       7.95   1373.89 us      51.00 us   64822.00 us            16717  FSYNC
>>>>>>>>>>>      23.17    851.99 us      14.00 us  16210553.00 us           78555  FINODELK
>>>>>>>>>>>      24.04     87.44 us      27.00 us    6917.00 us           794081  WRITE
>>>>>>>>>>>      34.36    111.91 us      14.00 us  984871.00 us           886790  INODELK
>>>>>>>>>>>
>>>>>>>>>>>     Duration: 7485 seconds
>>>>>>>>>>>    Data Read: 0 bytes
>>>>>>>>>>> Data Written: 794080 bytes
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -----------------------
>>>>>>>>>>> Here is the data from the volume that is backed by the SSHDs and has one failed disk:
>>>>>>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data-hdd info
>>>>>>>>>>> Brick: 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>>>>>> --------------------------------------------
>>>>>>>>>>> Cumulative Stats:
>>>>>>>>>>>    Block Size:     256b+     512b+    1024b+
>>>>>>>>>>>  No. of Reads:      1702        86        16
>>>>>>>>>>> No. of Writes:         0       767        71
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:    2048b+    4096b+    8192b+
>>>>>>>>>>>  No. of Reads:        19     51841      2049
>>>>>>>>>>> No. of Writes:        76     60668     35727
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:   16384b+   32768b+   65536b+
>>>>>>>>>>>  No. of Reads:      1744       639      1088
>>>>>>>>>>> No. of Writes:      8524      2410      1285
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:  131072b+
>>>>>>>>>>>  No. of Reads:    771999
>>>>>>>>>>> No. of Writes:     29584
>>>>>>>>>>>  %-latency  Avg-latency   Min-Latency   Max-Latency     No. of calls  Fop
>>>>>>>>>>>  ---------  -----------   -----------   -----------     ------------  ----
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             2902  RELEASE
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             1517  RELEASEDIR
>>>>>>>>>>>       0.00    197.00 us     197.00 us     197.00 us                1  FTRUNCATE
>>>>>>>>>>>       0.00     70.24 us      16.00 us     758.00 us               51  FLUSH
>>>>>>>>>>>       0.00    143.93 us      82.00 us     305.00 us               57  REMOVEXATTR
>>>>>>>>>>>       0.00    178.63 us     105.00 us     712.00 us               60  SETATTR
>>>>>>>>>>>       0.00     67.30 us      19.00 us     572.00 us              555  LK
>>>>>>>>>>>       0.00    322.80 us      23.00 us    4673.00 us              138  READDIR
>>>>>>>>>>>       0.00    336.56 us     106.00 us   11994.00 us              237  XATTROP
>>>>>>>>>>>       0.00     84.70 us      28.00 us    1071.00 us             3469  STATFS
>>>>>>>>>>>       0.01    387.75 us       2.00 us  146017.00 us             1467  OPENDIR
>>>>>>>>>>>       0.01    148.59 us      21.00 us   64374.00 us             4454  STAT
>>>>>>>>>>>       0.02    783.02 us      16.00 us   93502.00 us             1902  GETXATTR
>>>>>>>>>>>       0.03   1516.10 us      17.00 us  210690.00 us             1364  ENTRYLK
>>>>>>>>>>>       0.03   2555.47 us     300.00 us  674454.00 us             1064  READDIRP
>>>>>>>>>>>       0.07     85.74 us      19.00 us   68340.00 us            62849  FSTAT
>>>>>>>>>>>       0.07   1978.12 us      59.00 us  202596.00 us             2729  OPEN
>>>>>>>>>>>       0.22    708.57 us      15.00 us  394799.00 us            25447  LOOKUP
>>>>>>>>>>>       5.94   2331.74 us      15.00 us  1099530.00 us           207534  FINODELK
>>>>>>>>>>>       7.31   8311.75 us      58.00 us  1800216.00 us            71668  FXATTROP
>>>>>>>>>>>      12.49   7735.19 us      51.00 us  3595513.00 us           131642  WRITE
>>>>>>>>>>>      17.70    957.08 us      16.00 us  13700466.00 us         1508160  INODELK
>>>>>>>>>>>      24.55   2546.43 us      26.00 us  5077347.00 us           786060  READ
>>>>>>>>>>>      31.56  49699.15 us      47.00 us  3746331.00 us            51777  FSYNC
>>>>>>>>>>>
>>>>>>>>>>>     Duration: 10101 seconds
>>>>>>>>>>>    Data Read: 101562897361 bytes
>>>>>>>>>>> Data Written: 4834450432 bytes
>>>>>>>>>>>
>>>>>>>>>>> Interval 0 Stats:
>>>>>>>>>>>    Block Size:     256b+     512b+    1024b+
>>>>>>>>>>>  No. of Reads:      1702        86        16
>>>>>>>>>>> No. of Writes:         0       767        71
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:    2048b+    4096b+    8192b+
>>>>>>>>>>>  No. of Reads:        19     51841      2049
>>>>>>>>>>> No. of Writes:        76     60668     35727
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:   16384b+   32768b+   65536b+
>>>>>>>>>>>  No. of Reads:      1744       639      1088
>>>>>>>>>>> No. of Writes:      8524      2410      1285
>>>>>>>>>>>
>>>>>>>>>>>    Block Size:  131072b+
>>>>>>>>>>>  No. of Reads:    771999
>>>>>>>>>>> No. of Writes:     29584
>>>>>>>>>>>  %-latency  Avg-latency   Min-Latency   Max-Latency     No. of calls  Fop
>>>>>>>>>>>  ---------  -----------   -----------   -----------     ------------  ----
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             2902  RELEASE
>>>>>>>>>>>       0.00      0.00 us       0.00 us       0.00 us             1517  RELEASEDIR
>>>>>>>>>>>       0.00    197.00 us     197.00 us     197.00 us                1  FTRUNCATE
>>>>>>>>>>>       0.00     70.24 us      16.00 us     758.00 us               51  FLUSH
>>>>>>>>>>>       0.00    143.93 us      82.00 us     305.00 us               57  REMOVEXATTR
>>>>>>>>>>>       0.00    178.63 us     105.00 us     712.00 us               60  SETATTR
>>>>>>>>>>>       0.00     67.30 us      19.00 us     572.00 us              555  LK
>>>>>>>>>>>       0.00    322.80 us      23.00 us    4673.00 us              138  READDIR
>>>>>>>>>>>       0.00    336.56 us     106.00 us   11994.00 us              237  XATTROP
>>>>>>>>>>>       0.00     84.70 us      28.00 us    1071.00 us             3469  STATFS
>>>>>>>>>>>       0.01    387.75 us       2.00 us  146017.00 us             1467  OPENDIR
>>>>>>>>>>>       0.01    148.59 us      21.00 us   64374.00 us             4454  STAT
>>>>>>>>>>>       0.02    783.02 us      16.00 us   93502.00 us             1902  GETXATTR
>>>>>>>>>>>       0.03   1516.10 us      17.00 us  210690.00 us             1364  ENTRYLK
>>>>>>>>>>>       0.03   2555.47 us     300.00 us  674454.00 us             1064  READDIRP
>>>>>>>>>>>       0.07     85.73 us      19.00 us   68340.00 us            62849  FSTAT
>>>>>>>>>>>       0.07   1978.12 us      59.00 us  202596.00 us             2729  OPEN
>>>>>>>>>>>       0.22    708.57 us      15.00 us  394799.00 us            25447  LOOKUP
>>>>>>>>>>>       5.94   2334.57 us      15.00 us  1099530.00 us           207534  FINODELK
>>>>>>>>>>>       7.31   8311.49 us      58.00 us  1800216.00 us            71668  FXATTROP
>>>>>>>>>>>      12.49   7735.32 us      51.00 us  3595513.00 us           131642  WRITE
>>>>>>>>>>>      17.71    957.08 us      16.00 us  13700466.00 us         1508160  INODELK
>>>>>>>>>>>      24.56   2546.42 us      26.00 us  5077347.00 us           786060  READ
>>>>>>>>>>>      31.54  49651.63 us      47.00 us  3746331.00 us            51777  FSYNC
>>>>>>>>>>>
>>>>>>>>>>>     Duration: 10101 seconds
>>>>>>>>>>>    Data Read: 101562897361 bytes
>>>>>>>>>>> Data Written: 4834450432 bytes
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you for your response.
>>>>>>>>>>>>
>>>>>>>>>>>> I have 4 gluster volumes. 3 are replica 2 + arbiter; the replica bricks are on ovirt1 and ovirt2, with the arbiter on ovirt3. The 4th volume is replica 3, with a brick on all three ovirt machines.
>>>>>>>>>>>>
>>>>>>>>>>>> The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD (same in all three machines). On ovirt3, the SSHD has reported hard IO failures, and that brick is offline. However, the other two replicas are fully operational (although they still show contents in the heal info command that won't go away, but that may be the case until I replace the failed disk).
>>>>>>>>>>>>
>>>>>>>>>>>> What is bothering me is that ALL 4 gluster volumes are showing horrible performance issues. At this point, as the bad disk has been completely offlined, I would expect gluster to perform at normal speed, but that is definitely not the case.
>>>>>>>>>>>>
>>>>>>>>>>>> I've also noticed that the performance hits seem to come in waves: things work acceptably (but slow) for a while, then suddenly it's as if all disk IO on all volumes (including the non-gluster local OS disk volumes for the hosts) pauses for about 30 seconds, then IO resumes again. During those times, I start getting VM-not-responding and host-not-responding notices, and the applications have major issues.
>>>>>>>>>>>>
>>>>>>>>>>>> I've shut down most of my VMs and am down to just my essential core VMs (shed about 75% of my VMs). I'm still experiencing the same issues.
>>>>>>>>>>>>
>>>>>>>>>>>> Am I correct in believing that once the failed disk was brought offline, performance should return to normal?
>>>>>>>>>>>>
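>>>>>>>>>>>> (A quick way to confirm gluster has actually marked the failed brick
>>>>>>>>>>>> down, and to watch the heal backlog shrink: a sketch, using the
>>>>>>>>>>>> data-hdd volume named elsewhere in this thread:
>>>>>>>>>>>>
>>>>>>>>>>>>   gluster volume status data-hdd
>>>>>>>>>>>>   # the dead brick should show N in the Online column; if it still
>>>>>>>>>>>>   # shows online with a dying disk behind it, writes can stall on it
>>>>>>>>>>>>
>>>>>>>>>>>>   gluster volume heal data-hdd statistics heal-count
>>>>>>>>>>>>   # pending heal entries per brick)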
>>>>>>>>>>>> On Tue, May 29, 2018 at 1:27 PM, Alex K <rightkicktech(a)gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I would check disk status and the accessibility of the mount points where your gluster volumes reside.
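>>>>>>>>>>>>> (For example, a sketch, assuming the bricks are mounted under
>>>>>>>>>>>>> /gluster and smartmontools is installed; /dev/sdX is a placeholder
>>>>>>>>>>>>> for the disk behind the brick:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   df -h /gluster/brick*            # brick filesystems mounted, not full
>>>>>>>>>>>>>   grep -i 'i/o error' /var/log/messages | tail
>>>>>>>>>>>>>   smartctl -H /dev/sdX             # overall disk health verdict
>>>>>>>>>>>>> )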
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 29, 2018, 22:28 Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On one ovirt server, I'm now seeing these messages:
>>>>>>>>>>>>>> [56474.239725] blk_update_request: 63 callbacks suppressed
>>>>>>>>>>>>>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>>>>>>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>>>>>>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>>>>>>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
>>>>>>>>>>>>>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
>>>>>>>>>>>>>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
>>>>>>>>>>>>>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>>>>>>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>>>>>>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>>>>>>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
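>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (dm-2 is a device-mapper node, so the next step is mapping it back to
>>>>>>>>>>>>>> the LV and physical disk underneath; a sketch:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   ls -l /dev/mapper/ | grep dm-2   # which mapped device (LV etc.) this is
>>>>>>>>>>>>>>   lsblk -s /dev/dm-2               # walk down to the backing physical disk
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Errors at sector 0 usually point at the whole underlying disk having
>>>>>>>>>>>>>> dropped off the bus rather than a single bad region.)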
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>>>>>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>>>>>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>>>>>> (appears a lot).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also found, on the ssh session to that host, some sysv warnings about the backing disk for one of the gluster volumes (the straight replica 3). The glusterfs process for that disk on that machine went offline. It's my understanding that it should continue to work with the other two machines while I attempt to replace that disk, right? Attempted writes (touching an empty file) can take 15 seconds, though repeating the same write later is much faster.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Gluster generates a bunch of different log files; I don't know which ones you want, or from which machine(s).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> How do I do "volume profiling"?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <sabose(a)redhat.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you see errors reported in the mount logs for the volume? If so, could you attach the logs?
>>>>>>>>>>>>>>>> Any issues with your underlying disks? Can you also attach the output of volume profiling?
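>>>>>>>>>>>>>>>> (The client/mount logs live under /var/log/glusterfs/ on each host;
>>>>>>>>>>>>>>>> for oVirt storage domains they are typically named after the mount
>>>>>>>>>>>>>>>> path, e.g. rhev-data-center-mnt-glusterSD-<server>:_<volume>.log, an
>>>>>>>>>>>>>>>> assumption based on a default setup. A quick check for recent errors:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   grep ' E ' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log | tail
>>>>>>>>>>>>>>>> )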
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ok, things have gotten MUCH worse this morning. I'm getting random errors from VMs; right now about a third of my VMs have been paused due to storage issues, and most of the remaining VMs are not performing well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> At this point, I am in full EMERGENCY mode, as my production services are now impacted, and I'm getting calls coming in with problems...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd greatly appreciate help...VMs are running VERY slowly (when they run), and they are steadily getting worse. I don't know why. I was seeing CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a time (while the VM became unresponsive, and any Linux VMs I was logged into were giving me the CPU-stuck messages in my original post). Is all this storage related?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also have two different gluster volumes for VM storage, and only one had the issues, but now VMs on both are being affected at the same time and in the same way.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <sabose(a)redhat.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [Adding gluster-users to look at the heal issue]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I've been having some cluster and gluster performance issues lately. I also found that my cluster was out of date and was trying to apply updates (hoping to fix some of these), and discovered the oVirt 4.1 repos were taken completely offline. So, I was forced to begin an upgrade to 4.2. According to the docs I found/read, I needed only to add the new repo, do a yum update, and reboot to be good on my hosts (I did the yum update and the engine-setup on my hosted engine). Things seemed to work relatively well, except for a gluster sync issue that showed up.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> My cluster is a 3-node hyperconverged cluster. I upgraded the hosted engine first, then engine 3. When engine 3 came back up, for some reason one of my gluster volumes would not sync. Here's sample output:
>>>>>>>>>>>>>>>>>>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>>>>>>>>>>>>>>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ---------
>>>>>>>>>>>>>>>>>>> It's been in this state for a couple of days now, and bandwidth monitoring shows no appreciable data moving. I've tried repeatedly commanding a full heal from all three nodes in the cluster. It's always the same files that need healing.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> When running gluster volume heal data-hdd statistics, I sometimes see different information, but always some number of "heal failed" entries. It shows 0 for split brain.
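>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (For reference, the full-heal and statistics commands referred to
>>>>>>>>>>>>>>>>>>> above, plus a split-brain check to confirm the zero count; a sketch:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   gluster volume heal data-hdd full
>>>>>>>>>>>>>>>>>>>   gluster volume heal data-hdd statistics heal-count
>>>>>>>>>>>>>>>>>>>   gluster volume heal data-hdd info split-brain
>>>>>>>>>>>>>>>>>>> )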
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm not quite sure what to do. I suspect it may be due to nodes 1 and 2 still being on the older oVirt/gluster release, but I'm afraid to upgrade and reboot them until I have a good gluster sync (I don't need to create a split-brain issue). How do I proceed with this?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Second issue: I've been experiencing VERY POOR performance on most of my VMs, to the point that logging into a Windows 10 VM via remote desktop can take 5 minutes, and launching QuickBooks inside said VM can easily take 10 minutes. On some Linux VMs, I get random messages like this:
>>>>>>>>>>>>>>>>>>> Message from syslogd@unifi at May 28 20:39:23 ...
>>>>>>>>>>>>>>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mongod:14766]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (the process and PID are often different)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm not quite sure what to do about this either. My initial thought was to upgrade everything to current and see if it's still there, but I cannot move forward with that until my gluster is healed...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>> Users mailing list -- users(a)ovirt.org
>>>>>>>>>>>>>>>>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>>>>>>>>>>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>>>>>>>>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>>>>>>>>>>>>>>>> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOTIRQF/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Users mailing list -- users(a)ovirt.org
>>>>>>>>>>>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>>>>>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>>>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>>>>>>>>>>> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/ACO7RFSLBSRBAIONIC2HQ6Z24ZDES5MF/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list -- users(a)ovirt.org
>>>>>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>>>>> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/JKBSLSZFRMFGILXGZB4YAMJIJKEKNGMA/
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>