I tried to capture a consistent log slice across the cluster from
agent.log. Several hosts claim they are starting the engine, but virsh
never shows any sign of it. They also report that a UUID (one I'm pretty
sure is not affiliated with my engine) is not found.
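(For anyone trying to reproduce this: a hedged sketch of one way to pull
a matching time window from every host. It assumes root ssh to each node
and the stock agent.log location, and simply greps the timestamp prefix:)

for h in ovirt1 ovirt2 ovirt3; do
    echo "---------- $h ----------"
    ssh root@$h "grep '2018-05-30 00:' /var/log/ovirt-hosted-engine-ha/agent.log"
done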
----------
ovirt1
----------
[root@ovirt1 vdsm]# tail -n 20 ../ovirt-hosted-engine-ha/agent.log
MainThread::INFO::2018-05-30
00:14:02,154::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
MainThread::INFO::2018-05-30
00:14:02,173::states::776::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Another host already took over..
MainThread::INFO::2018-05-30
00:14:02,174::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
Timeout cleared while transitioning <class
'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class
'ovirt_hosted_engine_ha.agent.states.EngineForceStop'>
MainThread::INFO::2018-05-30
00:14:02,533::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition
(EngineStarting-EngineForceStop) sent? sent
MainThread::INFO::2018-05-30
00:14:02,843::ovf_store::151::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
OVF_STORE volume path:
/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
MainThread::INFO::2018-05-30
00:18:09,236::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineForceStop (score: 3400)
MainThread::INFO::2018-05-30
00:18:09,236::hosted_engine::499::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Best remote host
ovirt3.nwfiber.com (id: 3, score: 3400)
MainThread::INFO::2018-05-30
00:18:09,253::hosted_engine::1007::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
Shutting down vm using `/usr/sbin/hosted-engine --vm-poweroff`
MainThread::INFO::2018-05-30
00:18:11,159::hosted_engine::1012::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stdout:
MainThread::INFO::2018-05-30
00:18:11,160::hosted_engine::1013::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stderr:
MainThread::ERROR::2018-05-30
00:18:11,160::hosted_engine::1021::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
Engine VM stopped on localhost
MainThread::INFO::2018-05-30
00:18:11,628::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineForceStop-EngineDown)
sent? sent
MainThread::INFO::2018-05-30
00:22:17,692::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineDown (score: 3400)
MainThread::INFO::2018-05-30
00:22:17,713::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
Global metadata: {'maintenance': False}
MainThread::INFO::2018-05-30
00:22:17,713::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
Host
ovirt2.nwfiber.com (id 2): {'conf_on_shared_storage': True, 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4348 (Wed
May 30 00:16:46 2018)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4349
(Wed May 30 00:16:47
2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineDown\nstopped=False\n',
'hostname': 'ovirt2.nwfiber.com', 'alive': True,
'host-id': 2,
'engine-status': {'reason': 'bad vm status', 'health':
'bad', 'vm':
'down_unexpected', 'detail': 'Down'}, 'score': 3400,
'stopped': False,
'maintenance': False, 'crc32': '5ee0a557',
'local_conf_timestamp': 4349,
'host-ts': 4348}
MainThread::INFO::2018-05-30
00:22:17,713::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
Host
ovirt3.nwfiber.com (id 3): {'conf_on_shared_storage': True, 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=100036
(Wed May 30 00:16:58
2018)\nhost-id=3\nscore=3400\nvm_conf_refresh_time=100037 (Wed May 30
00:16:58
2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineStart\nstopped=False\n',
'hostname': 'ovirt3.nwfiber.com', 'alive': True,
'host-id': 3,
'engine-status': {'reason': 'bad vm status', 'health':
'bad', 'vm':
'down_unexpected', 'detail': 'Down'}, 'score': 3400,
'stopped': False,
'maintenance': False, 'crc32': 'e2fadf31',
'local_conf_timestamp': 100037,
'host-ts': 100036}
MainThread::INFO::2018-05-30
00:22:17,713::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
Local (id 1): {'engine-health': {'reason': 'vm not running on this
host',
'health': 'bad', 'vm': 'down', 'detail':
'unknown'}, 'bridge': True,
'mem-free': 30186.0, 'maintenance': False, 'cpu-load': 0.0308,
'gateway':
1.0, 'storage-domain': True}
MainThread::INFO::2018-05-30
00:22:17,714::states::508::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start
engine VM
MainThread::INFO::2018-05-30
00:22:18,309::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineStart)
sent? sent
MainThread::INFO::2018-05-30
00:22:18,624::ovf_store::151::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
OVF_STORE volume path:
/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
----------
ovirt2
----------
MainThread::INFO::2018-05-30
00:08:34,534::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUnexpectedlyDown (score: 0)
MainThread::INFO::2018-05-30
00:08:34,534::hosted_engine::499::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Best remote host
ovirt3.nwfiber.com (id: 3, score: 3400)
MainThread::INFO::2018-05-30
00:08:34,545::states::678::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down, local host does not have best score
MainThread::INFO::2018-05-30
00:12:40,627::states::693::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Score is 0 due to unexpected vm shutdown at Wed May 30 00:04:28 2018
MainThread::INFO::2018-05-30
00:12:40,628::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUnexpectedlyDown (score: 0)
MainThread::INFO::2018-05-30
00:12:40,628::hosted_engine::499::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Best remote host
ovirt1.nwfiber.com (id: 1, score: 3400)
MainThread::INFO::2018-05-30
00:12:40,643::states::678::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down, local host does not have best score
MainThread::INFO::2018-05-30
00:12:40,827::ovf_store::151::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
OVF_STORE volume path:
/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
MainThread::INFO::2018-05-30
00:16:46,635::states::693::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Score is 0 due to unexpected vm shutdown at Wed May 30 00:04:28 2018
MainThread::INFO::2018-05-30
00:16:46,636::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUnexpectedlyDown (score: 0)
MainThread::INFO::2018-05-30
00:16:46,648::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
Global metadata: {'maintenance': False}
MainThread::INFO::2018-05-30
00:16:46,648::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
Host
ovirt1.nwfiber.com (id 1): {'conf_on_shared_storage': True, 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=7614 (Wed
May 30 00:09:54 2018)\nhost-id=1\nscore=3400\nvm_conf_refresh_time=7614
(Wed May 30 00:09:54
2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineStarting\nstopped=False\n',
'hostname': 'ovirt1.nwfiber.com', 'alive': True,
'host-id': 1,
'engine-status': {'reason': 'vm not running on this host',
'health': 'bad',
'vm': 'down', 'detail': 'unknown'}, 'score': 3400,
'stopped': False,
'maintenance': False, 'crc32': '1fa13e84',
'local_conf_timestamp': 7614,
'host-ts': 7614}
MainThread::INFO::2018-05-30
00:16:46,648::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
Host
ovirt3.nwfiber.com (id 3): {'conf_on_shared_storage': True, 'extra':
'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=99541 (Wed
May 30 00:08:43 2018)\nhost-id=3\nscore=3400\nvm_conf_refresh_time=99542
(Wed May 30 00:08:43
2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineForceStop\nstopped=False\n',
'hostname': 'ovirt3.nwfiber.com', 'alive': True,
'host-id': 3,
'engine-status': {'reason': 'bad vm status', 'health':
'bad', 'vm': 'down',
'detail': 'Down'}, 'score': 3400, 'stopped': False,
'maintenance': False,
'crc32': '56f87e2c', 'local_conf_timestamp': 99542,
'host-ts': 99541}
MainThread::INFO::2018-05-30
00:16:46,648::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
Local (id 2): {'engine-health': {'reason': 'bad vm status',
'health':
'bad', 'vm': 'down_unexpected', 'detail': 'Down'},
'bridge': True,
'mem-free': 30891.0, 'maintenance': False, 'cpu-load': 0.0186,
'gateway':
1.0, 'storage-domain': True}
MainThread::INFO::2018-05-30
00:16:46,919::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition
(EngineUnexpectedlyDown-EngineDown) sent? sent
MainThread::INFO::2018-05-30
00:20:52,918::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineDown (score: 3400)
MainThread::INFO::2018-05-30
00:20:52,918::hosted_engine::499::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Best remote host
ovirt1.nwfiber.com (id: 1, score: 3400)
MainThread::INFO::2018-05-30
00:20:52,930::states::508::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start
engine VM
MainThread::INFO::2018-05-30
00:20:53,203::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineStart)
sent? sent
MainThread::INFO::2018-05-30
00:20:53,393::ovf_store::151::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
OVF_STORE volume path:
/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
----------
ovirt3
----------
[root@ovirt3 images]# tail -n 20 /var/log/ovirt-hosted-engine-ha/agent.log
MainThread::INFO::2018-05-30
00:12:51,792::hosted_engine::1012::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stdout:
MainThread::INFO::2018-05-30
00:12:51,792::hosted_engine::1013::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stderr:
MainThread::ERROR::2018-05-30
00:12:51,793::hosted_engine::1021::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
Engine VM stopped on localhost
MainThread::INFO::2018-05-30
00:12:52,157::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineForceStop-EngineDown)
sent? sent
MainThread::INFO::2018-05-30
00:16:58,237::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineDown (score: 3400)
MainThread::INFO::2018-05-30
00:16:58,256::states::508::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start
engine VM
MainThread::INFO::2018-05-30
00:16:58,595::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineStart)
sent? sent
MainThread::INFO::2018-05-30
00:16:58,816::ovf_store::151::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
OVF_STORE volume path:
/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
MainThread::INFO::2018-05-30
00:21:05,082::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStart (score: 3400)
MainThread::INFO::2018-05-30
00:21:05,083::hosted_engine::499::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Best remote host
ovirt1.nwfiber.com (id: 1, score: 3400)
MainThread::INFO::2018-05-30
00:21:05,125::hosted_engine::968::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
Ensuring VDSM state is clear for engine VM
MainThread::INFO::2018-05-30
00:21:05,141::hosted_engine::988::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
Cleaning state for non-running VM
MainThread::INFO::2018-05-30
00:21:07,343::hosted_engine::980::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
Vdsm state for VM clean
MainThread::INFO::2018-05-30
00:21:07,343::hosted_engine::929::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2018-05-30
00:21:08,107::hosted_engine::935::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
stdout:
MainThread::INFO::2018-05-30
00:21:08,108::hosted_engine::936::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
stderr: Command VM.getStats with args {'vmID':
'50392390-4f89-435c-bf3b-8254b58a4ef7'} failed:
(code=1, message=Virtual machine does not exist: {'vmId':
u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
MainThread::INFO::2018-05-30
00:21:08,108::hosted_engine::948::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
Engine VM started on localhost
MainThread::INFO::2018-05-30
00:21:08,413::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineStart-EngineStarting)
sent? sent
----------
Note that virsh does not show the engine started on this machine.
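(A hedged way to cross-check what vdsm thinks against what libvirt is
actually running on a host; vdsm-client should exist on vdsm 4.20+, and
the read-only virsh invocation avoids vdsm's sasl login:)

vdsm-client Host getVMList    # VMs vdsm believes exist
virsh -r list --all           # VMs libvirt actually knows about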
On Wed, May 30, 2018 at 12:16 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
So, on ovirt2, I ran hosted-engine --vm-status. It took over 2 minutes
to return, but two of the 3 machines had current (non-stale) data:
[root@ovirt2 ~]# hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : False
Hostname : ovirt1.nwfiber.com
Host ID : 1
Engine status : unknown stale-data
Score : 3400
stopped : False
Local maintenance : False
crc32 : f562054b
local_conf_timestamp : 7368
Host timestamp : 7367
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=7367 (Wed May 30 00:05:47 2018)
host-id=1
score=3400
vm_conf_refresh_time=7368 (Wed May 30 00:05:48 2018)
conf_on_shared_storage=True
maintenance=False
state=EngineStart
stopped=False
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : ovirt2.nwfiber.com
Host ID : 2
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "down_unexpected", "detail": "Down"}
Score : 0
stopped : False
Local maintenance : False
crc32 : 6822f86a
local_conf_timestamp : 3610
Host timestamp : 3610
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=3610 (Wed May 30 00:04:28 2018)
host-id=2
score=0
vm_conf_refresh_time=3610 (Wed May 30 00:04:28 2018)
conf_on_shared_storage=True
maintenance=False
state=EngineUnexpectedlyDown
stopped=False
timeout=Wed Dec 31 17:09:30 1969
--== Host 3 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : ovirt3.nwfiber.com
Host ID : 3
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 006512a3
local_conf_timestamp : 99294
Host timestamp : 99294
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=99294 (Wed May 30 00:04:36 2018)
host-id=3
score=3400
vm_conf_refresh_time=99294 (Wed May 30 00:04:36 2018)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
---------------
I saw that the HA score was 0 on ovirt2 (where I was focusing my
efforts) and that ovirt3 had a good HA score, so I tried to launch the
engine there:
[root@ovirt3 ff3d0185-d120-40ee-9792-ebbd5caaca8b]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': '50392390-4f89-435c-bf3b-8254b58a4ef7'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': u'50392390-4f89-435c-bf3b-8254b58a4ef7'})
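(To check whether that vmID is really the engine's, a hedged look at the
local hosted-engine config; I believe the key is "vmid" in
/etc/ovirt-hosted-engine/hosted-engine.conf, but treat the path and key
name as assumptions:)

grep -i vmid /etc/ovirt-hosted-engine/hosted-engine.conf
# compare the value against the 50392390-... uuid in the error above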
ovirt1 was returning the "ha-agent or storage not online" message, but
now it's hanging (probably for 2 minutes), so it may actually return
status this time. I figure that if it can't return status, there's no
point in trying to start the VM there.
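(Before writing ovirt1 off, a hedged sanity check that its HA services
and the engine storage mount are alive; the service names are the stock
ones, and the glusterSD path is the usual oVirt mount convention:)

systemctl status ovirt-ha-broker ovirt-ha-agent
grep glusterSD /proc/mounts    # the engine volume should appear here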
On Wed, May 30, 2018 at 12:08 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
> The agent.log was sent minutes ago in another message. Here's the
> gluster client log for the engine volume:
>
> [2018-05-30 06:31:12.343709] I [MSGID: 108026]
> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> 0-engine-replicate-0: performing metadata selfheal on
> 3a430e50-48cb-463c-8657-aab5458ac155
> [2018-05-30 06:31:12.348210] W [MSGID: 108015]
> [afr-self-heal-entry.c:56:afr_selfheal_entry_delete]
> 0-engine-replicate-0: expunging file
> 117c5234-e6d6-4f03-809b-a4ef6a6fac4d/hosted-engine.metadata
> (38a277cd-113d-4ca8-8f48-fadc944d3272) on engine-client-1
> [2018-05-30 06:31:12.360602] I [MSGID: 108026]
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
> Completed metadata selfheal on 3a430e50-48cb-463c-8657-aab5458ac155.
> sources=[0] 2 sinks=1
> [2018-05-30 06:31:12.379169] I [MSGID: 108026]
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
> Completed entry selfheal on 117c5234-e6d6-4f03-809b-a4ef6a6fac4d.
> sources=[0] 2 sinks=1
> [2018-05-30 06:32:54.756729] I [MSGID: 108026]
> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> 0-engine-replicate-0: performing metadata selfheal on
> 2ecf0f33-1385-416e-8619-129bdb62b61f
> [2018-05-30 06:32:54.763175] I [MSGID: 108026]
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
> Completed metadata selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
> sources=[0] 2 sinks=1
> [2018-05-30 06:37:06.581538] I [MSGID: 108026]
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
> Completed data selfheal on 06c557a9-b550-4e25-b190-804e50cea65b.
> sources=[0] 2 sinks=1
> [2018-05-30 06:39:55.739302] I [MSGID: 108026]
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
> sources=[0] 2 sinks=1
> [2018-05-30 06:44:04.911825] I [MSGID: 108026]
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
> sources=[0] 2 sinks=1
> [2018-05-30 06:58:05.123854] I [MSGID: 108026]
> [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> 0-engine-replicate-0: performing metadata selfheal on
> 03ca7a59-e8ee-4a70-bc68-47e972aeb801
> [2018-05-30 06:58:05.131929] I [MSGID: 108026]
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
> Completed metadata selfheal on 03ca7a59-e8ee-4a70-bc68-47e972aeb801.
> sources=[0] 2 sinks=1
> [2018-05-30 07:01:42.583991] I [MSGID: 108026]
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
> sources=[0] 2 sinks=1
> [2018-05-30 07:05:48.777982] I [MSGID: 108026]
> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-engine-replicate-0:
> Completed data selfheal on 2ecf0f33-1385-416e-8619-129bdb62b61f.
> sources=[0] 2 sinks=1
>
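> (Given all the self-heal activity above, a hedged check on whether
> heals are still pending on the engine volume; both are standard
> gluster commands:)
>
> gluster volume heal engine info               # files still needing heal
> gluster volume heal engine info split-brain   # any split-brain entries
>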
> --------
> broker log:
> --------
> Thread-1::INFO::2018-05-30 00:07:13,241::ping::60::ping.Ping::(action)
> Successfully pinged 192.168.8.1
> Listener::INFO::2018-05-30 00:07:14,344::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:07:19,356::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Thread-5::INFO::2018-05-30 00:07:22,081::engine_health::2
> 02::engine_health.EngineHealth::(_result_from_stats) VM not running on
> this host, status Down
> Listener::INFO::2018-05-30 00:07:24,369::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:07:29,382::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:07:34,393::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:07:39,406::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:07:44,419::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:07:49,431::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:07:54,444::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:07:59,452::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Thread-2::INFO::2018-05-30 00:08:03,736::mgmt_bridge::62:
> :mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
> Listener::INFO::2018-05-30 00:08:04,460::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:08:09,468::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Thread-4::INFO::2018-05-30 00:08:11,461::cpu_load_no_engi
> ne::116::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) VM stats
> do not contain cpu usage. VM might be down.
> Thread-4::INFO::2018-05-30 00:08:11,461::cpu_load_no_engi
> ne::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System
> load total=0.0169, engine=0.0000, non-engine=0.0169
> Listener::INFO::2018-05-30 00:08:14,481::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Listener::INFO::2018-05-30 00:08:19,490::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
> Thread-5::INFO::2018-05-30 00:08:22,175::engine_health::2
> 02::engine_health.EngineHealth::(_result_from_stats) VM not running on
> this host, status Down
> Thread-1::INFO::2018-05-30 00:08:22,436::ping::60::ping.Ping::(action)
> Successfully pinged 192.168.8.1
> Listener::INFO::2018-05-30 00:08:24,503::storage_broker::304::
> ovirt_hosted_engine_ha.broker.storage_broker.StorageBro
> ker::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
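>
> (The repeated PENDING above means vdsm's domain monitor never reached a
> valid state. A hedged way to query it directly, assuming this vdsm
> version ships vdsm-client and the getStorageRepoStats verb:)
>
> vdsm-client Host getStorageRepoStats   # per-domain monitor state/delay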
>
>
> On Wed, May 30, 2018 at 12:03 AM, Sahina Bose <sabose(a)redhat.com> wrote:
>
>> [Adding gluster-users back]
>>
>> Nothing amiss with volume info and status.
>> Can you check the agent.log and broker.log - will be under
>> /var/log/ovirt-hosted-engine-ha/
>>
>> Also the gluster client logs - under /var/log/glusterfs/rhev-data-c
>> enter-mnt-glusterSD<volume>.log
>>
>>
>>
>> On Wed, May 30, 2018 at 12:08 PM, Jim Kusznir <jim(a)palousetech.com>
>> wrote:
>>
>>> I believe the gluster data store for the engine is up and working
>>> correctly. The rest are not mounted, as the engine hasn't managed to
>>> start correctly yet. I did perform the copy-a-junk-file test on the
>>> data store: copy a file in, check that it's there on another host,
>>> then delete it and see that it disappears on the first host; it passed
>>> that test. Here's the info and status. (I have NOT performed the steps
>>> that Krutika and Ravishankar suggested yet, as I don't have my data
>>> volumes working again yet.)
>>>
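>>> (The junk-file test, spelled out as a hedged sketch; the mount
>>> directory name below is hypothetical, so ls /rhev/data-center/mnt/glusterSD
>>> first to find the real one:)
>>>
>>> # on ovirt2
>>> echo test > /rhev/data-center/mnt/glusterSD/ovirt1.nwfiber.com:_data/junk.txt
>>> # on ovirt1: confirm it arrived, then clean up
>>> cat /rhev/data-center/mnt/glusterSD/ovirt1.nwfiber.com:_data/junk.txt
>>> rm -f /rhev/data-center/mnt/glusterSD/ovirt1.nwfiber.com:_data/junk.txt
>>>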
>>> [root@ovirt2 images]# gluster volume info
>>>
>>> Volume Name: data
>>> Type: Replicate
>>> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (2 + 1) = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
>>> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
>>> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
>>> Options Reconfigured:
>>> diagnostics.count-fop-hits: on
>>> diagnostics.latency-measurement: on
>>> changelog.changelog: on
>>> geo-replication.ignore-pid-check: on
>>> geo-replication.indexing: on
>>> server.allow-insecure: on
>>> performance.readdir-ahead: on
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> features.shard: on
>>> features.shard-block-size: 512MB
>>> performance.low-prio-threads: 32
>>> cluster.data-self-heal-algorithm: full
>>> cluster.locking-scheme: granular
>>> cluster.shd-wait-qlength: 10000
>>> cluster.shd-max-threads: 8
>>> network.ping-timeout: 30
>>> user.cifs: off
>>> nfs.disable: on
>>> performance.strict-o-direct: on
>>>
>>> Volume Name: data-hdd
>>> Type: Replicate
>>> Volume ID: d342a3ab-16f3-49f0-bbcf-f788be8ac5f1
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 3 = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: 172.172.1.11:/gluster/brick3/data-hdd
>>> Brick2: 172.172.1.12:/gluster/brick3/data-hdd
>>> Brick3: 172.172.1.13:/gluster/brick3/data-hdd
>>> Options Reconfigured:
>>> diagnostics.count-fop-hits: on
>>> diagnostics.latency-measurement: on
>>> network.ping-timeout: 30
>>> server.allow-insecure: on
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> user.cifs: off
>>> features.shard: on
>>> cluster.shd-wait-qlength: 10000
>>> cluster.shd-max-threads: 8
>>> cluster.locking-scheme: granular
>>> cluster.data-self-heal-algorithm: full
>>> cluster.server-quorum-type: server
>>> cluster.quorum-type: auto
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> performance.low-prio-threads: 32
>>> performance.stat-prefetch: off
>>> performance.io-cache: off
>>> performance.read-ahead: off
>>> performance.quick-read: off
>>> changelog.changelog: on
>>> geo-replication.ignore-pid-check: on
>>> geo-replication.indexing: on
>>> transport.address-family: inet
>>> performance.readdir-ahead: on
>>>
>>> Volume Name: engine
>>> Type: Replicate
>>> Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (2 + 1) = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine
>>> Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine
>>> Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter)
>>> Options Reconfigured:
>>> changelog.changelog: on
>>> geo-replication.ignore-pid-check: on
>>> geo-replication.indexing: on
>>> performance.readdir-ahead: on
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> cluster.eager-lock: enable
>>> network.remote-dio: off
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> features.shard: on
>>> features.shard-block-size: 512MB
>>> performance.low-prio-threads: 32
>>> cluster.data-self-heal-algorithm: full
>>> cluster.locking-scheme: granular
>>> cluster.shd-wait-qlength: 10000
>>> cluster.shd-max-threads: 6
>>> network.ping-timeout: 30
>>> user.cifs: off
>>> nfs.disable: on
>>> performance.strict-o-direct: on
>>>
>>> Volume Name: iso
>>> Type: Replicate
>>> Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (2 + 1) = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso
>>> Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso
>>> Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter)
>>> Options Reconfigured:
>>> performance.strict-o-direct: on
>>> nfs.disable: on
>>> user.cifs: off
>>> network.ping-timeout: 30
>>> cluster.shd-max-threads: 6
>>> cluster.shd-wait-qlength: 10000
>>> cluster.locking-scheme: granular
>>> cluster.data-self-heal-algorithm: full
>>> performance.low-prio-threads: 32
>>> features.shard-block-size: 512MB
>>> features.shard: on
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> cluster.server-quorum-type: server
>>> cluster.quorum-type: auto
>>> network.remote-dio: off
>>> cluster.eager-lock: enable
>>> performance.stat-prefetch: off
>>> performance.io-cache: off
>>> performance.read-ahead: off
>>> performance.quick-read: off
>>> performance.readdir-ahead: on
>>>
>>>
>>> [root@ovirt2 images]# gluster volume status
>>> Status of volume: data
>>> Gluster process                                TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick ovirt1.nwfiber.com:/gluster/brick2/data     49152      0        Y     3226
>>> Brick ovirt2.nwfiber.com:/gluster/brick2/data     49152      0        Y     2967
>>> Brick ovirt3.nwfiber.com:/gluster/brick2/data     49152      0        Y     2554
>>> Self-heal Daemon on localhost                       N/A      N/A      Y     4818
>>> Self-heal Daemon on ovirt3.nwfiber.com              N/A      N/A      Y     17853
>>> Self-heal Daemon on ovirt1.nwfiber.com              N/A      N/A      Y     4771
>>>
>>> Task Status of Volume data
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> Status of volume: data-hdd
>>> Gluster process                                TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick 172.172.1.11:/gluster/brick3/data-hdd       49153      0        Y     3232
>>> Brick 172.172.1.12:/gluster/brick3/data-hdd       49153      0        Y     2976
>>> Brick 172.172.1.13:/gluster/brick3/data-hdd         N/A      N/A      N     N/A
>>> NFS Server on localhost                             N/A      N/A      N     N/A
>>> Self-heal Daemon on localhost                       N/A      N/A      Y     4818
>>> NFS Server on ovirt3.nwfiber.com                    N/A      N/A      N     N/A
>>> Self-heal Daemon on ovirt3.nwfiber.com              N/A      N/A      Y     17853
>>> NFS Server on ovirt1.nwfiber.com                    N/A      N/A      N     N/A
>>> Self-heal Daemon on ovirt1.nwfiber.com              N/A      N/A      Y     4771
>>>
>>> Task Status of Volume data-hdd
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> Status of volume: engine
>>> Gluster process                                TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick ovirt1.nwfiber.com:/gluster/brick1/engine   49154      0        Y     3239
>>> Brick ovirt2.nwfiber.com:/gluster/brick1/engine   49154      0        Y     2982
>>> Brick ovirt3.nwfiber.com:/gluster/brick1/engine   49154      0        Y     2578
>>> Self-heal Daemon on localhost                       N/A      N/A      Y     4818
>>> Self-heal Daemon on ovirt3.nwfiber.com              N/A      N/A      Y     17853
>>> Self-heal Daemon on ovirt1.nwfiber.com              N/A      N/A      Y     4771
>>>
>>> Task Status of Volume engine
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> Status of volume: iso
>>> Gluster process                                TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick ovirt1.nwfiber.com:/gluster/brick4/iso      49155      0        Y     3247
>>> Brick ovirt2.nwfiber.com:/gluster/brick4/iso      49155      0        Y     2990
>>> Brick ovirt3.nwfiber.com:/gluster/brick4/iso      49155      0        Y     2580
>>> Self-heal Daemon on localhost                       N/A      N/A      Y     4818
>>> Self-heal Daemon on ovirt3.nwfiber.com              N/A      N/A      Y     17853
>>> Self-heal Daemon on ovirt1.nwfiber.com              N/A      N/A      Y     4771
>>>
>>> Task Status of Volume iso
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>>
>>> On Tue, May 29, 2018 at 11:30 PM, Sahina Bose <sabose(a)redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, May 30, 2018 at 10:42 AM, Jim Kusznir <jim(a)palousetech.com>
>>>> wrote:
>>>>
>>>>> hosted-engine --deploy failed (would not come up on my existing
>>>>> gluster storage). However, I realized no changes were written to my
>>>>> existing storage, so I went back to trying to get my old engine
>>>>> running.
>>>>>
>>>>> hosted-engine --vm-status is now taking a very long time (5+ minutes)
>>>>> to return, and it returns stale information everywhere. I thought
>>>>> perhaps the lockspace is corrupt, so I tried to clean that and the
>>>>> metadata, but both are failing (--clean-metadata has hung and I can't
>>>>> even ctrl-c out of it).
>>>>>
>>>>> How can I reinitialize all the lockspace/metadata safely? There are
>>>>> no engine or VMs running currently.
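>>>>>
>>>>> (For context, the reinitialization path as I understand it; a hedged
>>>>> sketch only, assuming the HA services are stopped on every host
>>>>> first and that these flags exist in this hosted-engine version:)
>>>>>
>>>>> systemctl stop ovirt-ha-agent ovirt-ha-broker   # on all hosts
>>>>> hosted-engine --reinitialize-lockspace --force
>>>>> hosted-engine --clean-metadata --host-id=1 --force-clean
>>>>> systemctl start ovirt-ha-broker ovirt-ha-agent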
>>>>>
>>>>
>>>> I think the first thing to make sure of is that your storage is up
>>>> and running. Can you mount the gluster volumes and access the
>>>> contents there?
>>>> Please provide the gluster volume info and gluster volume status of
>>>> the volumes that you're using.
>>>>
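>>>> (A minimal manual mount check, as a hedged sketch; it assumes the
>>>> engine volume name shown elsewhere in this thread and a throwaway
>>>> mountpoint:)
>>>>
>>>> mkdir -p /mnt/enginetest
>>>> mount -t glusterfs ovirt1.nwfiber.com:/engine /mnt/enginetest
>>>> ls -l /mnt/enginetest   # should show the storage domain UUID dir
>>>> umount /mnt/enginetest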
>>>>
>>>>
>>>>> --Jim
>>>>>
>>>>> On Tue, May 29, 2018 at 9:33 PM, Jim Kusznir <jim(a)palousetech.com>
>>>>> wrote:
>>>>>
>>>>>> Well, things went from bad to very, very bad....
>>>>>>
>>>>>> It appears that during one of the 2-minute lockups, the fencing
>>>>>> agents decided that another node in the cluster was down. As a
>>>>>> result, 2 of the 3 nodes were simultaneously reset with a fencing
>>>>>> agent reboot. After the nodes came back up, the engine would not
>>>>>> start. All running VMs (including VMs on the 3rd node that was not
>>>>>> rebooted) crashed.
>>>>>>
>>>>>> I've now been working for about 3 hours trying to get the engine to
>>>>>> come up, and I don't know why it won't start. hosted-engine
>>>>>> --vm-start says it's starting, but it doesn't start (virsh doesn't
>>>>>> show any VMs running). I'm currently running --deploy, as I had run
>>>>>> out of options for anything else I could come up with. I hope this
>>>>>> will allow me to re-import all my existing VMs and start them back
>>>>>> up after everything comes back up.
>>>>>>
>>>>>> I do have an unverified geo-rep backup; I don't know if it is a
>>>>>> good backup (there were several prior messages to this list, but I
>>>>>> didn't get replies to my questions; it was behaving in a way I
>>>>>> believe to be strange, and the data directories are larger than
>>>>>> their source).
>>>>>>
>>>>>> I'll see if my --deploy works, and if not, I'll be back with
>>>>>> another message/help request.
>>>>>>
>>>>>> When the dust settles and I'm at least minimally functional again,
>>>>>> I really want to understand why all these technologies designed to
>>>>>> offer redundancy conspired to reduce uptime and create failures
>>>>>> where there weren't any otherwise. I thought that hosted engine,
>>>>>> 3 oVirt servers, and glusterfs with at minimum replica 2+arbiter or
>>>>>> replica 3 should have offered strong resilience against server or
>>>>>> disk failure, and should have prevented or recovered from data
>>>>>> corruption. Instead, all of the above happened (once I get my
>>>>>> cluster back up, I still have to try to recover my webserver VM,
>>>>>> which won't boot due to corrupt XFS journal issues created during
>>>>>> the gluster crashes). I think a lot of these issues were rooted in
>>>>>> the upgrade from 4.1 to 4.2.
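>>>>>>
>>>>>> (On the corrupt XFS journal: a hedged sketch of the standard
>>>>>> recovery, assuming the VM disk can be attached to a rescue system
>>>>>> as a hypothetical /dev/vdb1; zeroing the log discards unreplayed
>>>>>> metadata, so it is a last resort:)
>>>>>>
>>>>>> xfs_repair /dev/vdb1      # try a normal repair first
>>>>>> xfs_repair -L /dev/vdb1   # only if it refuses due to a dirty log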
>>>>>>
>>>>>> --Jim
>>>>>>
>>>>>> On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir <jim(a)palousetech.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I also finally found the following in my system log on one
>>>>>>> server:
>>>>>>>
>>>>>>> [10679.524491] INFO: task glusterclogro:14933 blocked for
more than
>>>>>>> 120 seconds.
>>>>>>> [10679.525826] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [10679.527144] glusterclogro D ffff97209832bf40 0
14933
>>>>>>> 1 0x00000080
>>>>>>> [10679.527150] Call Trace:
>>>>>>> [10679.527161] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [10679.527218] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>>> [xfs]
>>>>>>> [10679.527225] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>> [10679.527254] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>> [xfs]
>>>>>>> [10679.527260] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>> [10679.527268] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [10679.527271] [<ffffffffb944f3d0>]
SyS_fsync+0x10/0x20
>>>>>>> [10679.527275] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>>> [10679.527279] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [10679.527283] INFO: task glusterposixfsy:14941 blocked for
more
>>>>>>> than 120 seconds.
>>>>>>> [10679.528608] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [10679.529956] glusterposixfsy D ffff972495f84f10 0
14941
>>>>>>> 1 0x00000080
>>>>>>> [10679.529961] Call Trace:
>>>>>>> [10679.529966] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [10679.530003] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>>> [xfs]
>>>>>>> [10679.530008] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>> [10679.530038] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>> [xfs]
>>>>>>> [10679.530042] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>> [10679.530046] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [10679.530050] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>>> [10679.530054] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>>> [10679.530058] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [10679.530062] INFO: task glusteriotwr13:15486 blocked for
more
>>>>>>> than 120 seconds.
>>>>>>> [10679.531805] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [10679.533732] glusteriotwr13 D ffff9720a83f0000 0
15486
>>>>>>> 1 0x00000080
>>>>>>> [10679.533738] Call Trace:
>>>>>>> [10679.533747] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [10679.533799] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>>> [xfs]
>>>>>>> [10679.533806] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>> [10679.533846] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>> [xfs]
>>>>>>> [10679.533852] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>> [10679.533858] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [10679.533863] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>>> [10679.533868] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>>> [10679.533873] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [10919.512757] INFO: task glusterclogro:14933 blocked for
more than
>>>>>>> 120 seconds.
>>>>>>> [10919.514714] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [10919.516663] glusterclogro D ffff97209832bf40 0
14933
>>>>>>> 1 0x00000080
>>>>>>> [10919.516677] Call Trace:
>>>>>>> [10919.516690] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [10919.516696] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>>> [10919.516703] [<ffffffffb951cc04>] ?
blk_finish_plug+0x14/0x40
>>>>>>> [10919.516768] [<ffffffffc05e9224>] ?
_xfs_buf_ioapply+0x334/0x460
>>>>>>> [xfs]
>>>>>>> [10919.516774] [<ffffffffb991432d>]
wait_for_completion+0xfd/0x140
>>>>>>> [10919.516782] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>> [10919.516821] [<ffffffffc05eb0a3>] ?
_xfs_buf_read+0x23/0x40 [xfs]
>>>>>>> [10919.516859] [<ffffffffc05eafa9>]
xfs_buf_submit_wait+0xf9/0x1d0
>>>>>>> [xfs]
>>>>>>> [10919.516902] [<ffffffffc061b279>] ?
>>>>>>> xfs_trans_read_buf_map+0x199/0x400 [xfs]
>>>>>>> [10919.516940] [<ffffffffc05eb0a3>]
_xfs_buf_read+0x23/0x40 [xfs]
>>>>>>> [10919.516977] [<ffffffffc05eb1b9>]
xfs_buf_read_map+0xf9/0x160
>>>>>>> [xfs]
>>>>>>> [10919.517022] [<ffffffffc061b279>]
xfs_trans_read_buf_map+0x199/0x400
>>>>>>> [xfs]
>>>>>>> [10919.517057] [<ffffffffc05c8d04>]
xfs_da_read_buf+0xd4/0x100
>>>>>>> [xfs]
>>>>>>> [10919.517091] [<ffffffffc05c8d53>]
xfs_da3_node_read+0x23/0xd0
>>>>>>> [xfs]
>>>>>>> [10919.517126] [<ffffffffc05c9fee>]
xfs_da3_node_lookup_int+0x6e/0x2f0
>>>>>>> [xfs]
>>>>>>> [10919.517160] [<ffffffffc05d5a1d>]
xfs_dir2_node_lookup+0x4d/0x170
>>>>>>> [xfs]
>>>>>>> [10919.517194] [<ffffffffc05ccf5d>]
xfs_dir_lookup+0x1bd/0x1e0
>>>>>>> [xfs]
>>>>>>> [10919.517233] [<ffffffffc05fd8d9>]
xfs_lookup+0x69/0x140 [xfs]
>>>>>>> [10919.517271] [<ffffffffc05fa018>]
xfs_vn_lookup+0x78/0xc0 [xfs]
>>>>>>> [10919.517278] [<ffffffffb9425cf3>]
lookup_real+0x23/0x60
>>>>>>> [10919.517283] [<ffffffffb9426702>]
__lookup_hash+0x42/0x60
>>>>>>> [10919.517288] [<ffffffffb942d519>]
SYSC_renameat2+0x3a9/0x5a0
>>>>>>> [10919.517296] [<ffffffffb94d3753>] ?
>>>>>>> selinux_file_free_security+0x23/0x30
>>>>>>> [10919.517304] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [10919.517309] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [10919.517313] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [10919.517318] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [10919.517323] [<ffffffffb942e58e>]
SyS_renameat2+0xe/0x10
>>>>>>> [10919.517328] [<ffffffffb942e5ce>]
SyS_rename+0x1e/0x20
>>>>>>> [10919.517333] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>>> [10919.517339] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [11159.496095] INFO: task glusteriotwr9:15482 blocked for
more than
>>>>>>> 120 seconds.
>>>>>>> [11159.497546] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [11159.498978] glusteriotwr9 D ffff971fa0fa1fa0 0
15482
>>>>>>> 1 0x00000080
>>>>>>> [11159.498984] Call Trace:
>>>>>>> [11159.498995] [<ffffffffb9911f00>] ?
bit_wait+0x50/0x50
>>>>>>> [11159.498999] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [11159.499003] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>>> [11159.499056] [<ffffffffc05dd9b7>] ?
>>>>>>> xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs]
>>>>>>> [11159.499082] [<ffffffffc05dd43e>] ?
>>>>>>> xfs_iext_bno_to_irec+0x8e/0xd0 [xfs]
>>>>>>> [11159.499090] [<ffffffffb92f7a12>] ?
ktime_get_ts64+0x52/0xf0
>>>>>>> [11159.499093] [<ffffffffb9911f00>] ?
bit_wait+0x50/0x50
>>>>>>> [11159.499097] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130
>>>>>>> [11159.499101] [<ffffffffb9913528>]
io_schedule+0x18/0x20
>>>>>>> [11159.499104] [<ffffffffb9911f11>]
bit_wait_io+0x11/0x50
>>>>>>> [11159.499107] [<ffffffffb9911ac1>]
__wait_on_bit_lock+0x61/0xc0
>>>>>>> [11159.499113] [<ffffffffb9393634>]
__lock_page+0x74/0x90
>>>>>>> [11159.499118] [<ffffffffb92bc210>] ?
wake_bit_function+0x40/0x40
>>>>>>> [11159.499121] [<ffffffffb9394154>]
__find_lock_page+0x54/0x70
>>>>>>> [11159.499125] [<ffffffffb9394e85>]
grab_cache_page_write_begin+0x
>>>>>>> 55/0xc0
>>>>>>> [11159.499130] [<ffffffffb9484b76>]
iomap_write_begin+0x66/0x100
>>>>>>> [11159.499135] [<ffffffffb9484edf>]
iomap_write_actor+0xcf/0x1d0
>>>>>>> [11159.499140] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>>>> [11159.499144] [<ffffffffb94854e7>]
iomap_apply+0xb7/0x150
>>>>>>> [11159.499149] [<ffffffffb9485621>]
iomap_file_buffered_write+0xa1
>>>>>>> /0xe0
>>>>>>> [11159.499153] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>>>> [11159.499182] [<ffffffffc05f025d>]
xfs_file_buffered_aio_write+0x12d/0x2c0
>>>>>>> [xfs]
>>>>>>> [11159.499213] [<ffffffffc05f057d>]
xfs_file_aio_write+0x18d/0x1b0
>>>>>>> [xfs]
>>>>>>> [11159.499217] [<ffffffffb941a533>]
do_sync_write+0x93/0xe0
>>>>>>> [11159.499222] [<ffffffffb941b010>]
vfs_write+0xc0/0x1f0
>>>>>>> [11159.499225] [<ffffffffb941c002>]
SyS_pwrite64+0x92/0xc0
>>>>>>> [11159.499230] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [11159.499234] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>>> [11159.499238] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [11279.488720] INFO: task xfsaild/dm-10:1134 blocked for more
than
>>>>>>> 120 seconds.
>>>>>>> [11279.490197] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [11279.491665] xfsaild/dm-10 D ffff9720a8660fd0 0
1134
>>>>>>> 2 0x00000000
>>>>>>> [11279.491671] Call Trace:
>>>>>>> [11279.491682] [<ffffffffb92a3a2e>] ?
>>>>>>> try_to_del_timer_sync+0x5e/0x90
>>>>>>> [11279.491688] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [11279.491744] [<ffffffffc060de36>]
_xfs_log_force+0x1c6/0x2c0
>>>>>>> [xfs]
>>>>>>> [11279.491750] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>> [11279.491783] [<ffffffffc0619fec>] ?
xfsaild+0x16c/0x6f0 [xfs]
>>>>>>> [11279.491817] [<ffffffffc060df5c>]
xfs_log_force+0x2c/0x70 [xfs]
>>>>>>> [11279.491849] [<ffffffffc0619e80>] ?
>>>>>>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>>>>>>> [11279.491880] [<ffffffffc0619fec>]
xfsaild+0x16c/0x6f0 [xfs]
>>>>>>> [11279.491913] [<ffffffffc0619e80>] ?
>>>>>>> xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>>>>>>> [11279.491919] [<ffffffffb92bb161>] kthread+0xd1/0xe0
>>>>>>> [11279.491926] [<ffffffffb92bb090>] ?
insert_kthread_work+0x40/0x40
>>>>>>> [11279.491932] [<ffffffffb9920677>]
ret_from_fork_nospec_begin+0x2
>>>>>>> 1/0x21
>>>>>>> [11279.491936] [<ffffffffb92bb090>] ?
insert_kthread_work+0x40/0x40
>>>>>>> [11279.491976] INFO: task glusterclogfsyn:14934 blocked for
more
>>>>>>> than 120 seconds.
>>>>>>> [11279.493466] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [11279.494952] glusterclogfsyn D ffff97209832af70 0
14934
>>>>>>> 1 0x00000080
>>>>>>> [11279.494957] Call Trace:
>>>>>>> [11279.494979] [<ffffffffc0309839>] ?
>>>>>>> __split_and_process_bio+0x2e9/0x520 [dm_mod]
>>>>>>> [11279.494983] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [11279.494987] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>>> [11279.494997] [<ffffffffc0309d98>] ?
dm_make_request+0x128/0x1a0
>>>>>>> [dm_mod]
>>>>>>> [11279.495001] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130
>>>>>>> [11279.495005] [<ffffffffb99145ad>]
wait_for_completion_io+0xfd/0x
>>>>>>> 140
>>>>>>> [11279.495010] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>> [11279.495016] [<ffffffffb951e574>]
blkdev_issue_flush+0xb4/0x110
>>>>>>> [11279.495049] [<ffffffffc06064b9>]
xfs_blkdev_issue_flush+0x19/0x20
>>>>>>> [xfs]
>>>>>>> [11279.495079] [<ffffffffc05eec40>]
xfs_file_fsync+0x1b0/0x1e0
>>>>>>> [xfs]
>>>>>>> [11279.495086] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>> [11279.495090] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [11279.495094] [<ffffffffb944f3d0>]
SyS_fsync+0x10/0x20
>>>>>>> [11279.495098] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>>> [11279.495102] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [11279.495105] INFO: task glusterposixfsy:14941 blocked for
more
>>>>>>> than 120 seconds.
>>>>>>> [11279.496606] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [11279.498114] glusterposixfsy D ffff972495f84f10 0
14941
>>>>>>> 1 0x00000080
>>>>>>> [11279.498118] Call Trace:
>>>>>>> [11279.498134] [<ffffffffc0309839>] ?
>>>>>>> __split_and_process_bio+0x2e9/0x520 [dm_mod]
>>>>>>> [11279.498138] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [11279.498142] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>>>> [11279.498152] [<ffffffffc0309d98>] ?
dm_make_request+0x128/0x1a0
>>>>>>> [dm_mod]
>>>>>>> [11279.498156] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130
>>>>>>> [11279.498160] [<ffffffffb99145ad>]
wait_for_completion_io+0xfd/0x
>>>>>>> 140
>>>>>>> [11279.498165] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>> [11279.498169] [<ffffffffb951e574>]
blkdev_issue_flush+0xb4/0x110
>>>>>>> [11279.498202] [<ffffffffc06064b9>]
xfs_blkdev_issue_flush+0x19/0x20
>>>>>>> [xfs]
>>>>>>> [11279.498231] [<ffffffffc05eec40>]
xfs_file_fsync+0x1b0/0x1e0
>>>>>>> [xfs]
>>>>>>> [11279.498238] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>> [11279.498242] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [11279.498246] [<ffffffffb944f3f3>]
SyS_fdatasync+0x13/0x20
>>>>>>> [11279.498250] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>>> [11279.498254] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [11279.498257] INFO: task glusteriotwr1:14950 blocked for
more than
>>>>>>> 120 seconds.
>>>>>>> [11279.499789] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [11279.501343] glusteriotwr1 D ffff97208b6daf70 0
14950
>>>>>>> 1 0x00000080
>>>>>>> [11279.501348] Call Trace:
>>>>>>> [11279.501353] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [11279.501390] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>>> [xfs]
>>>>>>> [11279.501396] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>> [11279.501428] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>> [xfs]
>>>>>>> [11279.501432] [<ffffffffb944ef3f>]
generic_write_sync+0x4f/0x70
>>>>>>> [11279.501461] [<ffffffffc05f0545>]
xfs_file_aio_write+0x155/0x1b0
>>>>>>> [xfs]
>>>>>>> [11279.501466] [<ffffffffb941a533>]
do_sync_write+0x93/0xe0
>>>>>>> [11279.501471] [<ffffffffb941b010>]
vfs_write+0xc0/0x1f0
>>>>>>> [11279.501475] [<ffffffffb941c002>]
SyS_pwrite64+0x92/0xc0
>>>>>>> [11279.501479] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [11279.501483] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>>> [11279.501489] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [11279.501493] INFO: task glusteriotwr4:14953 blocked for
more than
>>>>>>> 120 seconds.
>>>>>>> [11279.503047] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [11279.504630] glusteriotwr4 D ffff972499f2bf40 0
14953
>>>>>>> 1 0x00000080
>>>>>>> [11279.504635] Call Trace:
>>>>>>> [11279.504640] [<ffffffffb9913f79>]
schedule+0x29/0x70
>>>>>>> [11279.504676] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>>>> [xfs]
>>>>>>> [11279.504681] [<ffffffffb92cf1b0>] ?
wake_up_state+0x20/0x20
>>>>>>> [11279.504710] [<ffffffffc05eeb97>]
xfs_file_fsync+0x107/0x1e0
>>>>>>> [xfs]
>>>>>>> [11279.504714] [<ffffffffb944f0e7>]
do_fsync+0x67/0xb0
>>>>>>> [11279.504718] [<ffffffffb992076f>] ?
>>>>>>> system_call_after_swapgs+0xbc/0x160
>>>>>>> [11279.504722] [<ffffffffb944f3d0>]
SyS_fsync+0x10/0x20
>>>>>>> [11279.504725] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>>>> [11279.504730] [<ffffffffb992077b>] ?
>>>>>>> system_call_after_swapgs+0xc8/0x160
>>>>>>> [12127.466494] perf: interrupt took too long (8263 > 8150),
>>>>>>> lowering kernel.perf_event_max_sample_rate to 24000
>>>>>>>
>>>>>>> --------------------
>>>>>>> I think this is the cause of the massive oVirt performance issues,
>>>>>>> irrespective of gluster volume. At the time this happened, I was
>>>>>>> also ssh'ed into the host and was running some rpm query commands.
>>>>>>> I had just run rpm -qa | grep glusterfs (to verify what version
>>>>>>> was actually installed), and that command took almost 2 minutes to
>>>>>>> return! Normally it takes less than 2 seconds, and that is all
>>>>>>> pure local SSD IO, too.
>>>>>>>
>>>>>>> I'm no expert, but it's my understanding that any time software
>>>>>>> causes these kinds of issues, it's a serious bug in the software,
>>>>>>> even if it's mis-handled exceptions. Is this correct?
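>>>>>>>
>>>>>>> (A hedged way to watch for this while it happens; nothing here is
>>>>>>> oVirt-specific, just standard kernel and IO tooling:)
>>>>>>>
>>>>>>> dmesg | grep 'blocked for more than'   # hung-task warnings so far
>>>>>>> iostat -x 2                # per-device latency while the stall occurs
>>>>>>> cat /proc/<pid>/stack      # where a stuck process (e.g. rpm) is waiting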
>>>>>>>
>>>>>>> --Jim
>>>>>>>
>>>>>>> On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir <jim(a)palousetech.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think this is the profile information for one of the volumes
>>>>>>>> that lives on the SSDs and is fully operational with no
>>>>>>>> down/problem disks:
>>>>>>>>
>>>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data info
>>>>>>>> Brick: ovirt2.nwfiber.com:/gluster/brick2/data
>>>>>>>> ----------------------------------------------
>>>>>>>> Cumulative Stats:
>>>>>>>> Block Size:        256b+     512b+    1024b+
>>>>>>>> No. of Reads:        983      2696      1059
>>>>>>>> No. of Writes:         0      1113       302
>>>>>>>>
>>>>>>>> Block Size:       2048b+    4096b+    8192b+
>>>>>>>> No. of Reads:        852     88608     53526
>>>>>>>> No. of Writes:       522    812340     76257
>>>>>>>>
>>>>>>>> Block Size:      16384b+   32768b+   65536b+
>>>>>>>> No. of Reads:      54351    241901     15024
>>>>>>>> No. of Writes:     21636      8656      8976
>>>>>>>>
>>>>>>>> Block Size: 131072b+
>>>>>>>> No. of Reads: 524156
>>>>>>>> No. of Writes: 296071
>>>>>>>> %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls   Fop
>>>>>>>> ---------   -----------   -----------    -----------   ------------   ----
>>>>>>>>      0.00       0.00 us       0.00 us        0.00 us           4189   RELEASE
>>>>>>>>      0.00       0.00 us       0.00 us        0.00 us           1257   RELEASEDIR
>>>>>>>>      0.00      46.19 us      12.00 us      187.00 us             69   FLUSH
>>>>>>>>      0.00     147.00 us      78.00 us      367.00 us             86   REMOVEXATTR
>>>>>>>>      0.00     223.46 us      24.00 us     1166.00 us            149   READDIR
>>>>>>>>      0.00     565.34 us      76.00 us     3639.00 us             88   FTRUNCATE
>>>>>>>>      0.00     263.28 us      20.00 us    28385.00 us            228   LK
>>>>>>>>      0.00      98.84 us       2.00 us      880.00 us           1198   OPENDIR
>>>>>>>>      0.00      91.59 us      26.00 us    10371.00 us           3853   STATFS
>>>>>>>>      0.00     494.14 us      17.00 us   193439.00 us           1171   GETXATTR
>>>>>>>>      0.00     299.42 us      35.00 us     9799.00 us           2044   READDIRP
>>>>>>>>      0.00    1965.31 us     110.00 us   382258.00 us            321   XATTROP
>>>>>>>>      0.01     113.40 us      24.00 us    61061.00 us           8134   STAT
>>>>>>>>      0.01     755.38 us      57.00 us   607603.00 us           3196   DISCARD
>>>>>>>>      0.05    2690.09 us      58.00 us  2704761.00 us           3206   OPEN
>>>>>>>>      0.10  119978.25 us      97.00 us  9406684.00 us            154   SETATTR
>>>>>>>>      0.18     101.73 us      28.00 us   700477.00 us         313379   FSTAT
>>>>>>>>      0.23    1059.84 us      25.00 us  2716124.00 us          38255   LOOKUP
>>>>>>>>      0.47    1024.11 us      54.00 us  6197164.00 us          81455   FXATTROP
>>>>>>>>      1.72    2984.00 us      15.00 us 37098954.00 us         103020   FINODELK
>>>>>>>>      5.92   44315.32 us      51.00 us 24731536.00 us          23957   FSYNC
>>>>>>>>     13.27    2399.78 us      25.00 us 22089540.00 us         991005   READ
>>>>>>>>     37.00    5980.43 us      52.00 us 22099889.00 us        1108976   WRITE
>>>>>>>>     41.04    5452.75 us      13.00 us 22102452.00 us        1349053   INODELK
>>>>>>>>
>>>>>>>> Duration: 10026 seconds
>>>>>>>> Data Read: 80046027759 bytes
>>>>>>>> Data Written: 44496632320 bytes
>>>>>>>>
>>>>>>>> Interval 1 Stats:
>>>>>>>>    Block Size:       256b+      512b+     1024b+
>>>>>>>>  No. of Reads:         983       2696       1059
>>>>>>>> No. of Writes:           0        838        185
>>>>>>>>
>>>>>>>>    Block Size:      2048b+     4096b+     8192b+
>>>>>>>>  No. of Reads:         852      85856      51575
>>>>>>>> No. of Writes:         382     705802      57812
>>>>>>>>
>>>>>>>>    Block Size:     16384b+    32768b+    65536b+
>>>>>>>>  No. of Reads:       52673     232093      14984
>>>>>>>> No. of Writes:       13499       4908       4242
>>>>>>>>
>>>>>>>>    Block Size:    131072b+
>>>>>>>>  No. of Reads:      460040
>>>>>>>> No. of Writes:        6411
>>>>>>>>
>>>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           2093      RELEASE
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           1093   RELEASEDIR
>>>>>>>>       0.00      53.38 us      26.00 us      111.00 us             16        FLUSH
>>>>>>>>       0.00     145.14 us      78.00 us      367.00 us             71  REMOVEXATTR
>>>>>>>>       0.00     190.96 us     114.00 us      298.00 us             71      SETATTR
>>>>>>>>       0.00     213.38 us      24.00 us     1145.00 us             90      READDIR
>>>>>>>>       0.00     263.28 us      20.00 us    28385.00 us            228           LK
>>>>>>>>       0.00     101.76 us       2.00 us      880.00 us           1093      OPENDIR
>>>>>>>>       0.01      93.60 us      27.00 us    10371.00 us           3090       STATFS
>>>>>>>>       0.02     537.47 us      17.00 us   193439.00 us           1038     GETXATTR
>>>>>>>>       0.03     297.44 us      35.00 us     9799.00 us           1990     READDIRP
>>>>>>>>       0.03    2357.28 us     110.00 us   382258.00 us            253      XATTROP
>>>>>>>>       0.04     385.93 us      58.00 us    47593.00 us           2091         OPEN
>>>>>>>>       0.04     114.86 us      24.00 us    61061.00 us           7715         STAT
>>>>>>>>       0.06     444.59 us      57.00 us   333240.00 us           3053      DISCARD
>>>>>>>>       0.42     316.24 us      25.00 us   290728.00 us          29823       LOOKUP
>>>>>>>>       0.73     257.92 us      54.00 us   344812.00 us          63296     FXATTROP
>>>>>>>>       1.37      98.30 us      28.00 us    67621.00 us         313172        FSTAT
>>>>>>>>       1.58    2124.69 us      51.00 us   849200.00 us          16717        FSYNC
>>>>>>>>       5.73     162.46 us      52.00 us   748492.00 us         794079        WRITE
>>>>>>>>       7.19    2065.17 us      16.00 us 37098954.00 us          78381     FINODELK
>>>>>>>>      36.44     886.32 us      25.00 us  2216436.00 us         925421         READ
>>>>>>>>      46.30    1178.04 us      13.00 us  1700704.00 us         884635      INODELK
>>>>>>>>
>>>>>>>> Duration: 7485 seconds
>>>>>>>> Data Read: 71250527215 bytes
>>>>>>>> Data Written: 5119903744 bytes
>>>>>>>>
>>>>>>>> Brick: ovirt3.nwfiber.com:/gluster/brick2/data
>>>>>>>> ----------------------------------------------
>>>>>>>> Cumulative Stats:
>>>>>>>>    Block Size:         1b+
>>>>>>>>  No. of Reads:           0
>>>>>>>> No. of Writes:     3264419
>>>>>>>>
>>>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us             90       FORGET
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           9462      RELEASE
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           4254   RELEASEDIR
>>>>>>>>       0.00      50.52 us      13.00 us      190.00 us             71        FLUSH
>>>>>>>>       0.00     186.97 us      87.00 us      713.00 us             86  REMOVEXATTR
>>>>>>>>       0.00      79.32 us      33.00 us      189.00 us            228           LK
>>>>>>>>       0.00     220.98 us     129.00 us      513.00 us             86      SETATTR
>>>>>>>>       0.01     259.30 us      26.00 us     2632.00 us            137      READDIR
>>>>>>>>       0.02     322.76 us     145.00 us     2125.00 us            321      XATTROP
>>>>>>>>       0.03     109.55 us       2.00 us     1258.00 us           1193      OPENDIR
>>>>>>>>       0.05      70.21 us      21.00 us      431.00 us           3196      DISCARD
>>>>>>>>       0.05     169.26 us      21.00 us     2315.00 us           1545     GETXATTR
>>>>>>>>       0.12     176.85 us      63.00 us     2844.00 us           3206         OPEN
>>>>>>>>       0.61     303.49 us      90.00 us     3085.00 us           9633        FSTAT
>>>>>>>>       2.44     305.66 us      28.00 us     3716.00 us          38230       LOOKUP
>>>>>>>>       4.52     266.22 us      55.00 us    53424.00 us          81455     FXATTROP
>>>>>>>>       6.96    1397.99 us      51.00 us    64822.00 us          23889        FSYNC
>>>>>>>>      16.48      84.74 us      25.00 us     6917.00 us         932592        WRITE
>>>>>>>>      30.16     106.90 us      13.00 us  3920189.00 us        1353046      INODELK
>>>>>>>>      38.55    1794.52 us      14.00 us 16210553.00 us         103039     FINODELK
>>>>>>>>
>>>>>>>> Duration: 66562 seconds
>>>>>>>> Data Read: 0 bytes
>>>>>>>> Data Written: 3264419 bytes
>>>>>>>>
>>>>>>>> Interval 1 Stats:
>>>>>>>>    Block Size:         1b+
>>>>>>>>  No. of Reads:           0
>>>>>>>> No. of Writes:      794080
>>>>>>>>
>>>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           2093      RELEASE
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           1093   RELEASEDIR
>>>>>>>>       0.00      70.31 us      26.00 us      125.00 us             16        FLUSH
>>>>>>>>       0.00     193.10 us     103.00 us      713.00 us             71  REMOVEXATTR
>>>>>>>>       0.01     227.32 us     133.00 us      513.00 us             71      SETATTR
>>>>>>>>       0.01      79.32 us      33.00 us      189.00 us            228           LK
>>>>>>>>       0.01     259.83 us      35.00 us     1138.00 us             89      READDIR
>>>>>>>>       0.03     318.26 us     145.00 us     2047.00 us            253      XATTROP
>>>>>>>>       0.04     112.67 us       3.00 us     1258.00 us           1093      OPENDIR
>>>>>>>>       0.06     167.98 us      23.00 us     1951.00 us           1014     GETXATTR
>>>>>>>>       0.08      70.97 us      22.00 us      431.00 us           3053      DISCARD
>>>>>>>>       0.13     183.78 us      66.00 us     2844.00 us           2091         OPEN
>>>>>>>>       1.01     303.82 us      90.00 us     3085.00 us           9610        FSTAT
>>>>>>>>       3.27     316.59 us      30.00 us     3716.00 us          29820       LOOKUP
>>>>>>>>       5.83     265.79 us      59.00 us    53424.00 us          63296     FXATTROP
>>>>>>>>       7.95    1373.89 us      51.00 us    64822.00 us          16717        FSYNC
>>>>>>>>      23.17     851.99 us      14.00 us 16210553.00 us          78555     FINODELK
>>>>>>>>      24.04      87.44 us      27.00 us     6917.00 us         794081        WRITE
>>>>>>>>      34.36     111.91 us      14.00 us   984871.00 us         886790      INODELK
>>>>>>>>
>>>>>>>> Duration: 7485 seconds
>>>>>>>> Data Read: 0 bytes
>>>>>>>> Data Written: 794080 bytes
>>>>>>>>
>>>>>>>>
>>>>>>>> -----------------------
>>>>>>>> Here is the data from the volume that is backed by the SSHDs and has
>>>>>>>> one failed disk:
>>>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data-hdd info
>>>>>>>> Brick: 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>>> --------------------------------------------
>>>>>>>> Cumulative Stats:
>>>>>>>>    Block Size:       256b+      512b+     1024b+
>>>>>>>>  No. of Reads:        1702         86         16
>>>>>>>> No. of Writes:           0        767         71
>>>>>>>>
>>>>>>>>    Block Size:      2048b+     4096b+     8192b+
>>>>>>>>  No. of Reads:          19      51841       2049
>>>>>>>> No. of Writes:          76      60668      35727
>>>>>>>>
>>>>>>>>    Block Size:     16384b+    32768b+    65536b+
>>>>>>>>  No. of Reads:        1744        639       1088
>>>>>>>> No. of Writes:        8524       2410       1285
>>>>>>>>
>>>>>>>>    Block Size:    131072b+
>>>>>>>>  No. of Reads:      771999
>>>>>>>> No. of Writes:       29584
>>>>>>>>
>>>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           2902      RELEASE
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           1517   RELEASEDIR
>>>>>>>>       0.00     197.00 us     197.00 us      197.00 us              1    FTRUNCATE
>>>>>>>>       0.00      70.24 us      16.00 us      758.00 us             51        FLUSH
>>>>>>>>       0.00     143.93 us      82.00 us      305.00 us             57  REMOVEXATTR
>>>>>>>>       0.00     178.63 us     105.00 us      712.00 us             60      SETATTR
>>>>>>>>       0.00      67.30 us      19.00 us      572.00 us            555           LK
>>>>>>>>       0.00     322.80 us      23.00 us     4673.00 us            138      READDIR
>>>>>>>>       0.00     336.56 us     106.00 us    11994.00 us            237      XATTROP
>>>>>>>>       0.00      84.70 us      28.00 us     1071.00 us           3469       STATFS
>>>>>>>>       0.01     387.75 us       2.00 us   146017.00 us           1467      OPENDIR
>>>>>>>>       0.01     148.59 us      21.00 us    64374.00 us           4454         STAT
>>>>>>>>       0.02     783.02 us      16.00 us    93502.00 us           1902     GETXATTR
>>>>>>>>       0.03    1516.10 us      17.00 us   210690.00 us           1364      ENTRYLK
>>>>>>>>       0.03    2555.47 us     300.00 us   674454.00 us           1064     READDIRP
>>>>>>>>       0.07      85.74 us      19.00 us    68340.00 us          62849        FSTAT
>>>>>>>>       0.07    1978.12 us      59.00 us   202596.00 us           2729         OPEN
>>>>>>>>       0.22     708.57 us      15.00 us   394799.00 us          25447       LOOKUP
>>>>>>>>       5.94    2331.74 us      15.00 us  1099530.00 us         207534     FINODELK
>>>>>>>>       7.31    8311.75 us      58.00 us  1800216.00 us          71668     FXATTROP
>>>>>>>>      12.49    7735.19 us      51.00 us  3595513.00 us         131642        WRITE
>>>>>>>>      17.70     957.08 us      16.00 us 13700466.00 us        1508160      INODELK
>>>>>>>>      24.55    2546.43 us      26.00 us  5077347.00 us         786060         READ
>>>>>>>>      31.56   49699.15 us      47.00 us  3746331.00 us          51777        FSYNC
>>>>>>>>
>>>>>>>> Duration: 10101 seconds
>>>>>>>> Data Read: 101562897361 bytes
>>>>>>>> Data Written: 4834450432 bytes
>>>>>>>>
>>>>>>>> Interval 0 Stats:
>>>>>>>>    Block Size:       256b+      512b+     1024b+
>>>>>>>>  No. of Reads:        1702         86         16
>>>>>>>> No. of Writes:           0        767         71
>>>>>>>>
>>>>>>>>    Block Size:      2048b+     4096b+     8192b+
>>>>>>>>  No. of Reads:          19      51841       2049
>>>>>>>> No. of Writes:          76      60668      35727
>>>>>>>>
>>>>>>>>    Block Size:     16384b+    32768b+    65536b+
>>>>>>>>  No. of Reads:        1744        639       1088
>>>>>>>> No. of Writes:        8524       2410       1285
>>>>>>>>
>>>>>>>>    Block Size:    131072b+
>>>>>>>>  No. of Reads:      771999
>>>>>>>> No. of Writes:       29584
>>>>>>>>
>>>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           2902      RELEASE
>>>>>>>>       0.00       0.00 us       0.00 us        0.00 us           1517   RELEASEDIR
>>>>>>>>       0.00     197.00 us     197.00 us      197.00 us              1    FTRUNCATE
>>>>>>>>       0.00      70.24 us      16.00 us      758.00 us             51        FLUSH
>>>>>>>>       0.00     143.93 us      82.00 us      305.00 us             57  REMOVEXATTR
>>>>>>>>       0.00     178.63 us     105.00 us      712.00 us             60      SETATTR
>>>>>>>>       0.00      67.30 us      19.00 us      572.00 us            555           LK
>>>>>>>>       0.00     322.80 us      23.00 us     4673.00 us            138      READDIR
>>>>>>>>       0.00     336.56 us     106.00 us    11994.00 us            237      XATTROP
>>>>>>>>       0.00      84.70 us      28.00 us     1071.00 us           3469       STATFS
>>>>>>>>       0.01     387.75 us       2.00 us   146017.00 us           1467      OPENDIR
>>>>>>>>       0.01     148.59 us      21.00 us    64374.00 us           4454         STAT
>>>>>>>>       0.02     783.02 us      16.00 us    93502.00 us           1902     GETXATTR
>>>>>>>>       0.03    1516.10 us      17.00 us   210690.00 us           1364      ENTRYLK
>>>>>>>>       0.03    2555.47 us     300.00 us   674454.00 us           1064     READDIRP
>>>>>>>>       0.07      85.73 us      19.00 us    68340.00 us          62849        FSTAT
>>>>>>>>       0.07    1978.12 us      59.00 us   202596.00 us           2729         OPEN
>>>>>>>>       0.22     708.57 us      15.00 us   394799.00 us          25447       LOOKUP
>>>>>>>>       5.94    2334.57 us      15.00 us  1099530.00 us         207534     FINODELK
>>>>>>>>       7.31    8311.49 us      58.00 us  1800216.00 us          71668     FXATTROP
>>>>>>>>      12.49    7735.32 us      51.00 us  3595513.00 us         131642        WRITE
>>>>>>>>      17.71     957.08 us      16.00 us 13700466.00 us        1508160      INODELK
>>>>>>>>      24.56    2546.42 us      26.00 us  5077347.00 us         786060         READ
>>>>>>>>      31.54   49651.63 us      47.00 us  3746331.00 us          51777        FSYNC
>>>>>>>>
>>>>>>>> Duration: 10101 seconds
>>>>>>>> Data Read: 101562897361 bytes
>>>>>>>> Data Written: 4834450432 bytes
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you for your response.
>>>>>>>>>
>>>>>>>>> I have 4 gluster volumes. 3 are replica 2 + arbiter; the replica
>>>>>>>>> bricks are on ovirt1 and ovirt2, with the arbiter on ovirt3. The 4th
>>>>>>>>> volume is replica 3, with a brick on all three ovirt machines.
>>>>>>>>>
>>>>>>>>> The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD
>>>>>>>>> (the same in all three machines). On ovirt3, the SSHD has reported
>>>>>>>>> hard IO failures, and that brick is offline. However, the other two
>>>>>>>>> replicas are fully operational (although they still show contents in
>>>>>>>>> the heal info command that won't go away, but that may be the case
>>>>>>>>> until I replace the failed disk).
>>>>>>>>>
>>>>>>>>> What is bothering me is that ALL 4 gluster volumes are showing
>>>>>>>>> horrible performance issues. At this point, as the bad disk has been
>>>>>>>>> completely offlined, I would expect gluster to perform at normal
>>>>>>>>> speed, but that is definitely not the case.
>>>>>>>>>
>>>>>>>>> I've also noticed that the performance hits seem to come in waves:
>>>>>>>>> things work acceptably (but slowly) for a while, then suddenly it's
>>>>>>>>> as if all disk IO on all volumes (including non-gluster local OS disk
>>>>>>>>> volumes for the hosts) pauses for about 30 seconds, then IO resumes
>>>>>>>>> again. During those times, I start getting VM-not-responding and
>>>>>>>>> host-not-responding notices, and the applications have major issues.
>>>>>>>>>
>>>>>>>>> I've shut down most of my VMs and am down to just my essential core
>>>>>>>>> VMs (I shed about 75% of my VMs). I am still experiencing the same
>>>>>>>>> issues.
>>>>>>>>>
>>>>>>>>> Am I correct in believing that once the failed disk was brought
>>>>>>>>> offline, performance should return to normal?
>>>>>>>>>
>>>>>>>>> On Tue, May 29, 2018 at 1:27 PM, Alex K <rightkicktech(a)gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I would check disk status and the accessibility of the mount points
>>>>>>>>>> where your gluster volumes reside.
>>>>>>>>>>
>>>>>>>>>> On Tue, May 29, 2018, 22:28 Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On one ovirt server, I'm now seeing these messages:
>>>>>>>>>>> [56474.239725] blk_update_request: 63 callbacks suppressed
>>>>>>>>>>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>>>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>>>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>>>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
>>>>>>>>>>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
>>>>>>>>>>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
>>>>>>>>>>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>>>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>>>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>>>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
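>>>>>>>>>>>
>>>>>>>>>>> (To see which backing device dm-2 actually is, something like the
>>>>>>>>>>> following should show it; the exact device names will differ per host:
>>>>>>>>>>>
>>>>>>>>>>>   dmsetup ls    # lists device-mapper names with major:minor numbers
>>>>>>>>>>>   lsblk         # shows each dm device under its parent disk
>>>>>>>>>>>
>>>>>>>>>>> If dm-2 sits on the failing SSHD, these I/O errors would just be the
>>>>>>>>>>> same disk fault surfacing through the device-mapper layer.)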
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>>>>>>>>>>>
>>>>>>>>>>>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>>>>
>>>>>>>>>>>> (appears a lot).
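>>>>>>>>>>>>
>>>>>>>>>>>> (That error usually just means the openvswitch database service isn't
>>>>>>>>>>>> running; as a guess at a first check, and assuming the stock service
>>>>>>>>>>>> name:
>>>>>>>>>>>>
>>>>>>>>>>>>   systemctl status openvswitch
>>>>>>>>>>>>   systemctl restart openvswitch
>>>>>>>>>>>>
>>>>>>>>>>>> It may well be unrelated to the storage stalls.)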
>>>>>>>>>>>>
>>>>>>>>>>>> I also found, in the ssh session on that host, some sysv warnings
>>>>>>>>>>>> about the backing disk for one of the gluster volumes (straight
>>>>>>>>>>>> replica 3). The glusterfs process for that disk on that machine went
>>>>>>>>>>>> offline. It's my understanding that it should continue to work with
>>>>>>>>>>>> the other two machines while I attempt to replace that disk, right?
>>>>>>>>>>>> Attempted writes (touching an empty file) can take 15 seconds;
>>>>>>>>>>>> repeating it later will be much faster.
>>>>>>>>>>>>
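>>>>>>>>>>>> (For the disk swap itself, my plan, sketched from the gluster docs
>>>>>>>>>>>> and with placeholder volume and brick paths, is roughly:
>>>>>>>>>>>>
>>>>>>>>>>>>   gluster volume replace-brick <volume> \
>>>>>>>>>>>>       ovirt3:/gluster/brick-old/data ovirt3:/gluster/brick-new/data \
>>>>>>>>>>>>       commit force
>>>>>>>>>>>>
>>>>>>>>>>>> and then letting self-heal repopulate the new brick.)
>>>>>>>>>>>>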
>>>>>>>>>>>> Gluster generates a bunch of different log files; I don't know which
>>>>>>>>>>>> ones you want, or from which machine(s).
>>>>>>>>>>>>
>>>>>>>>>>>> How do I do "volume profiling"?
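>>>>>>>>>>>>
>>>>>>>>>>>> (From what I can tell from the docs, it's roughly this, using data-hdd
>>>>>>>>>>>> as the example volume:
>>>>>>>>>>>>
>>>>>>>>>>>>   gluster volume profile data-hdd start
>>>>>>>>>>>>   gluster volume profile data-hdd info
>>>>>>>>>>>>   gluster volume profile data-hdd stop
>>>>>>>>>>>>
>>>>>>>>>>>> start enables the counters, info dumps per-brick per-fop latency, and
>>>>>>>>>>>> stop turns the accounting off again.)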
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <sabose(a)redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Do you see errors reported in the mount logs for the volume? If so,
>>>>>>>>>>>>> could you attach the logs? Any issues with your underlying disks?
>>>>>>>>>>>>> Can you also attach the output of volume profiling?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok, things have gotten MUCH worse this morning. I'm getting random
>>>>>>>>>>>>>> errors from VMs; right now about a third of my VMs have been paused
>>>>>>>>>>>>>> due to storage issues, and most of the remaining VMs are not
>>>>>>>>>>>>>> performing well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At this point, I am in full EMERGENCY mode, as my production
>>>>>>>>>>>>>> services are now impacted, and I'm getting calls coming in with
>>>>>>>>>>>>>> problems...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd greatly appreciate help... VMs are running VERY slowly (when
>>>>>>>>>>>>>> they run), and they are steadily getting worse. I don't know why. I
>>>>>>>>>>>>>> was seeing CPU peaks (to 100%) on several VMs, in perfect sync, for
>>>>>>>>>>>>>> a few minutes at a time (while the VM became unresponsive and any
>>>>>>>>>>>>>> linux VMs I was logged into were giving me the CPU-stuck messages
>>>>>>>>>>>>>> from my original post). Is all this storage related?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also have two different gluster volumes for VM storage, and only
>>>>>>>>>>>>>> one had the issues, but now VMs in both are being affected at the
>>>>>>>>>>>>>> same time and in the same way.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <sabose(a)redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Adding gluster-users to look at the heal issue]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've been having some cluster and gluster performance issues
>>>>>>>>>>>>>>>> lately. I also found that my cluster was out of date, and I was
>>>>>>>>>>>>>>>> trying to apply updates (hoping to fix some of these) when I
>>>>>>>>>>>>>>>> discovered the ovirt 4.1 repos were taken completely offline. So, I
>>>>>>>>>>>>>>>> was forced to begin an upgrade to 4.2. According to the docs I
>>>>>>>>>>>>>>>> found/read, I needed only to add the new repo, do a yum update, and
>>>>>>>>>>>>>>>> reboot to be good on my hosts (I did the yum update, then
>>>>>>>>>>>>>>>> engine-setup on my hosted engine). Things seemed to work relatively
>>>>>>>>>>>>>>>> well, except for a gluster sync issue that showed up.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My cluster is a 3 node hyperconverged cluster. I upgraded the
>>>>>>>>>>>>>>>> hosted engine first, then engine 3. When engine 3 came back up, for
>>>>>>>>>>>>>>>> some reason one of my gluster volumes would not sync. Here's sample
>>>>>>>>>>>>>>>> output:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>>>>>>>>>>>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ---------
>>>>>>>>>>>>>>>> It's been in this state for a couple days now, and bandwidth
>>>>>>>>>>>>>>>> monitoring shows no appreciable data moving. I've tried repeatedly
>>>>>>>>>>>>>>>> commanding a full heal from all three nodes in the cluster. It's
>>>>>>>>>>>>>>>> always the same files that need healing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When running gluster volume heal data-hdd statistics, I sometimes
>>>>>>>>>>>>>>>> see different information, but always some number of "heal failed"
>>>>>>>>>>>>>>>> entries. It shows 0 for split brain.
>>>>>>>>>>>>>>>>
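>>>>>>>>>>>>>>>> (For reference, the commands I've been using for the heal attempts,
>>>>>>>>>>>>>>>> in case I'm invoking them wrong:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   gluster volume heal data-hdd full
>>>>>>>>>>>>>>>>   gluster volume heal data-hdd info
>>>>>>>>>>>>>>>>   gluster volume heal data-hdd statistics
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> full schedules a full sweep, info lists the entries still pending,
>>>>>>>>>>>>>>>> and statistics summarizes the heal crawls per brick.)
>>>>>>>>>>>>>>>>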
>>>>>>>>>>>>>>>> I'm not quite sure what to do. I suspect it may be due to nodes 1
>>>>>>>>>>>>>>>> and 2 still being on the older ovirt/gluster release, but I'm afraid
>>>>>>>>>>>>>>>> to upgrade and reboot them until I have a good gluster sync (I don't
>>>>>>>>>>>>>>>> need to create a split brain issue). How do I proceed with this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Second issue: I've been experiencing VERY POOR performance on most
>>>>>>>>>>>>>>>> of my VMs. To the tune that logging into a windows 10 vm via remote
>>>>>>>>>>>>>>>> desktop can take 5 minutes, and launching quickbooks inside said vm
>>>>>>>>>>>>>>>> can easily take 10 minutes. On some linux VMs, I get random messages
>>>>>>>>>>>>>>>> like this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Message from syslogd@unifi at May 28 20:39:23 ...
>>>>>>>>>>>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck
>>>>>>>>>>>>>>>> for 22s! [mongod:14766]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (the process and PID are often different)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not quite sure what to do about this either. My initial thought
>>>>>>>>>>>>>>>> was to upgrade everything to current and see if it's still there,
>>>>>>>>>>>>>>>> but I cannot move forward with that until my gluster is healed...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> Users mailing list -- users(a)ovirt.org
>>>>>>>>>>>>>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>>>>>>>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>>>>>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>>>>>>>>>>>>> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOTIRQF/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Users mailing list -- users(a)ovirt.org
>>>>>>>>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>>>>>>>> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/ACO7RFSLBSRBAIONIC2HQ6Z24ZDES5MF/
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list -- users(a)ovirt.org
>>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/JKBSLSZFRMFGILXGZB4YAMJIJKEKNGMA/
>>>>>
>>>>>
>>>>
>>>
>>
>