So, I appear to be at the point where gluster is functional, but the
engine still won't start. Here's an excerpt from agent.log on one machine after
giving the hosted-engine --vm-start command:
MainThread::INFO::2018-05-29
23:52:09,953::hosted_engine::948::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
Engine VM started on localhost
MainThread::INFO::2018-05-29
23:52:10,230::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineStart-EngineStarting)
sent? sent
MainThread::INFO::2018-05-29
23:56:16,061::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
MainThread::INFO::2018-05-29
23:56:16,075::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
VM is unexpectedly down.
MainThread::INFO::2018-05-29
23:56:16,076::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
Timeout cleared while transitioning <class
'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class
'ovirt_hosted_engine_ha.agent.states.EngineMaybeAway'>
MainThread::INFO::2018-05-29
23:56:16,359::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition
(EngineStarting-EngineMaybeAway) sent? sent
MainThread::INFO::2018-05-29
23:56:16,554::ovf_store::151::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
OVF_STORE volume path:
/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/f22829ab-9fd5-415a-9a8f-809d3f7887d4/9f4760ee-119c-412a-a1e8-49e73e6ba929
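Since agent.log only shows the state machine going EngineStarting ->
EngineMaybeAway, I went looking at the other logs on the same host. Roughly
what I checked (these are the default log locations on my install; paths may
differ on yours):

  tail -n 200 /var/log/ovirt-hosted-engine-ha/broker.log
  tail -n 200 /var/log/libvirt/qemu/HostedEngine.log
  tail -n 200 /var/log/vdsm/vdsm.log

That's what led me to the vdsm.log excerpt below.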
---------------
Here's an excerpt from vdsm.log:
2018-05-30 00:00:23,114-0700 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC
call VM.create succeeded in 0.02 seconds (__init__:573)
2018-05-30 00:00:23,115-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') VM wrapper has started
(vm:2764)
2018-05-30 00:00:23,125-0700 INFO (vm/50392390) [vdsm.api] START
getVolumeSize(sdUUID='c0acdefb-7d16-48ec-9d76-659b8fe33e2a',
spUUID='00000000-0000-0000-0000-000000000000',
imgUUID='3de00d8c-d8b0-4ae0-9363-38a504f5d2b2',
volUUID='d0fcd8de-6105-4f33-a674-727e3a11e89f', options=None)
from=internal, task_id=c6d74ce9-1944-457a-9deb-d0a6929f39ce (api:46)
2018-05-30 00:00:23,130-0700 INFO (vm/50392390) [vdsm.api] FINISH
getVolumeSize return={'truesize': '11298571776', 'apparentsize':
'10737418240'} from=internal, task_id=c6d74ce9-1944-457a-9deb-d0a6929f39ce
(api:52)
2018-05-30 00:00:23,131-0700 INFO (vm/50392390) [vds] prepared volume
path: (clientIF:497)
2018-05-30 00:00:23,131-0700 INFO (vm/50392390) [vdsm.api] START
prepareImage(sdUUID='c0acdefb-7d16-48ec-9d76-659b8fe33e2a',
spUUID='00000000-0000-0000-0000-000000000000',
imgUUID='3de00d8c-d8b0-4ae0-9363-38a504f5d2b2',
leafUUID='d0fcd8de-6105-4f33-a674-727e3a11e89f', allowIllegal=False)
from=internal, task_id=d2797364-270f-4800-92b0-d114441194a3 (api:46)
2018-05-30 00:00:23,155-0700 INFO (vm/50392390) [storage.StorageDomain]
Fixing permissions on
/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f
(fileSD:622)
2018-05-30 00:00:23,156-0700 INFO (vm/50392390) [storage.StorageDomain]
Creating domain run directory
u'/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a' (fileSD:576)
2018-05-30 00:00:23,156-0700 INFO (vm/50392390) [storage.fileUtils]
Creating directory:
/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a mode: None
(fileUtils:197)
2018-05-30 00:00:23,157-0700 INFO (vm/50392390) [storage.StorageDomain]
Creating symlink from
/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2
to
/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2
(fileSD:579)
2018-05-30 00:00:23,228-0700 INFO (jsonrpc/0) [root]
/usr/libexec/vdsm/hooks/after_get_caps/openstacknet_utils.py: rc=0 err=
(hooks:110)
2018-05-30 00:00:23,418-0700 INFO (vm/50392390) [vdsm.api] FINISH
prepareImage return={'info': {'path':
u'engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f',
'type': 'network', 'hosts': [{'port': '0', 'transport': 'tcp',
'name': '192.168.8.11'}, {'port': '0', 'transport': 'tcp',
'name': 'ovirt1.nwfiber.com'}, {'port': '0', 'transport': 'tcp',
'name': 'ovirt2.nwfiber.com'}, {'port': '0', 'transport': 'tcp',
'name': 'ovirt3.nwfiber.com'}], 'protocol': 'gluster'}, 'path':
u'/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f',
'imgVolumesInfo': [{'domainID':
'c0acdefb-7d16-48ec-9d76-659b8fe33e2a',
'leaseOffset': 0, 'path':
u'/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f',
'volumeID': u'd0fcd8de-6105-4f33-a674-727e3a11e89f', 'leasePath':
u'/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f.lease',
'imageID': '3de00d8c-d8b0-4ae0-9363-38a504f5d2b2'}]} from=internal,
task_id=d2797364-270f-4800-92b0-d114441194a3 (api:52)
2018-05-30 00:00:23,419-0700 INFO (vm/50392390) [vds] prepared volume
path:
/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f
(clientIF:497)
2018-05-30 00:00:23,420-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') Enabling drive monitoring
(drivemonitor:54)
2018-05-30 00:00:23,434-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') drive 'hdc' path:
'file=' ->
'*file=' (storagexml:323)
2018-05-30 00:00:23,435-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') drive 'vda' path:
'file=/rhev/data-center/00000000-0000-0000-0000-000000000000/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f'
->
u'*file=/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f'
(storagexml:323)
2018-05-30 00:00:23,435-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') Using drive lease:
{'domainID': 'c0acdefb-7d16-48ec-9d76-659b8fe33e2a',
'leaseOffset': 0,
'path':
u'/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f',
'volumeID': u'd0fcd8de-6105-4f33-a674-727e3a11e89f', 'leasePath':
u'/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f.lease',
'imageID': '3de00d8c-d8b0-4ae0-9363-38a504f5d2b2'} (lease:327)
2018-05-30 00:00:23,436-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') lease
c0acdefb-7d16-48ec-9d76-659b8fe33e2a:d0fcd8de-6105-4f33-a674-727e3a11e89f
path:
'LEASE-PATH:d0fcd8de-6105-4f33-a674-727e3a11e89f:c0acdefb-7d16-48ec-9d76-659b8fe33e2a'
->
u'/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f.lease'
(lease:316)
2018-05-30 00:00:23,436-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') lease
c0acdefb-7d16-48ec-9d76-659b8fe33e2a:d0fcd8de-6105-4f33-a674-727e3a11e89f
offset:
'LEASE-OFFSET:d0fcd8de-6105-4f33-a674-727e3a11e89f:c0acdefb-7d16-48ec-9d76-659b8fe33e2a'
-> 0 (lease:316)
2018-05-30 00:00:23,580-0700 INFO (jsonrpc/0) [root]
/usr/libexec/vdsm/hooks/after_get_caps/ovirt_provider_ovn_hook: rc=0
err=Failed to get Open VSwitch system-id . err = ['ovs-vsctl:
unix:/var/run/openvswitch/db.sock: database connection failed (No such file
or directory)']
(hooks:110)
2018-05-30 00:00:23,582-0700 INFO (jsonrpc/0) [api.host] FINISH
getCapabilities return={'status': {'message': 'Done',
'code': 0}, 'info':
{u'HBAInventory': {u'iSCSI': [{u'InitiatorName':
u'iqn.1994-05.com.redhat:1283ce7a3171'}], u'FC': []},
u'packages2':
{u'kernel': {u'release': u'862.3.2.el7.x86_64',
u'version': u'3.10.0'},
u'glusterfs-rdma': {u'release': u'1.el7', u'version':
u'3.12.9'},
u'glusterfs-fuse': {u'release': u'1.el7', u'version':
u'3.12.9'},
u'spice-server': {u'release': u'2.el7_5.3', u'version':
u'0.14.0'},
u'librbd1': {u'release': u'2.el7', u'version':
u'0.94.5'}, u'vdsm':
{u'release': u'1.el7.centos', u'version': u'4.20.27.1'},
u'qemu-kvm':
{u'release': u'21.el7_5.3.1', u'version': u'2.10.0'},
u'openvswitch':
{u'release': u'3.el7', u'version': u'2.9.0'},
u'libvirt': {u'release':
u'14.el7_5.5', u'version': u'3.9.0'},
u'ovirt-hosted-engine-ha':
{u'release': u'1.el7.centos', u'version': u'2.2.11'},
u'qemu-img':
{u'release': u'21.el7_5.3.1', u'version': u'2.10.0'},
u'mom': {u'release':
u'1.el7.centos', u'version': u'0.5.12'}, u'glusterfs':
{u'release':
u'1.el7', u'version': u'3.12.9'}, u'glusterfs-cli':
{u'release': u'1.el7',
u'version': u'3.12.9'}, u'glusterfs-server': {u'release':
u'1.el7',
u'version': u'3.12.9'}, u'glusterfs-geo-replication':
{u'release':
u'1.el7', u'version': u'3.12.9'}}, u'numaNodeDistance':
{u'1': [20, 10],
u'0': [10, 20]}, u'cpuModel': u'Intel(R) Xeon(R) CPU E5540
@
2.53GHz', u'nestedVirtualization': False, u'liveMerge':
u'true', u'hooks':
{u'before_vm_start': {u'50_hostedengine': {u'md5':
u'95c810cdcfe4195302a59574a5148289'}, u'50_vhostmd': {u'md5':
u'9206bc390bcbf208b06a8e899581be2d'}, u'50_vfio_mdev': {u'md5':
u'b77eaad7b8f5ca3825dca1b38b0dd446'}}, u'after_network_setup':
{u'30_ethtool_options': {u'md5':
u'f04c2ca5dce40663e2ed69806eea917c'}},
u'after_vm_destroy': {u'50_vhostmd': {u'md5':
u'd70f7ee0453f632e87c09a157ff8ff66'}, u'50_vfio_mdev': {u'md5':
u'35364c6e08ecc58410e31af6c73b17f6'}}, u'after_vm_start':
{u'openstacknet_utils.py': {u'md5':
u'640ef8b473cbc16b102c5ebf57ea533f'},
u'50_openstacknet': {u'md5':
u'ea0a5a715da8c1badbcda28e8b8fa00e'}},
u'after_device_migrate_destination': {u'openstacknet_utils.py':
{u'md5':
u'640ef8b473cbc16b102c5ebf57ea533f'}, u'50_openstacknet':
{u'md5':
u'f93c12af7454bbba4c1ef445c2bc9860'}}, u'before_network_setup':
{u'50_fcoe': {u'md5': u'28c352339c8beef1e1b05c67d106d062'}},
u'after_get_caps': {u'openstacknet_utils.py': {u'md5':
u'640ef8b473cbc16b102c5ebf57ea533f'}, u'50_openstacknet':
{u'md5':
u'0e00c63ab44a952e722209ea31fd7a71'}, u'ovirt_provider_ovn_hook':
{u'md5':
u'257990644ee6c7824b522e2ab3077ae7'}}, u'after_nic_hotplug':
{u'openstacknet_utils.py': {u'md5':
u'640ef8b473cbc16b102c5ebf57ea533f'},
u'50_openstacknet': {u'md5':
u'f93c12af7454bbba4c1ef445c2bc9860'}},
u'before_vm_migrate_destination': {u'50_vhostmd': {u'md5':
u'9206bc390bcbf208b06a8e899581be2d'}}, u'after_device_create':
{u'openstacknet_utils.py': {u'md5':
u'640ef8b473cbc16b102c5ebf57ea533f'},
u'50_openstacknet': {u'md5':
u'f93c12af7454bbba4c1ef445c2bc9860'}},
u'before_vm_dehibernate': {u'50_vhostmd': {u'md5':
u'9206bc390bcbf208b06a8e899581be2d'}}, u'before_nic_hotplug':
{u'50_vmfex':
{u'md5': u'49caba1a5faadd8efacef966f79bc30a'},
u'openstacknet_utils.py':
{u'md5': u'640ef8b473cbc16b102c5ebf57ea533f'}, u'50_openstacknet':
{u'md5':
u'0438de5ff9b6bf8d3160804ff71bf827'}, u'ovirt_provider_ovn_hook':
{u'md5':
u'ada7e4e757fd0241b682b5cc6a545fc8'}},
u'before_device_migrate_destination': {u'50_vmfex': {u'md5':
u'49caba1a5faadd8efacef966f79bc30a'}}, u'before_device_create':
{u'50_vmfex': {u'md5': u'49caba1a5faadd8efacef966f79bc30a'},
u'openstacknet_utils.py': {u'md5':
u'640ef8b473cbc16b102c5ebf57ea533f'},
u'50_openstacknet': {u'md5':
u'0438de5ff9b6bf8d3160804ff71bf827'},
u'ovirt_provider_ovn_hook': {u'md5':
u'ada7e4e757fd0241b682b5cc6a545fc8'}}}, u'supportsIPv6': True,
u'realtimeKernel': False, u'vmTypes': [u'kvm'],
u'liveSnapshot': u'true',
u'cpuThreads': u'16', u'kdumpStatus': 0, u'networks':
{u'ovirtmgmt':
{u'iface': u'ovirtmgmt', u'ipv6autoconf': False, u'addr':
u'192.168.8.12',
u'dhcpv6': False, u'ipv6addrs': [], u'switch': u'legacy',
u'bridged': True,
u'mtu': u'1500', u'dhcpv4': True, u'netmask':
u'255.255.255.0',
u'ipv4defaultroute': True, u'stp': u'off', u'ipv4addrs':
[u'192.168.8.12/24'],
u'ipv6gateway': u'::', u'gateway': u'192.168.8.1',
u'ports': [u'em1']},
u'Public_Fiber': {u'iface': u'Public_Fiber',
u'ipv6autoconf': False,
u'addr': u'', u'dhcpv6': False, u'ipv6addrs': [],
u'switch': u'legacy',
u'bridged': True, u'mtu': u'1500', u'dhcpv4': False,
u'netmask': u'',
u'ipv4defaultroute': False, u'stp': u'off', u'ipv4addrs':
[],
u'ipv6gateway': u'::', u'gateway': u'', u'ports':
[u'em1.3']}, u'Gluster':
{u'iface': u'em4', u'ipv6autoconf': False, u'addr':
u'172.172.1.12',
u'dhcpv6': False, u'ipv6addrs': [], u'switch': u'legacy',
u'bridged':
False, u'mtu': u'1500', u'dhcpv4': False, u'netmask':
u'255.255.255.0',
u'ipv4defaultroute': False, u'ipv4addrs': [u'172.172.1.12/24'],
u'interface': u'em4', u'ipv6gateway': u'::',
u'gateway': u''}},
u'kernelArgs': u'BOOT_IMAGE=/vmlinuz-3.10.0-862.3.2.el7.x86_64
root=/dev/mapper/centos_ovirt-root ro crashkernel=auto
rd.lvm.lv=centos_ovirt/root
rd.lvm.lv=centos_ovirt/swap rhgb quiet LANG=en_US.UTF-8', u'bridges':
{u'Public_Fiber': {u'ipv6autoconf': False, u'addr': u'',
u'dhcpv6': False,
u'ipv6addrs': [], u'mtu': u'1500', u'dhcpv4': False,
u'netmask': u'',
u'ipv4defaultroute': False, u'stp': u'off', u'ipv4addrs':
[],
u'ipv6gateway': u'::', u'gateway': u'', u'opts':
{u'multicast_last_member_count': u'2', u'vlan_protocol':
u'0x8100',
u'hash_elasticity': u'4', u'multicast_query_response_interval':
u'1000',
u'group_fwd_mask': u'0x0', u'multicast_snooping': u'1',
u'multicast_startup_query_interval': u'3125', u'hello_timer':
u'0',
u'multicast_querier_interval': u'25500', u'max_age':
u'2000', u'hash_max':
u'512', u'stp_state': u'0', u'topology_change_detected':
u'0', u'priority':
u'32768', u'multicast_igmp_version': u'2',
u'multicast_membership_interval': u'26000', u'root_path_cost':
u'0',
u'root_port': u'0', u'multicast_stats_enabled': u'0',
u'multicast_startup_query_count': u'2', u'nf_call_iptables':
u'0',
u'vlan_stats_enabled': u'0', u'hello_time': u'200',
u'topology_change':
u'0', u'bridge_id': u'8000.00219b900710',
u'topology_change_timer': u'0',
u'ageing_time': u'30000', u'nf_call_ip6tables': u'0',
u'multicast_mld_version': u'1', u'gc_timer': u'25509',
u'root_id':
u'8000.00219b900710', u'nf_call_arptables': u'0',
u'group_addr':
u'1:80:c2:0:0:0', u'multicast_last_member_interval': u'100',
u'default_pvid': u'1', u'multicast_query_interval':
u'12500',
u'multicast_query_use_ifaddr': u'0', u'tcn_timer': u'0',
u'multicast_router': u'1', u'vlan_filtering': u'0',
u'multicast_querier':
u'0', u'forward_delay': u'0'}, u'ports':
[u'em1.3']}, u'ovirtmgmt':
{u'ipv6autoconf': False, u'addr': u'192.168.8.12',
u'dhcpv6': False,
u'ipv6addrs': [], u'mtu': u'1500', u'dhcpv4': True,
u'netmask':
u'255.255.255.0', u'ipv4defaultroute': True, u'stp':
u'off', u'ipv4addrs':
[u'192.168.8.12/24'], u'ipv6gateway': u'::', u'gateway':
u'192.168.8.1',
u'opts': {u'multicast_last_member_count': u'2',
u'vlan_protocol':
u'0x8100', u'hash_elasticity': u'4',
u'multicast_query_response_interval':
u'1000', u'group_fwd_mask': u'0x0', u'multicast_snooping':
u'1',
u'multicast_startup_query_interval': u'3125', u'hello_timer':
u'0',
u'multicast_querier_interval': u'25500', u'max_age':
u'2000', u'hash_max':
u'512', u'stp_state': u'0', u'topology_change_detected':
u'0', u'priority':
u'32768', u'multicast_igmp_version': u'2',
u'multicast_membership_interval': u'26000', u'root_path_cost':
u'0',
u'root_port': u'0', u'multicast_stats_enabled': u'0',
u'multicast_startup_query_count': u'2', u'nf_call_iptables':
u'0',
u'vlan_stats_enabled': u'0', u'hello_time': u'200',
u'topology_change':
u'0', u'bridge_id': u'8000.00219b900710',
u'topology_change_timer': u'0',
u'ageing_time': u'30000', u'nf_call_ip6tables': u'0',
u'multicast_mld_version': u'1', u'gc_timer': u'13221',
u'root_id':
u'8000.00219b900710', u'nf_call_arptables': u'0',
u'group_addr':
u'1:80:c2:0:0:0', u'multicast_last_member_interval': u'100',
u'default_pvid': u'1', u'multicast_query_interval':
u'12500',
u'multicast_query_use_ifaddr': u'0', u'tcn_timer': u'0',
u'multicast_router': u'1', u'vlan_filtering': u'0',
u'multicast_querier':
u'0', u'forward_delay': u'0'}, u'ports':
[u'em1']}}, u'uuid':
u'4C4C4544-005A-4710-8034-B2C04F4C4B31', u'onlineCpus':
u'0,2,4,6,8,10,12,14,1,3,5,7,9,11,13,15', u'nameservers':
[u'192.168.8.1',
u'8.8.8.8'], u'nics': {u'em4': {u'ipv6autoconf': False,
u'addr':
u'172.172.1.12', u'speed': 1000, u'dhcpv6': False,
u'ipv6addrs': [],
u'mtu': u'1500', u'dhcpv4': False, u'netmask':
u'255.255.255.0',
u'ipv4defaultroute': False, u'ipv4addrs': [u'172.172.1.12/24'],
u'hwaddr':
u'00:21:9b:90:07:16', u'ipv6gateway': u'::', u'gateway':
u''}, u'em1':
{u'ipv6autoconf': False, u'addr': u'', u'speed': 1000,
u'dhcpv6': False,
u'ipv6addrs': [], u'mtu': u'1500', u'dhcpv4': False,
u'netmask': u'',
u'ipv4defaultroute': False, u'ipv4addrs': [], u'hwaddr':
u'00:21:9b:90:07:10', u'ipv6gateway': u'::', u'gateway':
u''}, u'em3':
{u'ipv6autoconf': True, u'addr': u'', u'speed': 0,
u'dhcpv6': False,
u'ipv6addrs': [], u'mtu': u'1500', u'dhcpv4': False,
u'netmask': u'',
u'ipv4defaultroute': False, u'ipv4addrs': [], u'hwaddr':
u'00:21:9b:90:07:14', u'ipv6gateway': u'::', u'gateway':
u''}, u'em2':
{u'ipv6autoconf': True, u'addr': u'', u'speed': 0,
u'dhcpv6': False,
u'ipv6addrs': [], u'mtu': u'1500', u'dhcpv4': False,
u'netmask': u'',
u'ipv4defaultroute': False, u'ipv4addrs': [], u'hwaddr':
u'00:21:9b:90:07:12', u'ipv6gateway': u'::', u'gateway':
u''}},
u'software_revision': u'1', u'hostdevPassthrough':
u'false',
u'clusterLevels': [u'3.6', u'4.0', u'4.1',
u'4.2'], u'cpuFlags':
u'fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,rdtscp,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,nopl,xtopology,nonstop_tsc,aperfmperf,pni,dtes64,monitor,ds_cpl,vmx,est,tm2,ssse3,cx16,xtpr,pdcm,dca,sse4_1,sse4_2,popcnt,lahf_lm,tpr_shadow,vnmi,flexpriority,ept,vpid,dtherm,ida,model_Nehalem,model_Conroe,model_coreduo,model_core2duo,model_Penryn,model_n270',
u'kernelFeatures': {u'RETP': 1, u'IBRS': 0, u'PTI': 1},
u'ISCSIInitiatorName': u'iqn.1994-05.com.redhat:1283ce7a3171',
u'netConfigDirty': u'False', u'selinux': {u'mode':
u'1'},
u'autoNumaBalancing': 1, u'reservedMem': u'321',
u'containers': False,
u'bondings': {}, u'software_version': u'4.20',
u'supportedENGINEs':
[u'3.6', u'4.0', u'4.1', u'4.2'], u'cpuSpeed':
u'2527.017', u'numaNodes':
{u'1': {u'totalMemory': u'16371', u'cpus': [1, 3, 5, 7, 9,
11, 13, 15]},
u'0': {u'totalMemory': u'16384', u'cpus': [0, 2, 4, 6, 8,
10, 12, 14]}},
u'cpuSockets': u'2', u'vlans': {u'em1.3':
{u'iface': u'em1',
u'ipv6autoconf': False, u'addr': u'', u'dhcpv6': False,
u'ipv6addrs': [],
u'vlanid': 3, u'mtu': u'1500', u'dhcpv4': False,
u'netmask': u'',
u'ipv4defaultroute': False, u'ipv4addrs': [], u'ipv6gateway':
u'::',
u'gateway': u''}}, u'version_name': u'Snow Man',
'lastClientIface': 'lo',
u'cpuCores': u'8', u'hostedEngineDeployed': True,
u'hugepages': [2048],
u'guestOverhead': u'65', u'additionalFeatures':
[u'libgfapi_supported',
u'GLUSTER_SNAPSHOT', u'GLUSTER_GEO_REPLICATION',
u'GLUSTER_BRICK_MANAGEMENT'], u'kvmEnabled': u'true',
u'memSize': u'31996',
u'emulatedMachines': [u'pc-i440fx-rhel7.1.0',
u'pc-q35-rhel7.3.0',
u'rhel6.3.0', u'pc-i440fx-rhel7.5.0', u'pc-i440fx-rhel7.0.0',
u'rhel6.1.0',
u'pc-i440fx-rhel7.4.0', u'rhel6.6.0', u'pc-q35-rhel7.5.0',
u'rhel6.2.0',
u'pc', u'pc-i440fx-rhel7.3.0', u'q35',
u'pc-i440fx-rhel7.2.0',
u'rhel6.4.0', u'pc-q35-rhel7.4.0', u'rhel6.0.0',
u'rhel6.5.0'],
u'rngSources': [u'hwrng', u'random'], u'operatingSystem':
{u'release':
u'5.1804.el7.centos.2', u'pretty_name': u'CentOS Linux 7 (Core)',
u'version': u'7', u'name': u'RHEL'}}} from=::1,35660
(api:52)
2018-05-30 00:00:23,593-0700 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC
call Host.getCapabilities succeeded in 3.26 seconds (__init__:573)
2018-05-30 00:00:23,724-0700 INFO (vm/50392390) [root]
/usr/libexec/vdsm/hooks/before_device_create/50_openstacknet: rc=0 err=
(hooks:110)
2018-05-30 00:00:23,919-0700 INFO (vm/50392390) [root]
/usr/libexec/vdsm/hooks/before_device_create/50_vmfex: rc=0 err= (hooks:110)
2018-05-30 00:00:24,213-0700 INFO (vm/50392390) [root]
/usr/libexec/vdsm/hooks/before_device_create/openstacknet_utils.py: rc=0
err= (hooks:110)
2018-05-30 00:00:24,392-0700 INFO (vm/50392390) [root]
/usr/libexec/vdsm/hooks/before_device_create/ovirt_provider_ovn_hook: rc=0
err= (hooks:110)
2018-05-30 00:00:24,401-0700 INFO (jsonrpc/1) [api.host] START
getAllVmStats() from=::1,33766 (api:46)
2018-05-30 00:00:24,403-0700 INFO (jsonrpc/1) [api.host] FINISH
getAllVmStats return={'status': {'message': 'Done',
'code': 0},
'statsList': (suppressed)} from=::1,33766 (api:52)
2018-05-30 00:00:24,403-0700 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC
call Host.getAllVmStats succeeded in 0.01 seconds (__init__:573)
2018-05-30 00:00:24,623-0700 INFO (vm/50392390) [root]
/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err=
(hooks:110)
2018-05-30 00:00:24,836-0700 INFO (vm/50392390) [root]
/usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=0 err= (hooks:110)
2018-05-30 00:00:25,022-0700 INFO (vm/50392390) [root]
/usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2018-05-30 00:00:25,023-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') <?xml version="1.0"
encoding="utf-8"?><domain type="kvm"
xmlns:ns0="http://ovirt.org/vm/tune/1.0"
xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
<name>HostedEngine</name>
<uuid>50392390-4f89-435c-bf3b-8254b58a4ef7</uuid>
<memory>8388608</memory>
<currentMemory>8388608</currentMemory>
<maxMemory slots="16">33554432</maxMemory>
<vcpu current="4">16</vcpu>
<sysinfo type="smbios">
<system>
<entry name="manufacturer">oVirt</entry>
<entry name="product">oVirt Node</entry>
<entry name="version">7-5.1804.el7.centos.2</entry>
<entry
name="serial">4C4C4544-005A-4710-8034-B2C04F4C4B31</entry>
<entry
name="uuid">50392390-4f89-435c-bf3b-8254b58a4ef7</entry>
</system>
</sysinfo>
<clock adjustment="0" offset="variable">
<timer name="rtc" tickpolicy="catchup"/>
<timer name="pit" tickpolicy="delay"/>
<timer name="hpet" present="no"/>
</clock>
<features>
<acpi/>
</features>
<cpu match="exact">
<model>Nehalem</model>
<topology cores="1" sockets="16"
threads="1"/>
<numa>
<cell cpus="0,1,2,3" id="0"
memory="8388608"/>
</numa>
</cpu>
<cputune/>
<devices>
<input bus="usb" type="tablet"/>
<channel type="unix">
<target name="ovirt-guest-agent.0" type="virtio"/>
<source mode="bind"
path="/var/lib/libvirt/qemu/channels/50392390-4f89-435c-bf3b-8254b58a4ef7.ovirt-guest-agent.0"/>
</channel>
<channel type="unix">
<target name="org.qemu.guest_agent.0"
type="virtio"/>
<source mode="bind"
path="/var/lib/libvirt/qemu/channels/50392390-4f89-435c-bf3b-8254b58a4ef7.org.qemu.guest_agent.0"/>
</channel>
<controller index="0" ports="16"
type="virtio-serial">
<address bus="0x00" domain="0x0000"
function="0x0" slot="0x03"
type="pci"/>
</controller>
<rng model="virtio">
<backend model="random">/dev/urandom</backend>
</rng>
<graphics autoport="yes" keymap="en-us"
passwd="*****"
passwdValidTo="1970-01-01T00:00:01" port="-1"
type="vnc">
<listen network="vdsm-ovirtmgmt" type="network"/>
</graphics>
<controller type="usb">
<address bus="0x00" domain="0x0000"
function="0x2" slot="0x01"
type="pci"/>
</controller>
<controller type="ide">
<address bus="0x00" domain="0x0000"
function="0x1" slot="0x01"
type="pci"/>
</controller>
<video>
<model heads="1" type="cirrus"
vram="16384"/>
<address bus="0x00" domain="0x0000"
function="0x0" slot="0x05"
type="pci"/>
</video>
<memballoon model="none"/>
<disk device="cdrom" snapshot="no"
type="file">
<driver error_policy="report" name="qemu"
type="raw"/>
<source file="" startupPolicy="optional"/>
<target bus="ide" dev="hdc"/>
<readonly/>
<address bus="1" controller="0" target="0"
type="drive"
unit="0"/>
</disk>
<disk device="disk" snapshot="no"
type="file">
<target bus="virtio" dev="vda"/>
<source
file="/var/run/vdsm/storage/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f"/>
<driver cache="none" error_policy="stop"
io="threads"
name="qemu" type="raw"/>
<alias name="ua-3de00d8c-d8b0-4ae0-9363-38a504f5d2b2"/>
<address bus="0x00" domain="0x0000"
function="0x0" slot="0x04"
type="pci"/>
<serial>3de00d8c-d8b0-4ae0-9363-38a504f5d2b2</serial>
</disk>
<lease>
<key>d0fcd8de-6105-4f33-a674-727e3a11e89f</key>
<lockspace>c0acdefb-7d16-48ec-9d76-659b8fe33e2a</lockspace>
<target offset="0"
path="/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/images/3de00d8c-d8b0-4ae0-9363-38a504f5d2b2/d0fcd8de-6105-4f33-a674-727e3a11e89f.lease"/>
</lease>
<interface type="bridge">
<model type="virtio"/>
<link state="up"/>
<source bridge="ovirtmgmt"/>
<alias name="ua-9f93c126-cbb3-4c5b-909e-41a53c2e31fb"/>
<address bus="0x00" domain="0x0000"
function="0x0" slot="0x02"
type="pci"/>
<mac address="00:16:3e:2d:bd:b1"/>
<filterref filter="vdsm-no-mac-spoofing"/>
<bandwidth/>
</interface>
<channel type="unix"><target
name="org.ovirt.hosted-engine-setup.0"
type="virtio"/><source mode="bind"
path="/var/lib/libvirt/qemu/channels/50392390-4f89-435c-bf3b-8254b58a4ef7.org.ovirt.hosted-engine-setup.0"/></channel></devices>
<pm>
<suspend-to-disk enabled="no"/>
<suspend-to-mem enabled="no"/>
</pm>
<os>
<type arch="x86_64"
machine="pc-i440fx-rhel7.3.0">hvm</type>
<smbios mode="sysinfo"/>
</os>
<metadata>
<ns0:qos/>
<ovirt-vm:vm>
<minGuaranteedMemoryMb
type="int">8192</minGuaranteedMemoryMb>
<clusterVersion>4.2</clusterVersion>
<ovirt-vm:custom/>
<ovirt-vm:device mac_address="00:16:3e:2d:bd:b1">
<ovirt-vm:custom/>
</ovirt-vm:device>
<ovirt-vm:device devtype="disk" name="vda">
<ovirt-vm:poolID>00000000-0000-0000-0000-000000000000</ovirt-vm:poolID>
<ovirt-vm:volumeID>d0fcd8de-6105-4f33-a674-727e3a11e89f</ovirt-vm:volumeID>
<ovirt-vm:shared>exclusive</ovirt-vm:shared>
<ovirt-vm:imageID>3de00d8c-d8b0-4ae0-9363-38a504f5d2b2</ovirt-vm:imageID>
<ovirt-vm:domainID>c0acdefb-7d16-48ec-9d76-659b8fe33e2a</ovirt-vm:domainID>
</ovirt-vm:device>
<launchPaused>false</launchPaused>
<resumeBehavior>auto_resume</resumeBehavior>
</ovirt-vm:vm>
</metadata>
<on_poweroff>destroy</on_poweroff><on_reboot>destroy</on_reboot><on_crash>destroy</on_crash></domain>
(vm:2867)
2018-05-30 00:00:25,252-0700 INFO (monitor/c0acdef) [storage.SANLock]
Acquiring host id for domain c0acdefb-7d16-48ec-9d76-659b8fe33e2a (id=2,
async=True) (clusterlock:284)
2018-05-30 00:00:26,808-0700 ERROR (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') The vm start process failed
(vm:943)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in
_startUnderlyingVm
self._run()
File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2872, in
_run
dom.createWithFlags(flags)
File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py",
line 130, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92,
in wrapper
return func(inst, *args, **kwargs)
File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in
createWithFlags
if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed',
dom=self)
libvirtError: Failed to acquire lock: No space left on device
2018-05-30 00:00:26,809-0700 INFO (vm/50392390) [virt.vm]
(vmId='50392390-4f89-435c-bf3b-8254b58a4ef7') Changed state to Down: Failed
to acquire lock: No space left on device (code=1) (vm:1683)
-------------
The "failed to acquire lock: no space left on device" error has me
conserned...Yet there is no device that is full on any servers (espciially
the one I'm on):
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos_ovirt-root 8.0G 6.6G 1.5G 82% /
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 4.0K 16G 1% /dev/shm
tmpfs 16G 18M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/mapper/gluster-data 177G 59G 119G 33% /gluster/brick2
/dev/mapper/gluster-engine 25G 18G 7.9G 69% /gluster/brick1
/dev/mapper/vg_thin-iso 25G 7.5G 18G 30% /gluster/brick4
/dev/mapper/vg_thin-data--hdd 1.2T 308G 892G 26% /gluster/brick3
/dev/sda1 497M 313M 185M 63% /boot
192.168.8.11:/engine 25G 20G 5.5G 79%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
tmpfs 3.2G 0 3.2G 0% /run/user/0
192.168.8.11:/data 136G 60G 77G 44%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso 25G 7.5G 18G 30%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
172.172.1.11:/data-hdd 1.2T 308G 892G 26%
/rhev/data-center/mnt/glusterSD/172.172.1.11:_data-hdd
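From what I understand (and I may be wrong), that error is sanlock failing to
acquire the storage lease rather than a full filesystem, so my next step is to
look at the lockspace itself. A rough sketch of what I plan to run; the
commands and the dom_md/ids path are my assumptions based on the engine
storage domain mounted above, not verified output:

  # see what sanlock currently holds
  sanlock client status

  # inspect the hosted-engine storage domain's ids file
  sanlock direct dump /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine/c0acdefb-7d16-48ec-9d76-659b8fe33e2a/dom_md/ids

  # if the lockspace really is corrupt, reinitialize it from global maintenance
  hosted-engine --set-maintenance --mode=global
  hosted-engine --reinitialize-lockspace
  hosted-engine --set-maintenance --mode=none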
--------
--Jim
On Tue, May 29, 2018 at 11:53 PM, Jim Kusznir <jim(a)palousetech.com> wrote:
I've just gotten back to the point where hosted-engine --vm-status does
(eventually) return some output other than the ha-agent-or-storage-is-down
error. It currently takes up to two minutes, and the last run returned stale
data.
Gluster-wise, gluster volume heal <volume> info shows 0 entries for the three
volumes that have all bricks up (and returns quickly). The 4th volume, which
is missing one brick (but still has two replicas), does return entries, and is
currently taking a very long time to come back with them; the heal commands
I've been using are sketched just below.
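(Roughly what I've been running against the degraded volume; data-hdd is the
one with the missing brick in my case, so adjust the volume name. This is just
a sketch, not captured output:

  gluster volume heal data-hdd info
  gluster volume heal data-hdd statistics heal-count
  gluster volume heal data-hdd full    # trigger a full self-heal pass
)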
Making progress toward getting it online again... I don't know why I get
stale data from hosted-engine --vm-status or how to overcome that...
--Jim
On Tue, May 29, 2018 at 11:38 PM, Jim Kusznir <jim(a)palousetech.com> wrote:
> I believe the gluster data store for the engine is up and working
> correctly. The rest are not mounted, as the engine hasn't managed to start
> correctly yet. I did perform the copy-a-junk-file test: copy a file into the
> data store, check that it's there on another host, then delete it and confirm
> it disappears on the first host; it passed that test. Here's the
> info and status. (I have NOT performed the steps that Krutika and
> Ravishankar suggested yet, as I don't have my data volumes working again
> yet.)
>
> [root@ovirt2 images]# gluster volume info
>
> Volume Name: data
> Type: Replicate
> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
> Options Reconfigured:
> diagnostics.count-fop-hits: on
> diagnostics.latency-measurement: on
> changelog.changelog: on
> geo-replication.ignore-pid-check: on
> geo-replication.indexing: on
> server.allow-insecure: on
> performance.readdir-ahead: on
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> storage.owner-uid: 36
> storage.owner-gid: 36
> features.shard: on
> features.shard-block-size: 512MB
> performance.low-prio-threads: 32
> cluster.data-self-heal-algorithm: full
> cluster.locking-scheme: granular
> cluster.shd-wait-qlength: 10000
> cluster.shd-max-threads: 8
> network.ping-timeout: 30
> user.cifs: off
> nfs.disable: on
> performance.strict-o-direct: on
>
> Volume Name: data-hdd
> Type: Replicate
> Volume ID: d342a3ab-16f3-49f0-bbcf-f788be8ac5f1
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 172.172.1.11:/gluster/brick3/data-hdd
> Brick2: 172.172.1.12:/gluster/brick3/data-hdd
> Brick3: 172.172.1.13:/gluster/brick3/data-hdd
> Options Reconfigured:
> diagnostics.count-fop-hits: on
> diagnostics.latency-measurement: on
> network.ping-timeout: 30
> server.allow-insecure: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> user.cifs: off
> features.shard: on
> cluster.shd-wait-qlength: 10000
> cluster.shd-max-threads: 8
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> cluster.eager-lock: enable
> network.remote-dio: enable
> performance.low-prio-threads: 32
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> changelog.changelog: on
> geo-replication.ignore-pid-check: on
> geo-replication.indexing: on
> transport.address-family: inet
> performance.readdir-ahead: on
>
> Volume Name: engine
> Type: Replicate
> Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine
> Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine
> Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter)
> Options Reconfigured:
> changelog.changelog: on
> geo-replication.ignore-pid-check: on
> geo-replication.indexing: on
> performance.readdir-ahead: on
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: off
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> storage.owner-uid: 36
> storage.owner-gid: 36
> features.shard: on
> features.shard-block-size: 512MB
> performance.low-prio-threads: 32
> cluster.data-self-heal-algorithm: full
> cluster.locking-scheme: granular
> cluster.shd-wait-qlength: 10000
> cluster.shd-max-threads: 6
> network.ping-timeout: 30
> user.cifs: off
> nfs.disable: on
> performance.strict-o-direct: on
>
> Volume Name: iso
> Type: Replicate
> Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso
> Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso
> Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter)
> Options Reconfigured:
> performance.strict-o-direct: on
> nfs.disable: on
> user.cifs: off
> network.ping-timeout: 30
> cluster.shd-max-threads: 6
> cluster.shd-wait-qlength: 10000
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> performance.low-prio-threads: 32
> features.shard-block-size: 512MB
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> network.remote-dio: off
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> performance.readdir-ahead: on
>
>
> [root@ovirt2 images]# gluster volume status
> Status of volume: data
> Gluster process TCP Port RDMA Port Online
> Pid
> ------------------------------------------------------------
> ------------------
> Brick ovirt1.nwfiber.com:/gluster/brick2/da
> ta 49152 0 Y
> 3226
> Brick ovirt2.nwfiber.com:/gluster/brick2/da
> ta 49152 0 Y
> 2967
> Brick ovirt3.nwfiber.com:/gluster/brick2/da
> ta 49152 0 Y
> 2554
> Self-heal Daemon on localhost N/A N/A Y
> 4818
> Self-heal Daemon on ovirt3.nwfiber.com      N/A       N/A        Y
> 17853
> Self-heal Daemon on ovirt1.nwfiber.com      N/A       N/A        Y
> 4771
>
> Task Status of Volume data
> ------------------------------------------------------------
> ------------------
> There are no active volume tasks
>
> Status of volume: data-hdd
> Gluster process TCP Port RDMA Port Online
> Pid
> ------------------------------------------------------------
> ------------------
> Brick 172.172.1.11:/gluster/brick3/data-hdd 49153 0 Y
> 3232
> Brick 172.172.1.12:/gluster/brick3/data-hdd 49153 0 Y
> 2976
> Brick 172.172.1.13:/gluster/brick3/data-hdd N/A N/A N
> N/A
> NFS Server on localhost N/A N/A N
> N/A
> Self-heal Daemon on localhost N/A N/A Y
> 4818
> NFS Server on ovirt3.nwfiber.com            N/A       N/A        N
> N/A
> Self-heal Daemon on ovirt3.nwfiber.com      N/A       N/A        Y
> 17853
> NFS Server on ovirt1.nwfiber.com            N/A       N/A        N
> N/A
> Self-heal Daemon on ovirt1.nwfiber.com      N/A       N/A        Y
> 4771
>
> Task Status of Volume data-hdd
> ------------------------------------------------------------
> ------------------
> There are no active volume tasks
>
> Status of volume: engine
> Gluster process TCP Port RDMA Port Online
> Pid
> ------------------------------------------------------------
> ------------------
> Brick ovirt1.nwfiber.com:/gluster/brick1/en
> gine 49154 0 Y
> 3239
> Brick ovirt2.nwfiber.com:/gluster/brick1/en
> gine 49154 0 Y
> 2982
> Brick ovirt3.nwfiber.com:/gluster/brick1/en
> gine 49154 0 Y
> 2578
> Self-heal Daemon on localhost N/A N/A Y
> 4818
> Self-heal Daemon on ovirt3.nwfiber.com      N/A       N/A        Y
> 17853
> Self-heal Daemon on ovirt1.nwfiber.com      N/A       N/A        Y
> 4771
>
> Task Status of Volume engine
> ------------------------------------------------------------
> ------------------
> There are no active volume tasks
>
> Status of volume: iso
> Gluster process TCP Port RDMA Port Online
> Pid
> ------------------------------------------------------------
> ------------------
> Brick ovirt1.nwfiber.com:/gluster/brick4/is
> o 49155 0 Y
> 3247
> Brick ovirt2.nwfiber.com:/gluster/brick4/is
> o 49155 0 Y
> 2990
> Brick ovirt3.nwfiber.com:/gluster/brick4/is
> o 49155 0 Y
> 2580
> Self-heal Daemon on localhost N/A N/A Y
> 4818
> Self-heal Daemon on ovirt3.nwfiber.com      N/A       N/A        Y
> 17853
> Self-heal Daemon on ovirt1.nwfiber.com      N/A       N/A        Y
> 4771
>
> Task Status of Volume iso
> ------------------------------------------------------------
> ------------------
> There are no active volume tasks
>
>
> On Tue, May 29, 2018 at 11:30 PM, Sahina Bose <sabose(a)redhat.com> wrote:
>
>>
>>
>> On Wed, May 30, 2018 at 10:42 AM, Jim Kusznir <jim(a)palousetech.com>
>> wrote:
>>
>>> hosted-engine --deploy failed (would not come up on my existing gluster
>>> storage). However, I realized no changes were written to my existing
>>> storage. So, I went back to trying to get my old engine running.
>>>
>>> hosted-engine --vm-status is now taking a very long time (5+ minutes) to
>>> return, and it returns stale information everywhere. I thought perhaps the
>>> lockspace was corrupt, so I tried to clean that and the metadata, but both
>>> are failing (--clean-metadata has hung and I can't even ctrl-c out of it).
>>>
>>> How can I reinitialize all the lockspace/metadata safely? There are no
>>> engine or VMs running currently....
>>>
>>
>> I think the first thing to make sure is that your storage is up and
>> running. So can you mount the gluster volumes and access the
>> contents there?
>> Please provide the gluster volume info and gluster volume status of the
>> volumes that you're using.
>>
>>
>>
>>> --Jim
>>>
>>> On Tue, May 29, 2018 at 9:33 PM, Jim Kusznir <jim(a)palousetech.com>
>>> wrote:
>>>
>>>> Well, things went from bad to very, very bad....
>>>>
>>>> It appears that during one of the 2-minute lockups, the fencing agents
>>>> decided that another node in the cluster was down. As a result, 2 of the 3
>>>> nodes were simultaneously reset by the fencing agent. After the nodes
>>>> came back up, the engine would not start. All running VMs (including VMs
>>>> on the 3rd node that was not rebooted) crashed.
>>>>
>>>> I've now been working for about 3 hours trying to get the engine to
>>>> come up. I don't know why it won't start. hosted-engine --vm-start says
>>>> it's starting, but it doesn't start (virsh doesn't show any VMs running).
>>>> I'm currently running --deploy, as I had run out of other options I could
>>>> come up with. I hope this will allow me to re-import all my existing VMs
>>>> and start them back up after everything comes back up.
>>>>
>>>> I do have an unverified geo-rep backup; I don't know if it is a good
>>>> backup (there were several prior messages to this list, but I didn't get
>>>> replies to my questions; it was behaving in what I believe to be a
>>>> "strange" way, and the data directories are larger than their source).
>>>>
>>>> I'll see if my --deploy works, and if not, I'll be back with another
>>>> message/help request.
>>>>
>>>> When the dust settles and I'm at least minimally functional again, I
>>>> really want to understand why all these technologies designed to offer
>>>> redundancy conspired to reduce uptime and create failures where there
>>>> weren't any otherwise. I thought that with hosted engine, 3 oVirt servers,
>>>> and glusterfs with minimum replica 2+arbiter or replica 3, I should have
>>>> had strong resilience against server or disk failure, and should have
>>>> prevented / recovered from data corruption. Instead, all of the above
>>>> happened (once I get my cluster back up, I still have to try to recover my
>>>> webserver VM, which won't boot due to XFS corrupt-journal issues created
>>>> during the gluster crashes). I think a lot of these issues were rooted
>>>> in the upgrade from 4.1 to 4.2.
>>>>
>>>> --Jim
>>>>
>>>> On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir <jim(a)palousetech.com>
>>>> wrote:
>>>>
>>>>> I also finally found the following in my system log on one server:
>>>>>
>>>>> [10679.524491] INFO: task glusterclogro:14933 blocked for more than
>>>>> 120 seconds.
>>>>> [10679.525826] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [10679.527144] glusterclogro D ffff97209832bf40 0 14933 1
>>>>> 0x00000080
>>>>> [10679.527150] Call Trace:
>>>>> [10679.527161] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [10679.527218] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>> [xfs]
>>>>> [10679.527225] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>> [10679.527254] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0
[xfs]
>>>>> [10679.527260] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>> [10679.527268] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [10679.527271] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>> [10679.527275] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>> [10679.527279] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [10679.527283] INFO: task glusterposixfsy:14941 blocked for more
than
>>>>> 120 seconds.
>>>>> [10679.528608] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [10679.529956] glusterposixfsy D ffff972495f84f10 0 14941 1
>>>>> 0x00000080
>>>>> [10679.529961] Call Trace:
>>>>> [10679.529966] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [10679.530003] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>> [xfs]
>>>>> [10679.530008] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>> [10679.530038] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0
[xfs]
>>>>> [10679.530042] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>> [10679.530046] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [10679.530050] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20
>>>>> [10679.530054] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>> [10679.530058] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [10679.530062] INFO: task glusteriotwr13:15486 blocked for more than
>>>>> 120 seconds.
>>>>> [10679.531805] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [10679.533732] glusteriotwr13 D ffff9720a83f0000 0 15486 1
>>>>> 0x00000080
>>>>> [10679.533738] Call Trace:
>>>>> [10679.533747] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [10679.533799] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>> [xfs]
>>>>> [10679.533806] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>> [10679.533846] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0
[xfs]
>>>>> [10679.533852] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>> [10679.533858] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [10679.533863] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20
>>>>> [10679.533868] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>> [10679.533873] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [10919.512757] INFO: task glusterclogro:14933 blocked for more than
>>>>> 120 seconds.
>>>>> [10919.514714] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [10919.516663] glusterclogro D ffff97209832bf40 0 14933 1
>>>>> 0x00000080
>>>>> [10919.516677] Call Trace:
>>>>> [10919.516690] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [10919.516696] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>> [10919.516703] [<ffffffffb951cc04>] ?
blk_finish_plug+0x14/0x40
>>>>> [10919.516768] [<ffffffffc05e9224>] ?
_xfs_buf_ioapply+0x334/0x460
>>>>> [xfs]
>>>>> [10919.516774] [<ffffffffb991432d>]
wait_for_completion+0xfd/0x140
>>>>> [10919.516782] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>> [10919.516821] [<ffffffffc05eb0a3>] ? _xfs_buf_read+0x23/0x40
[xfs]
>>>>> [10919.516859] [<ffffffffc05eafa9>]
xfs_buf_submit_wait+0xf9/0x1d0
>>>>> [xfs]
>>>>> [10919.516902] [<ffffffffc061b279>] ?
xfs_trans_read_buf_map+0x199/0x400
>>>>> [xfs]
>>>>> [10919.516940] [<ffffffffc05eb0a3>] _xfs_buf_read+0x23/0x40
[xfs]
>>>>> [10919.516977] [<ffffffffc05eb1b9>]
xfs_buf_read_map+0xf9/0x160 [xfs]
>>>>> [10919.517022] [<ffffffffc061b279>]
xfs_trans_read_buf_map+0x199/0x400
>>>>> [xfs]
>>>>> [10919.517057] [<ffffffffc05c8d04>] xfs_da_read_buf+0xd4/0x100
[xfs]
>>>>> [10919.517091] [<ffffffffc05c8d53>]
xfs_da3_node_read+0x23/0xd0 [xfs]
>>>>> [10919.517126] [<ffffffffc05c9fee>]
xfs_da3_node_lookup_int+0x6e/0x2f0
>>>>> [xfs]
>>>>> [10919.517160] [<ffffffffc05d5a1d>]
xfs_dir2_node_lookup+0x4d/0x170
>>>>> [xfs]
>>>>> [10919.517194] [<ffffffffc05ccf5d>] xfs_dir_lookup+0x1bd/0x1e0
[xfs]
>>>>> [10919.517233] [<ffffffffc05fd8d9>] xfs_lookup+0x69/0x140
[xfs]
>>>>> [10919.517271] [<ffffffffc05fa018>] xfs_vn_lookup+0x78/0xc0
[xfs]
>>>>> [10919.517278] [<ffffffffb9425cf3>] lookup_real+0x23/0x60
>>>>> [10919.517283] [<ffffffffb9426702>] __lookup_hash+0x42/0x60
>>>>> [10919.517288] [<ffffffffb942d519>]
SYSC_renameat2+0x3a9/0x5a0
>>>>> [10919.517296] [<ffffffffb94d3753>] ?
selinux_file_free_security+0x2
>>>>> 3/0x30
>>>>> [10919.517304] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [10919.517309] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [10919.517313] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [10919.517318] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [10919.517323] [<ffffffffb942e58e>] SyS_renameat2+0xe/0x10
>>>>> [10919.517328] [<ffffffffb942e5ce>] SyS_rename+0x1e/0x20
>>>>> [10919.517333] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>> [10919.517339] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [11159.496095] INFO: task glusteriotwr9:15482 blocked for more than
>>>>> 120 seconds.
>>>>> [11159.497546] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [11159.498978] glusteriotwr9 D ffff971fa0fa1fa0 0 15482 1
>>>>> 0x00000080
>>>>> [11159.498984] Call Trace:
>>>>> [11159.498995] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50
>>>>> [11159.498999] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [11159.499003] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>> [11159.499056] [<ffffffffc05dd9b7>] ?
xfs_iext_bno_to_ext+0xa7/0x1a0
>>>>> [xfs]
>>>>> [11159.499082] [<ffffffffc05dd43e>] ?
xfs_iext_bno_to_irec+0x8e/0xd0
>>>>> [xfs]
>>>>> [11159.499090] [<ffffffffb92f7a12>] ?
ktime_get_ts64+0x52/0xf0
>>>>> [11159.499093] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50
>>>>> [11159.499097] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130
>>>>> [11159.499101] [<ffffffffb9913528>] io_schedule+0x18/0x20
>>>>> [11159.499104] [<ffffffffb9911f11>] bit_wait_io+0x11/0x50
>>>>> [11159.499107] [<ffffffffb9911ac1>]
__wait_on_bit_lock+0x61/0xc0
>>>>> [11159.499113] [<ffffffffb9393634>] __lock_page+0x74/0x90
>>>>> [11159.499118] [<ffffffffb92bc210>] ?
wake_bit_function+0x40/0x40
>>>>> [11159.499121] [<ffffffffb9394154>]
__find_lock_page+0x54/0x70
>>>>> [11159.499125] [<ffffffffb9394e85>]
grab_cache_page_write_begin+0x
>>>>> 55/0xc0
>>>>> [11159.499130] [<ffffffffb9484b76>]
iomap_write_begin+0x66/0x100
>>>>> [11159.499135] [<ffffffffb9484edf>]
iomap_write_actor+0xcf/0x1d0
>>>>> [11159.499140] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>> [11159.499144] [<ffffffffb94854e7>] iomap_apply+0xb7/0x150
>>>>> [11159.499149] [<ffffffffb9485621>]
iomap_file_buffered_write+0xa1
>>>>> /0xe0
>>>>> [11159.499153] [<ffffffffb9484e10>] ?
iomap_write_end+0x80/0x80
>>>>> [11159.499182] [<ffffffffc05f025d>]
xfs_file_buffered_aio_write+0x12d/0x2c0
>>>>> [xfs]
>>>>> [11159.499213] [<ffffffffc05f057d>]
xfs_file_aio_write+0x18d/0x1b0
>>>>> [xfs]
>>>>> [11159.499217] [<ffffffffb941a533>] do_sync_write+0x93/0xe0
>>>>> [11159.499222] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0
>>>>> [11159.499225] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0
>>>>> [11159.499230] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [11159.499234] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>> [11159.499238] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [11279.488720] INFO: task xfsaild/dm-10:1134 blocked for more than
>>>>> 120 seconds.
>>>>> [11279.490197] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [11279.491665] xfsaild/dm-10 D ffff9720a8660fd0 0 1134 2
>>>>> 0x00000000
>>>>> [11279.491671] Call Trace:
>>>>> [11279.491682] [<ffffffffb92a3a2e>] ?
try_to_del_timer_sync+0x5e/0x9
>>>>> 0
>>>>> [11279.491688] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [11279.491744] [<ffffffffc060de36>] _xfs_log_force+0x1c6/0x2c0
[xfs]
>>>>> [11279.491750] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>> [11279.491783] [<ffffffffc0619fec>] ? xfsaild+0x16c/0x6f0
[xfs]
>>>>> [11279.491817] [<ffffffffc060df5c>] xfs_log_force+0x2c/0x70
[xfs]
>>>>> [11279.491849] [<ffffffffc0619e80>] ?
xfs_trans_ail_cursor_first+0x90/0x90
>>>>> [xfs]
>>>>> [11279.491880] [<ffffffffc0619fec>] xfsaild+0x16c/0x6f0 [xfs]
>>>>> [11279.491913] [<ffffffffc0619e80>] ?
xfs_trans_ail_cursor_first+0x90/0x90
>>>>> [xfs]
>>>>> [11279.491919] [<ffffffffb92bb161>] kthread+0xd1/0xe0
>>>>> [11279.491926] [<ffffffffb92bb090>] ?
insert_kthread_work+0x40/0x40
>>>>> [11279.491932] [<ffffffffb9920677>]
ret_from_fork_nospec_begin+0x2
>>>>> 1/0x21
>>>>> [11279.491936] [<ffffffffb92bb090>] ?
insert_kthread_work+0x40/0x40
>>>>> [11279.491976] INFO: task glusterclogfsyn:14934 blocked for more
than
>>>>> 120 seconds.
>>>>> [11279.493466] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [11279.494952] glusterclogfsyn D ffff97209832af70 0 14934 1
>>>>> 0x00000080
>>>>> [11279.494957] Call Trace:
>>>>> [11279.494979] [<ffffffffc0309839>] ?
__split_and_process_bio+0x2e9/0x520
>>>>> [dm_mod]
>>>>> [11279.494983] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [11279.494987] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>> [11279.494997] [<ffffffffc0309d98>] ?
dm_make_request+0x128/0x1a0
>>>>> [dm_mod]
>>>>> [11279.495001] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130
>>>>> [11279.495005] [<ffffffffb99145ad>]
wait_for_completion_io+0xfd/0x
>>>>> 140
>>>>> [11279.495010] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>> [11279.495016] [<ffffffffb951e574>]
blkdev_issue_flush+0xb4/0x110
>>>>> [11279.495049] [<ffffffffc06064b9>]
xfs_blkdev_issue_flush+0x19/0x20
>>>>> [xfs]
>>>>> [11279.495079] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0
[xfs]
>>>>> [11279.495086] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>> [11279.495090] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [11279.495094] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>> [11279.495098] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>> [11279.495102] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [11279.495105] INFO: task glusterposixfsy:14941 blocked for more
than
>>>>> 120 seconds.
>>>>> [11279.496606] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [11279.498114] glusterposixfsy D ffff972495f84f10 0 14941 1
>>>>> 0x00000080
>>>>> [11279.498118] Call Trace:
>>>>> [11279.498134] [<ffffffffc0309839>] ?
__split_and_process_bio+0x2e9/0x520
>>>>> [dm_mod]
>>>>> [11279.498138] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [11279.498142] [<ffffffffb99118e9>]
schedule_timeout+0x239/0x2c0
>>>>> [11279.498152] [<ffffffffc0309d98>] ?
dm_make_request+0x128/0x1a0
>>>>> [dm_mod]
>>>>> [11279.498156] [<ffffffffb991348d>]
io_schedule_timeout+0xad/0x130
>>>>> [11279.498160] [<ffffffffb99145ad>]
wait_for_completion_io+0xfd/0x
>>>>> 140
>>>>> [11279.498165] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>> [11279.498169] [<ffffffffb951e574>]
blkdev_issue_flush+0xb4/0x110
>>>>> [11279.498202] [<ffffffffc06064b9>]
xfs_blkdev_issue_flush+0x19/0x20
>>>>> [xfs]
>>>>> [11279.498231] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0
[xfs]
>>>>> [11279.498238] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>> [11279.498242] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [11279.498246] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20
>>>>> [11279.498250] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>> [11279.498254] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [11279.498257] INFO: task glusteriotwr1:14950 blocked for more than
>>>>> 120 seconds.
>>>>> [11279.499789] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [11279.501343] glusteriotwr1 D ffff97208b6daf70 0 14950 1
>>>>> 0x00000080
>>>>> [11279.501348] Call Trace:
>>>>> [11279.501353] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [11279.501390] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>> [xfs]
>>>>> [11279.501396] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>> [11279.501428] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0
[xfs]
>>>>> [11279.501432] [<ffffffffb944ef3f>]
generic_write_sync+0x4f/0x70
>>>>> [11279.501461] [<ffffffffc05f0545>]
xfs_file_aio_write+0x155/0x1b0
>>>>> [xfs]
>>>>> [11279.501466] [<ffffffffb941a533>] do_sync_write+0x93/0xe0
>>>>> [11279.501471] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0
>>>>> [11279.501475] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0
>>>>> [11279.501479] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [11279.501483] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>> [11279.501489] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [11279.501493] INFO: task glusteriotwr4:14953 blocked for more than
>>>>> 120 seconds.
>>>>> [11279.503047] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [11279.504630] glusteriotwr4 D ffff972499f2bf40 0 14953 1
>>>>> 0x00000080
>>>>> [11279.504635] Call Trace:
>>>>> [11279.504640] [<ffffffffb9913f79>] schedule+0x29/0x70
>>>>> [11279.504676] [<ffffffffc060e388>]
_xfs_log_force_lsn+0x2e8/0x340
>>>>> [xfs]
>>>>> [11279.504681] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20
>>>>> [11279.504710] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0
[xfs]
>>>>> [11279.504714] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0
>>>>> [11279.504718] [<ffffffffb992076f>] ?
system_call_after_swapgs+0xbc/
>>>>> 0x160
>>>>> [11279.504722] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20
>>>>> [11279.504725] [<ffffffffb992082f>]
system_call_fastpath+0x1c/0x21
>>>>> [11279.504730] [<ffffffffb992077b>] ?
system_call_after_swapgs+0xc8/
>>>>> 0x160
>>>>> [12127.466494] perf: interrupt took too long (8263 > 8150),
lowering
>>>>> kernel.perf_event_max_sample_rate to 24000
>>>>>
>>>>> --------------------
>>>>> I think this is the cause of the massive oVirt performance issues,
>>>>> irrespective of gluster volume. At the time this happened, I was also
>>>>> ssh'ed into the host and was running some rpm query commands. I had just
>>>>> run rpm -qa | grep glusterfs (to verify what version was actually
>>>>> installed), and that command took almost 2 minutes to return! Normally it
>>>>> takes less than 2 seconds. That is all pure local SSD IO, too....
>>>>>
>>>>> I'm no expert, but it's my understanding that any time software causes
>>>>> these kinds of issues, it's a serious bug in that software, even if it's
>>>>> mis-handled exceptions. Is this correct?
>>>>>
>>>>> --Jim
>>>>>
>>>>> On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir <jim(a)palousetech.com>
>>>>> wrote:
>>>>>
>>>>>> I think this is the profile information for one of the volumes
that
>>>>>> lives on the SSDs and is fully operational with no down/problem
disks:
>>>>>>
>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data info
>>>>>> Brick: ovirt2.nwfiber.com:/gluster/brick2/data
>>>>>> ----------------------------------------------
>>>>>> Cumulative Stats:
>>>>>>    Block Size:        256b+        512b+       1024b+
>>>>>>  No. of Reads:          983         2696         1059
>>>>>> No. of Writes:            0         1113          302
>>>>>>
>>>>>>    Block Size:       2048b+       4096b+       8192b+
>>>>>>  No. of Reads:          852        88608        53526
>>>>>> No. of Writes:          522       812340        76257
>>>>>>
>>>>>>    Block Size:      16384b+      32768b+      65536b+
>>>>>>  No. of Reads:        54351       241901        15024
>>>>>> No. of Writes:        21636         8656         8976
>>>>>>
>>>>>>    Block Size:     131072b+
>>>>>>  No. of Reads:       524156
>>>>>> No. of Writes:       296071
>>>>>>
>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           4189      RELEASE
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           1257   RELEASEDIR
>>>>>>       0.00      46.19 us      12.00 us      187.00 us             69        FLUSH
>>>>>>       0.00     147.00 us      78.00 us      367.00 us             86  REMOVEXATTR
>>>>>>       0.00     223.46 us      24.00 us     1166.00 us            149      READDIR
>>>>>>       0.00     565.34 us      76.00 us     3639.00 us             88    FTRUNCATE
>>>>>>       0.00     263.28 us      20.00 us    28385.00 us            228           LK
>>>>>>       0.00      98.84 us       2.00 us      880.00 us           1198      OPENDIR
>>>>>>       0.00      91.59 us      26.00 us    10371.00 us           3853       STATFS
>>>>>>       0.00     494.14 us      17.00 us   193439.00 us           1171     GETXATTR
>>>>>>       0.00     299.42 us      35.00 us     9799.00 us           2044     READDIRP
>>>>>>       0.00    1965.31 us     110.00 us   382258.00 us            321      XATTROP
>>>>>>       0.01     113.40 us      24.00 us    61061.00 us           8134         STAT
>>>>>>       0.01     755.38 us      57.00 us   607603.00 us           3196      DISCARD
>>>>>>       0.05    2690.09 us      58.00 us  2704761.00 us           3206         OPEN
>>>>>>       0.10  119978.25 us      97.00 us  9406684.00 us            154      SETATTR
>>>>>>       0.18     101.73 us      28.00 us   700477.00 us         313379        FSTAT
>>>>>>       0.23    1059.84 us      25.00 us  2716124.00 us          38255       LOOKUP
>>>>>>       0.47    1024.11 us      54.00 us  6197164.00 us          81455     FXATTROP
>>>>>>       1.72    2984.00 us      15.00 us 37098954.00 us         103020     FINODELK
>>>>>>       5.92   44315.32 us      51.00 us 24731536.00 us          23957        FSYNC
>>>>>>      13.27    2399.78 us      25.00 us 22089540.00 us         991005         READ
>>>>>>      37.00    5980.43 us      52.00 us 22099889.00 us        1108976        WRITE
>>>>>>      41.04    5452.75 us      13.00 us 22102452.00 us        1349053      INODELK
>>>>>>
>>>>>> Duration: 10026 seconds
>>>>>> Data Read: 80046027759 bytes
>>>>>> Data Written: 44496632320 bytes
>>>>>>
>>>>>> Interval 1 Stats:
>>>>>>    Block Size:        256b+        512b+       1024b+
>>>>>>  No. of Reads:          983         2696         1059
>>>>>> No. of Writes:            0          838          185
>>>>>>
>>>>>>    Block Size:       2048b+       4096b+       8192b+
>>>>>>  No. of Reads:          852        85856        51575
>>>>>> No. of Writes:          382       705802        57812
>>>>>>
>>>>>>    Block Size:      16384b+      32768b+      65536b+
>>>>>>  No. of Reads:        52673       232093        14984
>>>>>> No. of Writes:        13499         4908         4242
>>>>>>
>>>>>>    Block Size:     131072b+
>>>>>>  No. of Reads:       460040
>>>>>> No. of Writes:         6411
>>>>>>
>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           2093      RELEASE
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           1093   RELEASEDIR
>>>>>>       0.00      53.38 us      26.00 us      111.00 us             16        FLUSH
>>>>>>       0.00     145.14 us      78.00 us      367.00 us             71  REMOVEXATTR
>>>>>>       0.00     190.96 us     114.00 us      298.00 us             71      SETATTR
>>>>>>       0.00     213.38 us      24.00 us     1145.00 us             90      READDIR
>>>>>>       0.00     263.28 us      20.00 us    28385.00 us            228           LK
>>>>>>       0.00     101.76 us       2.00 us      880.00 us           1093      OPENDIR
>>>>>>       0.01      93.60 us      27.00 us    10371.00 us           3090       STATFS
>>>>>>       0.02     537.47 us      17.00 us   193439.00 us           1038     GETXATTR
>>>>>>       0.03     297.44 us      35.00 us     9799.00 us           1990     READDIRP
>>>>>>       0.03    2357.28 us     110.00 us   382258.00 us            253      XATTROP
>>>>>>       0.04     385.93 us      58.00 us    47593.00 us           2091         OPEN
>>>>>>       0.04     114.86 us      24.00 us    61061.00 us           7715         STAT
>>>>>>       0.06     444.59 us      57.00 us   333240.00 us           3053      DISCARD
>>>>>>       0.42     316.24 us      25.00 us   290728.00 us          29823       LOOKUP
>>>>>>       0.73     257.92 us      54.00 us   344812.00 us          63296     FXATTROP
>>>>>>       1.37      98.30 us      28.00 us    67621.00 us         313172        FSTAT
>>>>>>       1.58    2124.69 us      51.00 us   849200.00 us          16717        FSYNC
>>>>>>       5.73     162.46 us      52.00 us   748492.00 us         794079        WRITE
>>>>>>       7.19    2065.17 us      16.00 us 37098954.00 us          78381     FINODELK
>>>>>>      36.44     886.32 us      25.00 us  2216436.00 us         925421         READ
>>>>>>      46.30    1178.04 us      13.00 us  1700704.00 us         884635      INODELK
>>>>>>
>>>>>> Duration: 7485 seconds
>>>>>> Data Read: 71250527215 bytes
>>>>>> Data Written: 5119903744 bytes
>>>>>>
>>>>>> Brick: ovirt3.nwfiber.com:/gluster/brick2/data
>>>>>> ----------------------------------------------
>>>>>> Cumulative Stats:
>>>>>> Block Size: 1b+
>>>>>> No. of Reads: 0
>>>>>> No. of Writes: 3264419
>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>       0.00       0.00 us       0.00 us        0.00 us             90       FORGET
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           9462      RELEASE
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           4254   RELEASEDIR
>>>>>>       0.00      50.52 us      13.00 us      190.00 us             71        FLUSH
>>>>>>       0.00     186.97 us      87.00 us      713.00 us             86  REMOVEXATTR
>>>>>>       0.00      79.32 us      33.00 us      189.00 us            228           LK
>>>>>>       0.00     220.98 us     129.00 us      513.00 us             86      SETATTR
>>>>>>       0.01     259.30 us      26.00 us     2632.00 us            137      READDIR
>>>>>>       0.02     322.76 us     145.00 us     2125.00 us            321      XATTROP
>>>>>>       0.03     109.55 us       2.00 us     1258.00 us           1193      OPENDIR
>>>>>>       0.05      70.21 us      21.00 us      431.00 us           3196      DISCARD
>>>>>>       0.05     169.26 us      21.00 us     2315.00 us           1545     GETXATTR
>>>>>>       0.12     176.85 us      63.00 us     2844.00 us           3206         OPEN
>>>>>>       0.61     303.49 us      90.00 us     3085.00 us           9633        FSTAT
>>>>>>       2.44     305.66 us      28.00 us     3716.00 us          38230       LOOKUP
>>>>>>       4.52     266.22 us      55.00 us    53424.00 us          81455     FXATTROP
>>>>>>       6.96    1397.99 us      51.00 us    64822.00 us          23889        FSYNC
>>>>>>      16.48      84.74 us      25.00 us     6917.00 us         932592        WRITE
>>>>>>      30.16     106.90 us      13.00 us  3920189.00 us        1353046      INODELK
>>>>>>      38.55    1794.52 us      14.00 us 16210553.00 us         103039     FINODELK
>>>>>>
>>>>>> Duration: 66562 seconds
>>>>>> Data Read: 0 bytes
>>>>>> Data Written: 3264419 bytes
>>>>>>
>>>>>> Interval 1 Stats:
>>>>>> Block Size: 1b+
>>>>>> No. of Reads: 0
>>>>>> No. of Writes: 794080
>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           2093      RELEASE
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           1093   RELEASEDIR
>>>>>>       0.00      70.31 us      26.00 us      125.00 us             16        FLUSH
>>>>>>       0.00     193.10 us     103.00 us      713.00 us             71  REMOVEXATTR
>>>>>>       0.01     227.32 us     133.00 us      513.00 us             71      SETATTR
>>>>>>       0.01      79.32 us      33.00 us      189.00 us            228           LK
>>>>>>       0.01     259.83 us      35.00 us     1138.00 us             89      READDIR
>>>>>>       0.03     318.26 us     145.00 us     2047.00 us            253      XATTROP
>>>>>>       0.04     112.67 us       3.00 us     1258.00 us           1093      OPENDIR
>>>>>>       0.06     167.98 us      23.00 us     1951.00 us           1014     GETXATTR
>>>>>>       0.08      70.97 us      22.00 us      431.00 us           3053      DISCARD
>>>>>>       0.13     183.78 us      66.00 us     2844.00 us           2091         OPEN
>>>>>>       1.01     303.82 us      90.00 us     3085.00 us           9610        FSTAT
>>>>>>       3.27     316.59 us      30.00 us     3716.00 us          29820       LOOKUP
>>>>>>       5.83     265.79 us      59.00 us    53424.00 us          63296     FXATTROP
>>>>>>       7.95    1373.89 us      51.00 us    64822.00 us          16717        FSYNC
>>>>>>      23.17     851.99 us      14.00 us 16210553.00 us          78555     FINODELK
>>>>>>      24.04      87.44 us      27.00 us     6917.00 us         794081        WRITE
>>>>>>      34.36     111.91 us      14.00 us   984871.00 us         886790      INODELK
>>>>>>
>>>>>> Duration: 7485 seconds
>>>>>> Data Read: 0 bytes
>>>>>> Data Written: 794080 bytes
>>>>>>
>>>>>>
>>>>>> -----------------------
>>>>>> Here is the data from the volume that is backed by the SSHDs and has
>>>>>> one failed disk:
>>>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data-hdd info
>>>>>> Brick: 172.172.1.12:/gluster/brick3/data-hdd
>>>>>> --------------------------------------------
>>>>>> Cumulative Stats:
>>>>>>    Block Size:        256b+        512b+       1024b+
>>>>>>  No. of Reads:         1702           86           16
>>>>>> No. of Writes:            0          767           71
>>>>>>
>>>>>>    Block Size:       2048b+       4096b+       8192b+
>>>>>>  No. of Reads:           19        51841         2049
>>>>>> No. of Writes:           76        60668        35727
>>>>>>
>>>>>>    Block Size:      16384b+      32768b+      65536b+
>>>>>>  No. of Reads:         1744          639         1088
>>>>>> No. of Writes:         8524         2410         1285
>>>>>>
>>>>>>    Block Size:     131072b+
>>>>>>  No. of Reads:       771999
>>>>>> No. of Writes:        29584
>>>>>>
>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           2902      RELEASE
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           1517   RELEASEDIR
>>>>>>       0.00     197.00 us     197.00 us      197.00 us              1    FTRUNCATE
>>>>>>       0.00      70.24 us      16.00 us      758.00 us             51        FLUSH
>>>>>>       0.00     143.93 us      82.00 us      305.00 us             57  REMOVEXATTR
>>>>>>       0.00     178.63 us     105.00 us      712.00 us             60      SETATTR
>>>>>>       0.00      67.30 us      19.00 us      572.00 us            555           LK
>>>>>>       0.00     322.80 us      23.00 us     4673.00 us            138      READDIR
>>>>>>       0.00     336.56 us     106.00 us    11994.00 us            237      XATTROP
>>>>>>       0.00      84.70 us      28.00 us     1071.00 us           3469       STATFS
>>>>>>       0.01     387.75 us       2.00 us   146017.00 us           1467      OPENDIR
>>>>>>       0.01     148.59 us      21.00 us    64374.00 us           4454         STAT
>>>>>>       0.02     783.02 us      16.00 us    93502.00 us           1902     GETXATTR
>>>>>>       0.03    1516.10 us      17.00 us   210690.00 us           1364      ENTRYLK
>>>>>>       0.03    2555.47 us     300.00 us   674454.00 us           1064     READDIRP
>>>>>>       0.07      85.74 us      19.00 us    68340.00 us          62849        FSTAT
>>>>>>       0.07    1978.12 us      59.00 us   202596.00 us           2729         OPEN
>>>>>>       0.22     708.57 us      15.00 us   394799.00 us          25447       LOOKUP
>>>>>>       5.94    2331.74 us      15.00 us  1099530.00 us         207534     FINODELK
>>>>>>       7.31    8311.75 us      58.00 us  1800216.00 us          71668     FXATTROP
>>>>>>      12.49    7735.19 us      51.00 us  3595513.00 us         131642        WRITE
>>>>>>      17.70     957.08 us      16.00 us 13700466.00 us        1508160      INODELK
>>>>>>      24.55    2546.43 us      26.00 us  5077347.00 us         786060         READ
>>>>>>      31.56   49699.15 us      47.00 us  3746331.00 us          51777        FSYNC
>>>>>>
>>>>>> Duration: 10101 seconds
>>>>>> Data Read: 101562897361 bytes
>>>>>> Data Written: 4834450432 bytes
>>>>>>
>>>>>> Interval 0 Stats:
>>>>>>    Block Size:        256b+        512b+       1024b+
>>>>>>  No. of Reads:         1702           86           16
>>>>>> No. of Writes:            0          767           71
>>>>>>
>>>>>>    Block Size:       2048b+       4096b+       8192b+
>>>>>>  No. of Reads:           19        51841         2049
>>>>>> No. of Writes:           76        60668        35727
>>>>>>
>>>>>>    Block Size:      16384b+      32768b+      65536b+
>>>>>>  No. of Reads:         1744          639         1088
>>>>>> No. of Writes:         8524         2410         1285
>>>>>>
>>>>>>    Block Size:     131072b+
>>>>>>  No. of Reads:       771999
>>>>>> No. of Writes:        29584
>>>>>>
>>>>>>  %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls          Fop
>>>>>>  ---------   -----------   -----------    -----------   ------------         ----
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           2902      RELEASE
>>>>>>       0.00       0.00 us       0.00 us        0.00 us           1517   RELEASEDIR
>>>>>>       0.00     197.00 us     197.00 us      197.00 us              1    FTRUNCATE
>>>>>>       0.00      70.24 us      16.00 us      758.00 us             51        FLUSH
>>>>>>       0.00     143.93 us      82.00 us      305.00 us             57  REMOVEXATTR
>>>>>>       0.00     178.63 us     105.00 us      712.00 us             60      SETATTR
>>>>>>       0.00      67.30 us      19.00 us      572.00 us            555           LK
>>>>>>       0.00     322.80 us      23.00 us     4673.00 us            138      READDIR
>>>>>>       0.00     336.56 us     106.00 us    11994.00 us            237      XATTROP
>>>>>>       0.00      84.70 us      28.00 us     1071.00 us           3469       STATFS
>>>>>>       0.01     387.75 us       2.00 us   146017.00 us           1467      OPENDIR
>>>>>>       0.01     148.59 us      21.00 us    64374.00 us           4454         STAT
>>>>>>       0.02     783.02 us      16.00 us    93502.00 us           1902     GETXATTR
>>>>>>       0.03    1516.10 us      17.00 us   210690.00 us           1364      ENTRYLK
>>>>>>       0.03    2555.47 us     300.00 us   674454.00 us           1064     READDIRP
>>>>>>       0.07      85.73 us      19.00 us    68340.00 us          62849        FSTAT
>>>>>>       0.07    1978.12 us      59.00 us   202596.00 us           2729         OPEN
>>>>>>       0.22     708.57 us      15.00 us   394799.00 us          25447       LOOKUP
>>>>>>       5.94    2334.57 us      15.00 us  1099530.00 us         207534     FINODELK
>>>>>>       7.31    8311.49 us      58.00 us  1800216.00 us          71668     FXATTROP
>>>>>>      12.49    7735.32 us      51.00 us  3595513.00 us         131642        WRITE
>>>>>>      17.71     957.08 us      16.00 us 13700466.00 us        1508160      INODELK
>>>>>>      24.56    2546.42 us      26.00 us  5077347.00 us         786060         READ
>>>>>>      31.54   49651.63 us      47.00 us  3746331.00 us          51777        FSYNC
>>>>>>
>>>>>> Duration: 10101 seconds
>>>>>> Data Read: 101562897361 bytes
>>>>>> Data Written: 4834450432 bytes
>>>>>>
>>>>>>
>>>>>> On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir <jim(a)palousetech.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thank you for your response.
>>>>>>>
>>>>>>> I have 4 gluster volumes. 3 are replica 2 + arbiter; the replica
>>>>>>> bricks are on ovirt1 and ovirt2, with the arbiter on ovirt3. The 4th
>>>>>>> volume is replica 3, with a brick on all three ovirt machines.
>>>>>>>
>>>>>>> The first 3 volumes are on an SSD; the 4th is on a Seagate SSHD
>>>>>>> (the same in all three machines). On ovirt3, the SSHD has reported hard
>>>>>>> IO failures, and that brick is offline. However, the other two replicas
>>>>>>> are fully operational (although they still show entries in the heal info
>>>>>>> output that won't go away, but that may be expected until I replace the
>>>>>>> failed disk).
>>>>>>>
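For what it's worth, the brick and self-heal daemon states described here can be confirmed with the standard gluster status commands (the volume name data-hdd is taken from this thread; output columns vary a little between releases):

    # which bricks and self-heal daemons are actually online right now?
    gluster volume status data-hdd
    gluster peer status

    # per-brick detail, including the backing device and free space
    gluster volume status data-hdd detail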
>>>>>>> What is bothering me is that ALL 4 gluster volumes are showing
>>>>>>> horrible performance issues. At this point, as the bad disk has been
>>>>>>> completely offlined, I would expect gluster to perform at normal speed,
>>>>>>> but that is definitely not the case.
>>>>>>>
>>>>>>> I've also noticed that the performance hits seem to come in waves:
>>>>>>> things seem to work acceptably (but slowly) for a while, then suddenly
>>>>>>> it's as if all disk IO on all volumes (including the non-gluster local OS
>>>>>>> disk volumes for the hosts) pauses for about 30 seconds, then IO resumes
>>>>>>> again. During those times, I start getting "VM not responding" and "host
>>>>>>> not responding" notices, as well as the applications having major issues.
>>>>>>>
>>>>>>> I've shut down most of my VMs and am down to just my essential core
>>>>>>> VMs (I shed about 75% of my VMs). I am still experiencing the same
>>>>>>> issues.
>>>>>>>
>>>>>>> Am I correct in believing that once the failed disk was brought
>>>>>>> offline, performance should return to normal?
>>>>>>>
>>>>>>> On Tue, May 29, 2018 at 1:27 PM, Alex K <rightkicktech(a)gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I would check disk status and the accessibility of the mount points
>>>>>>>> where your gluster volumes reside.
>>>>>>>>
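A minimal sketch of the kind of checks being suggested here (the device name is illustrative, not taken from this cluster; smartctl assumes the smartmontools package; the brick paths follow the /gluster/brick* layout shown elsewhere in this thread):

    # SMART health of the suspect drive (example device name)
    smartctl -a /dev/sdb

    # recent kernel-level disk errors
    dmesg | tail -n 50

    # are the brick filesystems still mounted and responsive?
    df -h /gluster/brick2 /gluster/brick3
    ls /gluster/brick3/data-hdd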
>>>>>>>> On Tue, May 29, 2018, 22:28 Jim Kusznir <jim(a)palousetech.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On one ovirt server, I'm now seeing these messages:
>>>>>>>>> [56474.239725] blk_update_request: 63 callbacks suppressed
>>>>>>>>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
>>>>>>>>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
>>>>>>>>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
>>>>>>>>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
>>>>>>>>>
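A minimal sketch of how dm-2 can be traced back to its LV and physical disk (not from the original thread; the names are whatever the kernel assigned on that host):

    # which device-mapper target is dm-2?
    ls -l /dev/mapper | grep 'dm-2$'
    dmsetup info -c

    # show the full block-device tree, LVs included
    lsblk -o NAME,KNAME,SIZE,TYPE,MOUNTPOINT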
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>
>>>>>>>>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>>>>>>>>>
>>>>>>>>>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>>> (appears a lot).
>>>>>>>>>>
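A minimal way to check the ovsdb complaint (not from the thread; assumes the stock openvswitch systemd units on an oVirt 4.2 host):

    # is the OVS database server running, and does its socket exist?
    systemctl status openvswitch ovsdb-server
    ls -l /var/run/openvswitch/db.sock

    # if it is down, restarting usually recreates the socket
    systemctl restart openvswitch
    ovs-vsctl show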
>>>>>>>>>> I also found, in the ssh session on that host, some sysv warnings
>>>>>>>>>> about the backing disk for one of the gluster volumes (the straight
>>>>>>>>>> replica 3). The glusterfs brick process for that disk on that machine
>>>>>>>>>> went offline. It's my understanding that the volume should continue to
>>>>>>>>>> work with the other two machines while I attempt to replace that disk,
>>>>>>>>>> right? Attempted writes (touching an empty file) can take 15 seconds;
>>>>>>>>>> repeating the same write later is much faster.
>>>>>>>>>>
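A minimal sketch of the usual disk-replacement flow for a replica volume, for when the failed SSHD is swapped (the brick address is taken from the heal output later in this thread and assumed to be the ovirt3 brick; reset-brick is the gluster 3.9+ syntax and should be checked against the installed release):

    # take the dead brick out of service
    gluster volume reset-brick data-hdd 172.172.1.13:/gluster/brick3/data-hdd start

    # ...replace the disk, recreate the filesystem, remount it at the same path...

    # bring the (now empty) brick back and let self-heal repopulate it
    gluster volume reset-brick data-hdd 172.172.1.13:/gluster/brick3/data-hdd \
        172.172.1.13:/gluster/brick3/data-hdd commit force
    gluster volume heal data-hdd full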
>>>>>>>>>> Gluster generates a bunch of different log files; I don't know
>>>>>>>>>> which ones you want, or from which machine(s).
>>>>>>>>>>
>>>>>>>>>> How do I do "volume profiling"?
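For reference, a minimal sketch of how volume profiling is normally driven from the gluster CLI, shown against the data-hdd volume from this thread (profiling adds a small overhead while enabled):

    # start collecting per-FOP latency counters on the volume
    gluster volume profile data-hdd start

    # after reproducing the slowness, dump cumulative and interval stats
    gluster volume profile data-hdd info

    # turn it off again when done
    gluster volume profile data-hdd stop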
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <sabose(a)redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Do you see errors reported in the mount logs for the volume? If
>>>>>>>>>>> so, could you attach the logs?
>>>>>>>>>>> Any issues with your underlying disks? Can you also attach the
>>>>>>>>>>> output of volume profiling?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok, things have gotten MUCH worse this morning. I'm getting
>>>>>>>>>>>> random errors from VMs; right now, about a third of my VMs have
>>>>>>>>>>>> been paused due to storage issues, and most of the remaining VMs
>>>>>>>>>>>> are not performing well.
>>>>>>>>>>>>
>>>>>>>>>>>> At this point, I am in full EMERGENCY mode, as my production
>>>>>>>>>>>> services are now impacted, and I'm getting calls coming in with
>>>>>>>>>>>> problems...
>>>>>>>>>>>>
>>>>>>>>>>>> I'd greatly appreciate help... VMs are running VERY slowly (when
>>>>>>>>>>>> they run), and they are steadily getting worse. I don't know why. I
>>>>>>>>>>>> was seeing CPU peaks (to 100%) on several VMs, in perfect sync, for
>>>>>>>>>>>> a few minutes at a time (while the VM became unresponsive and any
>>>>>>>>>>>> Linux VMs I was logged into were giving me the "CPU stuck" messages
>>>>>>>>>>>> from my original post). Is all this storage related?
>>>>>>>>>>>>
>>>>>>>>>>>> I also have two different gluster volumes for VM storage, and
>>>>>>>>>>>> only one had the issues, but now VMs on both are being affected at
>>>>>>>>>>>> the same time and in the same way.
>>>>>>>>>>>>
>>>>>>>>>>>> --Jim
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <sabose(a)redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> [Adding gluster-users to look at the
heal issue]
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <jim(a)palousetech.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've been having some cluster and gluster performance issues
>>>>>>>>>>>>>> lately. I also found that my cluster was out of date, and was
>>>>>>>>>>>>>> trying to apply updates (hoping to fix some of these) when I
>>>>>>>>>>>>>> discovered the oVirt 4.1 repos had been taken completely offline.
>>>>>>>>>>>>>> So, I was forced to begin an upgrade to 4.2. According to the docs
>>>>>>>>>>>>>> I found/read, I needed only to add the new repo, do a yum update,
>>>>>>>>>>>>>> reboot, and be good on my hosts (I did the yum update, plus
>>>>>>>>>>>>>> engine-setup on my hosted engine). Things seemed to work relatively
>>>>>>>>>>>>>> well, except for a gluster sync issue that showed up.
>>>>>>>>>>>>>>
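For reference, a minimal sketch of the host-side steps being described, under the assumption that the standard oVirt 4.2 release RPM location is used (worth verifying against the current upgrade guide before running anything):

    # on each host: add the 4.2 repo, update, reboot
    yum install https://resources.ovirt.org/pub/yum-repo/ovirt-release42.rpm
    yum update
    reboot

    # on the hosted-engine VM, after its ovirt-release RPM is updated
    engine-setup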
>>>>>>>>>>>>>> My cluster is a 3-node hyperconverged cluster. I upgraded the
>>>>>>>>>>>>>> hosted engine first, then host 3 (ovirt3). When that host came
>>>>>>>>>>>>>> back up, for some reason one of my gluster volumes would not sync.
>>>>>>>>>>>>>> Here's sample output:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>>>>>>>>>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---------
>>>>>>>>>>>>>> It's been in this state for a couple of days now, and bandwidth
>>>>>>>>>>>>>> monitoring shows no appreciable data moving. I've repeatedly
>>>>>>>>>>>>>> commanded a full heal from all three nodes in the cluster. It's
>>>>>>>>>>>>>> always the same files that need healing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When running gluster volume heal data-hdd statistics, I sometimes
>>>>>>>>>>>>>> see different information, but always some number of "heal failed"
>>>>>>>>>>>>>> entries. It shows 0 for split-brain.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not quite sure what to do. I suspect it may be due to nodes 1
>>>>>>>>>>>>>> and 2 still being on the older ovirt/gluster release, but I'm afraid
>>>>>>>>>>>>>> to upgrade and reboot them until I have a good gluster sync (I don't
>>>>>>>>>>>>>> need to create a split-brain issue). How do I proceed with this?
>>>>>>>>>>>>>>
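A minimal sketch of the heal-related commands that usually help narrow this down (the volume name is from this thread; some subcommands differ slightly between gluster releases):

    # what still needs healing, and is any of it split-brain?
    gluster volume heal data-hdd info
    gluster volume heal data-hdd info split-brain

    # kick off index and full heals again
    gluster volume heal data-hdd
    gluster volume heal data-hdd full

    # per-brick heal counters, and the self-heal daemon's own log
    gluster volume heal data-hdd statistics heal-count
    tail -n 50 /var/log/glusterfs/glustershd.log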
>>>>>>>>>>>>>> Second issue: I've been experiencing VERY POOR performance on most
>>>>>>>>>>>>>> of my VMs, to the point that logging into a Windows 10 VM via remote
>>>>>>>>>>>>>> desktop can take 5 minutes, and launching QuickBooks inside said VM
>>>>>>>>>>>>>> can easily take 10 minutes. On some Linux VMs, I get random messages
>>>>>>>>>>>>>> like this:
>>>>>>>>>>>>>> Message from syslogd@unifi at May 28 20:39:23 ...
>>>>>>>>>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mongod:14766]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (the process and PID are often different)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not quite sure what to do about this either. My initial thought
>>>>>>>>>>>>>> was to upgrade everything to current and see if it's still there, but
>>>>>>>>>>>>>> I cannot move forward with that until my gluster is healed...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Users mailing list -- users(a)ovirt.org
>>>>>>>>>>>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>>>>>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>>>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>>>>>>>>>>> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOTIRQF/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>