[ovirt-users] VDSM hang

Nir Soffer nsoffer at redhat.com
Fri Mar 10 12:56:01 UTC 2017


On Fri, Mar 10, 2017 at 1:07 PM, paf1 at email.cz <paf1 at email.cz> wrote:
> Hello everybody,
>
> for production usage I'm testing oVirt with Gluster.
> All components seem to be running fine, but whenever I test a huge
> workload, the node freezes. Not the main OS, but VDSM management and attached
> services, VMs etc.

What do you mean by "freezes"?

>
> mgmt
> oVirt - 4.1.0.4
> centos 7.3-1611
>
>
> nodes ( installed from ovirt image
> "ovirt-node-ng-installer-ovirt-4.1-2017030804.iso"  )
>
> OS Version: == RHEL - 7 - 3.1611.el7.centos
> OS Description:== oVirt Node 4.1.0
> Kernel Version:== 3.10.0 - 514.10.2.el7.x86_64
> KVM Version:== 2.6.0 - 28.el7_3.3.1
> LIBVIRT Version:== libvirt-2.0.0-10.el7_3.5
> VDSM Version:== vdsm-4.19.4-1.el7.centos
> SPICE Version:== 0.12.4 - 20.el7_3
> GlusterFS Version:== glusterfs-3.8.9-1.el7  ( LVM thin provisioning in
> replica 2 - created from the oVirt GUI )
>
> concurrently running:
> - huge import from export domain    ( net workload )
> - sequential write to VMs local disk ( gluster replica sequential workload )
> - VMs database huge select  (  random IOps )
> - huge old snapshot delete  ( random IOps )
>
> In this configuration / workload it runs for about an hour with no exceptions,
> with 70-80% disk load, but at some point VDSM freezes all jobs for a
> timeout and the VMs go into "unknown" status.
> The whole system then recovers automatically within roughly a 20 min time
> frame ( except the import and the snapshot deletion (rollback) )
>
> engine.log  - focus on the 10:39:07 timestamp  ( Failed in
> 'HSMGetAllTasksStatusesVDS' method )
> ========
>
> n child command id: 'a8a3a4d5-cf7d-4423-8243-022911232508'
> type:'RemoveSnapshotSingleDiskLive' to complete
> 2017-03-10 10:39:01,727+01 INFO
> [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback]
> (DefaultQuartzScheduler2) [759c8e1f] Command 'RemoveSnapshotSingleDiskLive'
> (id: 'a8a3a4d5-cf7d-4423-8243-022911232508') waiting on child command id:
> '33df2c1e-6ce3-44fd-a39b-d111883b4c4e' type:'DestroyImage' to complete
> 2017-03-10 10:39:03,929+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler5) [fde51205-3e8b-4b84-a478-352dc444ccc4] START,
> GlusterServersListVDSCommand(HostName = 2kvm1,
> VdsIdVDSCommandParametersBase:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 446d0cd3
> 2017-03-10 10:39:04,343+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler5) [fde51205-3e8b-4b84-a478-352dc444ccc4] FINISH,
> GlusterServersListVDSCommand, return: [172.16.5.163/24:CONNECTED,
> 16.0.0.164:CONNECTED], log id: 446d0cd3
> 2017-03-10 10:39:04,353+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler5) [fde51205-3e8b-4b84-a478-352dc444ccc4] START,
> GlusterVolumesListVDSCommand(HostName = 2kvm1,
> GlusterVolumesListVDSParameters:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 69ea1fda
> 2017-03-10 10:39:05,128+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler5) [fde51205-3e8b-4b84-a478-352dc444ccc4] FINISH,
> GlusterVolumesListVDSCommand, return:
> {8ded4083-2f31-489e-a60d-a315a5eb9b22=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity at 7765e4ad},
> log id: 69ea1fda
> 2017-03-10 10:39:07,163+01 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand]
> (DefaultQuartzScheduler2) [759c8e1f] Failed in 'HSMGetAllTasksStatusesVDS'
> method
> 2017-03-10 10:39:07,178+01 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler2) [759c8e1f] EVENT_ID:
> VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null,
> Custom Event ID: -1, Message: VDSM 2kvm2 command HSMGetAllTasksStatusesVDS
> failed: Connection timed out
> 2017-03-10 10:39:07,182+01 INFO
> [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler2)
> [759c8e1f] BaseAsyncTask::onTaskEndSuccess: Task
> 'f594bf69-619b-4d1b-8f6d-a9826997e478' (Parent Command 'ImportVm',
> Parameters Type
> 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended
> successfully.
> 2017-03-10 10:39:07,182+01 INFO
> [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (DefaultQuartzScheduler2)
> [759c8e1f] Task with DB Task ID 'a05c7c07-9b98-4ab2-ac7b-9e70a75ba7b7' and
> VDSM Task ID '7c60369f-70a3-4a6a-80c9-4753ac9ed372' is in state Polling. End
> action for command 8deb3fe3-4a83-4605-816c-ffdc63fd9ac1 will proceed when
> all the entity's tasks are completed.
> 2017-03-10 10:39:07,182+01 INFO
> [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler2)
> [759c8e1f] SPMAsyncTask::PollTask: Polling task
> 'f351e8f6-6dd7-49aa-bf54-650d84fc6352' (Parent Command 'DestroyImage',
> Parameters Type
> 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned
> status 'finished', result 'cleanSuccess'.
> 2017-03-10 10:39:07,182+01 ERROR
> [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler2)
> [759c8e1f] BaseAsyncTask::logEndTaskFailure: Task
> 'f351e8f6-6dd7-49aa-bf54-650d84fc6352' (Parent Command 'DestroyImage',
> Parameters Type
> 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with
> failure:
> -- Result: 'cleanSuccess'
> -- Message: 'VDSGenericException: VDSErrorException: Failed to
> HSMGetAllTasksStatusesVDS, error = Connection timed out, code = 100',
> -- Exception: 'VDSGenericException: VDSErrorException: Failed to
> HSMGetAllTasksStatusesVDS, error = Connection timed out, code = 100'
> 2017-03-10 10:39:07,184+01 INFO
> [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler2)
> [759c8e1f] CommandAsyncTask::endActionIfNecessary: All tasks of command
> '33df2c1e-6ce3-44fd-a39b-d111883b4c4e' has ended -> executing 'endAction'
> 2017-03-10 10:39:07,185+01 INFO
> [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (DefaultQuartzScheduler2)
> [759c8e1f] CommandAsyncTask::endAction: Ending action for '1' tasks (command
> ID: '33df2c1e-6ce3-44fd-a39b-d111883b4c4e'): calling endAction '.
> 2017-03-10 10:39:07,185+01 INFO
> [org.ovirt.engine.core.bll.tasks.CommandAsyncTask]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f]
> CommandAsyncTask::endCommandAction [within thread] context: Attempting to
> endAction 'DestroyImage',
> 2017-03-10 10:39:07,192+01 INFO
> [org.ovirt.engine.core.bll.storage.disk.image.DestroyImageCommand]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f] Command
> [id=33df2c1e-6ce3-44fd-a39b-d111883b4c4e]: Updating status to 'FAILED', The
> command end method logic will be executed by one of its parent commands.
> 2017-03-10 10:39:07,192+01 INFO
> [org.ovirt.engine.core.bll.tasks.CommandAsyncTask]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f]
> CommandAsyncTask::HandleEndActionResult [within thread]: endAction for
> action type 'DestroyImage' completed, handling the result.
> 2017-03-10 10:39:07,192+01 INFO
> [org.ovirt.engine.core.bll.tasks.CommandAsyncTask]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f]
> CommandAsyncTask::HandleEndActionResult [within thread]: endAction for
> action type 'DestroyImage' succeeded, clearing tasks.
> 2017-03-10 10:39:07,192+01 INFO
> [org.ovirt.engine.core.bll.tasks.SPMAsyncTask]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f] SPMAsyncTask::ClearAsyncTask:
> Attempting to clear task 'f351e8f6-6dd7-49aa-bf54-650d84fc6352'
> 2017-03-10 10:39:07,193+01 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f] START,
> SPMClearTaskVDSCommand(
> SPMTaskGuidBaseVDSCommandParameters:{runAsync='true',
> storagePoolId='00000001-0001-0001-0001-000000000311',
> ignoreFailoverLimit='false',
> taskId='f351e8f6-6dd7-49aa-bf54-650d84fc6352'}), log id: 2b7080c2
> 2017-03-10 10:39:07,194+01 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f] START,
> HSMClearTaskVDSCommand(HostName = 2kvm2,
> HSMTaskGuidBaseVDSCommandParameters:{runAsync='true',
> hostId='905375e1-6de4-4fdf-b69c-b2d546f869c8',
> taskId='f351e8f6-6dd7-49aa-bf54-650d84fc6352'}), log id: 2edff460
> 2017-03-10 10:39:08,208+01 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f] FINISH,
> HSMClearTaskVDSCommand, log id: 2edff460
> 2017-03-10 10:39:08,208+01 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f] FINISH,
> SPMClearTaskVDSCommand, log id: 2b7080c2
> 2017-03-10 10:39:08,213+01 INFO
> [org.ovirt.engine.core.bll.tasks.SPMAsyncTask]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f]
> BaseAsyncTask::removeTaskFromDB: Removed task
> 'f351e8f6-6dd7-49aa-bf54-650d84fc6352' from DataBase
> 2017-03-10 10:39:08,213+01 INFO
> [org.ovirt.engine.core.bll.tasks.CommandAsyncTask]
> (org.ovirt.thread.pool-6-thread-31) [759c8e1f]
> CommandAsyncTask::HandleEndActionResult [within thread]: Removing
> CommandMultiAsyncTasks object for entity
> '33df2c1e-6ce3-44fd-a39b-d111883b4c4e'
> 2017-03-10 10:39:10,142+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler10) [a86dc7b5-52dc-40d4-a3b9-49d7eabbb93c] START,
> GlusterServersListVDSCommand(HostName = 2kvm1,
> VdsIdVDSCommandParametersBase:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 2e7278cb
> 2017-03-10 10:39:11,513+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler10) [a86dc7b5-52dc-40d4-a3b9-49d7eabbb93c] FINISH,
> GlusterServersListVDSCommand, return: [172.16.5.163/24:CONNECTED,
> 16.0.0.164:CONNECTED], log id: 2e7278cb
> 2017-03-10 10:39:11,523+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler10) [a86dc7b5-52dc-40d4-a3b9-49d7eabbb93c] START,
> GlusterVolumesListVDSCommand(HostName = 2kvm1,
> GlusterVolumesListVDSParameters:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 43704ef2
> 2017-03-10 10:39:11,777+01 INFO
> [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback]
> (DefaultQuartzScheduler9) [67e1d8ed] Command 'RemoveSnapshot' (id:
> '13c2cb7c-0809-4971-aceb-37ae66105ab7') waiting on child command id:
> 'a8a3a4d5-cf7d-4423-8243-022911232508' type:'RemoveSnapshotSingleDiskLive'
> to complete
> 2017-03-10 10:39:11,789+01 WARN
> [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand]
> (DefaultQuartzScheduler9) [759c8e1f] Child command 'DESTROY_IMAGE' failed,
> proceeding to verify
> 2017-03-10 10:39:11,789+01 INFO
> [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand]
> (DefaultQuartzScheduler9) [759c8e1f] Executing Live Merge command step
> 'DESTROY_IMAGE_CHECK'
> 2017-03-10 10:39:11,832+01 INFO
> [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-5-thread-7)
> [4856f570] Running command: DestroyImageCheckCommand internal: true.
> 2017-03-10 10:39:11,833+01 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.SPMGetVolumeInfoVDSCommand]
> (pool-5-thread-7) [4856f570] START, SPMGetVolumeInfoVDSCommand(
> SPMGetVolumeInfoVDSCommandParameters:{expectedEngineErrors='[VolumeDoesNotExist]',
> runAsync='true', storagePoolId='00000001-0001-0001-0001-000000000311',
> ignoreFailoverLimit='false',
> storageDomainId='1603cd90-92ef-4c03-922c-cecb282fd00e',
> imageGroupId='7543338a-3ca6-4698-bb50-c14f0bd71428',
> imageId='50b592f7-bfba-4398-879c-8d6a19a2c000'}), log id: 2c8031f8
> 2017-03-10 10:39:11,833+01 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.SPMGetVolumeInfoVDSCommand]
> (pool-5-thread-7) [4856f570] Executing GetVolumeInfo using the current SPM
> 2017-03-10 10:39:11,834+01 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand]
> (pool-5-thread-7) [4856f570] START, GetVolumeInfoVDSCommand(HostName =
> 2kvm2,
> GetVolumeInfoVDSCommandParameters:{expectedEngineErrors='[VolumeDoesNotExist]',
> runAsync='true', hostId='905375e1-6de4-4fdf-b69c-b2d546f869c8',
> storagePoolId='00000001-0001-0001-0001-000000000311',
> storageDomainId='1603cd90-92ef-4c03-922c-cecb282fd00e',
> imageGroupId='7543338a-3ca6-4698-bb50-c14f0bd71428',
> imageId='50b592f7-bfba-4398-879c-8d6a19a2c000'}), log id: 79ca86cc
> 2017-03-10 10:39:11,846+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler10) [a86dc7b5-52dc-40d4-a3b9-49d7eabbb93c] FINISH,
> GlusterVolumesListVDSCommand, return:
> {8ded4083-2f31-489e-a60d-a315a5eb9b22=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity at 7765e4ad},
> log id: 43704ef2
> 2017-03-10 10:39:16,858+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler7) [d82701d9-9fa3-467d-b273-f5fe5a93062f] START,
> GlusterServersListVDSCommand(HostName = 2kvm1,
> VdsIdVDSCommandParametersBase:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 6542adcd
> 2017-03-10 10:39:17,394+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler7) [d82701d9-9fa3-467d-b273-f5fe5a93062f] FINISH,
> GlusterServersListVDSCommand, return: [172.16.5.163/24:CONNECTED,
> 16.0.0.164:CONNECTED], log id: 6542adcd
> 2017-03-10 10:39:17,406+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler7) [d82701d9-9fa3-467d-b273-f5fe5a93062f] START,
> GlusterVolumesListVDSCommand(HostName = 2kvm1,
> GlusterVolumesListVDSParameters:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 44ec33ed
> 2017-03-10 10:39:18,598+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler7) [d82701d9-9fa3-467d-b273-f5fe5a93062f] FINISH,
> GlusterVolumesListVDSCommand, return:
> {8ded4083-2f31-489e-a60d-a315a5eb9b22=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity at 7765e4ad},
> log id: 44ec33ed
> 2017-03-10 10:39:21,865+01 INFO
> [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback]
> (DefaultQuartzScheduler6) [67e1d8ed] Command 'RemoveSnapshot' (id:
> '13c2cb7c-0809-4971-aceb-37ae66105ab7') waiting on child command id:
> 'a8a3a4d5-cf7d-4423-8243-022911232508' type:'RemoveSnapshotSingleDiskLive'
> to complete
> 2017-03-10 10:39:21,881+01 INFO
> [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback]
> (DefaultQuartzScheduler6) [4856f570] Command 'RemoveSnapshotSingleDiskLive'
> (id: 'a8a3a4d5-cf7d-4423-8243-022911232508') waiting on child command id:
> 'b1d63b8e-19d3-4d64-8fa8-4eb3e2d1a8fc' type:'DestroyImageCheck' to complete
> 2017-03-10 10:39:23,611+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler6) [4856f570] START,
> GlusterServersListVDSCommand(HostName = 2kvm1,
> VdsIdVDSCommandParametersBase:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 4c2fc22d
> 2017-03-10 10:39:24,616+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand]
> (DefaultQuartzScheduler7) [d82701d9-9fa3-467d-b273-f5fe5a93062f] START,
> GlusterTasksListVDSCommand(HostName = 2kvm1,
> VdsIdVDSCommandParametersBase:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 1f169371
> 2017-03-10 10:39:24,618+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler6) [4856f570] FINISH, GlusterServersListVDSCommand,
> return: [172.16.5.163/24:CONNECTED, 16.0.0.164:CONNECTED], log id: 4c2fc22d
> 2017-03-10 10:39:24,629+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler6) [4856f570] START,
> GlusterVolumesListVDSCommand(HostName = 2kvm1,
> GlusterVolumesListVDSParameters:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 2ac55735
> 2017-03-10 10:39:24,822+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand]
> (DefaultQuartzScheduler7) [d82701d9-9fa3-467d-b273-f5fe5a93062f] FINISH,
> GlusterTasksListVDSCommand, return: [], log id: 1f169371
> 2017-03-10 10:39:26,836+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler6) [4856f570] FINISH, GlusterVolumesListVDSCommand,
> return:
> {8ded4083-2f31-489e-a60d-a315a5eb9b22=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity at 7765e4ad},
> log id: 2ac55735
> 2017-03-10 10:39:31,849+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler8) [fde51205-3e8b-4b84-a478-352dc444ccc4] START,
> GlusterServersListVDSCommand(HostName = 2kvm1,
> VdsIdVDSCommandParametersBase:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 2e8dbcd1
> 2017-03-10 10:39:31,932+01 INFO
> [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback]
> (DefaultQuartzScheduler6) [67e1d8ed] Command 'RemoveSnapshot' (id:
> '13c2cb7c-0809-4971-aceb-37ae66105ab7') waiting on child command id:
> 'a8a3a4d5-cf7d-4423-8243-022911232508' type:'RemoveSnapshotSingleDiskLive'
> to complete
> 2017-03-10 10:39:31,944+01 INFO
> [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback]
> (DefaultQuartzScheduler6) [4856f570] Command 'RemoveSnapshotSingleDiskLive'
> (id: 'a8a3a4d5-cf7d-4423-8243-022911232508') waiting on child command id:
> 'b1d63b8e-19d3-4d64-8fa8-4eb3e2d1a8fc' type:'DestroyImageCheck' to complete
> 2017-03-10 10:39:33,213+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler8) [fde51205-3e8b-4b84-a478-352dc444ccc4] FINISH,
> GlusterServersListVDSCommand, return: [172.16.5.163/24:CONNECTED,
> 16.0.0.164:CONNECTED], log id: 2e8dbcd1
> 2017-03-10 10:39:33,226+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler8) [fde51205-3e8b-4b84-a478-352dc444ccc4] START,
> GlusterVolumesListVDSCommand(HostName = 2kvm1,
> GlusterVolumesListVDSParameters:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 1fb3f9e3
> 2017-03-10 10:39:34,375+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler8) [fde51205-3e8b-4b84-a478-352dc444ccc4] FINISH,
> GlusterVolumesListVDSCommand, return:
> {8ded4083-2f31-489e-a60d-a315a5eb9b22=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity at 7765e4ad},
> log id: 1fb3f9e3
> 2017-03-10 10:39:39,392+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler9) [12d6d15f-e054-4833-bd87-58f6a51e5fa6] START,
> GlusterServersListVDSCommand(HostName = 2kvm1,
> VdsIdVDSCommandParametersBase:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 1e0b8eeb
> 2017-03-10 10:39:40,753+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand]
> (DefaultQuartzScheduler9) [12d6d15f-e054-4833-bd87-58f6a51e5fa6] FINISH,
> GlusterServersListVDSCommand, return: [172.16.5.163/24:CONNECTED,
> 16.0.0.164:CONNECTED], log id: 1e0b8eeb
> 2017-03-10 10:39:40,763+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler9) [12d6d15f-e054-4833-bd87-58f6a51e5fa6] START,
> GlusterVolumesListVDSCommand(HostName = 2kvm1,
> GlusterVolumesListVDSParameters:{runAsync='true',
> hostId='86876b79-71d8-4ae1-883b-ba010cd270e7'}), log id: 35b04b33
> 2017-03-10 10:39:41,952+01 INFO
> [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
> (DefaultQuartzScheduler9) [12d6d15f-e054-4833-bd87-58f6a51e5fa6] FINISH,
> GlusterVolumesListVDSCommand, return:
> {8ded4083-2f31-489e-a60d-a315a5eb9b22=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity at 7765e4ad},
> log id: 35b04b33
> 2017-03-10 10:39:41,991+01 INFO
> [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback]
> (DefaultQuartzScheduler6) [67e1d8ed] Command 'RemoveSnapshot' (id:
> '13c2cb7c-0809-4971-aceb-37ae66105ab7') waiting on child command id:
> 'a8a3a4d5-cf7d-4423-8243-022911232508' type:'RemoveSnapshotSingleDiskLive'
> to complete
>
>
> gluster  ( nothing in logs )
> ======
>
>
> ## "etc-glusterfs-glusterd.vol.log"
> [2017-03-10 10:13:52.599019] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:16:48.639635] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> The message "I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1" repeated 3 times between
> [2017-03-10 10:16:48.639635] and [2017-03-10 10:17:55.659379]
> [2017-03-10 10:18:56.875516] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:19:57.204689] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:21:56.576879] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:21:57.772857] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:24:00.617931] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:30:04.918080] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:31:06.128638] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:32:07.325672] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:32:12.433586] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:32:13.544909] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:35:10.039213] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:37:19.905314] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:37:20.174209] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:38:12.635460] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
> [2017-03-10 10:40:14.169864] I [MSGID: 106499]
> [glusterd-handler.c:4349:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume slow1
>
>
> ## "rhev-data-center-mnt-glusterSD-localhost:_slow1.log"
> [2017-03-10 09:43:40.346785] W [MSGID: 101159] [inode.c:1214:__inode_unlink]
> 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/b6f2d08d-2441-4111-ab62-e14abdfaf602.61849:
> dentry not found in 43e6968f-9c2a-40d8-8074-caf1a36f60cf
> [2017-03-10 09:43:40.347076] W [MSGID: 101159] [inode.c:1214:__inode_unlink]
> 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/b6f2d08d-2441-4111-ab62-e14abdfaf602.61879:
> dentry not found in 902a6e3d-b7aa-439f-8262-cdc1b7f9f022
> [2017-03-10 09:43:40.347145] W [MSGID: 101159] [inode.c:1214:__inode_unlink]
> 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/b6f2d08d-2441-4111-ab62-e14abdfaf602.61935:
> dentry not found in 846bbcfc-f2b3-4ab6-af44-aeaa10b39318
> [2017-03-10 09:43:40.347211] W [MSGID: 101159] [inode.c:1214:__inode_unlink]
> 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/b6f2d08d-2441-4111-ab62-e14abdfaf602.61922:
> dentry not found in 66ad3bc5-26c7-4360-b33b-a084e3305cf8
> [2017-03-10 09:43:40.351571] W [MSGID: 101159] [inode.c:1214:__inode_unlink]
> 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/b6f2d08d-2441-4111-ab62-e14abdfaf602.61834:
> dentry not found in 3b8278e1-40e5-4363-b21e-7bffcd024c62
> [2017-03-10 09:43:40.352449] W [MSGID: 101159] [inode.c:1214:__inode_unlink]
> 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/b6f2d08d-2441-4111-ab62-e14abdfaf602.61870:
> dentry not found in 282f4c05-e09a-48e0-96a3-52e079ff2f73
> [2017-03-10 09:50:38.829325] I [MSGID: 109066]
> [dht-rename.c:1569:dht_rename] 0-slow1-dht: renaming
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/014ca3aa-d5f5-4b88-8f84-be8d4c5dfc1e/f147532a-89fa-49e0-8225-f82343fca8be.meta.new
> (hash=slow1-replicate-0/cache=slow1-replicate-0) =>
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/014ca3aa-d5f5-4b88-8f84-be8d4c5dfc1e/f147532a-89fa-49e0-8225-f82343fca8be.meta
> (hash=slow1-replicate-0/cache=slow1-replicate-0)
> [2017-03-10 09:50:42.221775] I [MSGID: 109066]
> [dht-rename.c:1569:dht_rename] 0-slow1-dht: renaming
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/4cf7dd90-9dcc-428c-82bc-fbf08dbee0be/12812d56-1606-4bf8-a391-0a2cacbd020b.meta.new
> (hash=slow1-replicate-0/cache=slow1-replicate-0) =>
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/4cf7dd90-9dcc-428c-82bc-fbf08dbee0be/12812d56-1606-4bf8-a391-0a2cacbd020b.meta
> (hash=slow1-replicate-0/cache=slow1-replicate-0)
> [2017-03-10 09:50:45.956432] I [MSGID: 109066]
> [dht-rename.c:1569:dht_rename] 0-slow1-dht: renaming
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/3cef54b4-45b9-4f5b-82c2-fcc8def06a37/85287865-38f0-45df-9e6c-1294913cbb88.meta.new
> (hash=slow1-replicate-0/cache=slow1-replicate-0) =>
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/3cef54b4-45b9-4f5b-82c2-fcc8def06a37/85287865-38f0-45df-9e6c-1294913cbb88.meta
> (hash=slow1-replicate-0/cache=slow1-replicate-0)
> [2017-03-10 09:50:40.349563] I [MSGID: 109066]
> [dht-rename.c:1569:dht_rename] 0-slow1-dht: renaming
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/014ca3aa-d5f5-4b88-8f84-be8d4c5dfc1e/f147532a-89fa-49e0-8225-f82343fca8be.meta.new
> (hash=slow1-replicate-0/cache=slow1-replicate-0) =>
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/014ca3aa-d5f5-4b88-8f84-be8d4c5dfc1e/f147532a-89fa-49e0-8225-f82343fca8be.meta
> (hash=slow1-replicate-0/cache=slow1-replicate-0)
> [2017-03-10 09:50:44.503866] I [MSGID: 109066]
> [dht-rename.c:1569:dht_rename] 0-slow1-dht: renaming
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/4cf7dd90-9dcc-428c-82bc-fbf08dbee0be/12812d56-1606-4bf8-a391-0a2cacbd020b.meta.new
> (hash=slow1-replicate-0/cache=slow1-replicate-0) =>
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/4cf7dd90-9dcc-428c-82bc-fbf08dbee0be/12812d56-1606-4bf8-a391-0a2cacbd020b.meta
> (hash=slow1-replicate-0/cache=slow1-replicate-0)
> [2017-03-10 09:59:46.860762] W [MSGID: 101159] [inode.c:1214:__inode_unlink]
> 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/6e105aa3-a3fc-4aca-be50-78b7642c4072.6684:
> dentry not found in d1e65eea-8758-4407-ac2e-3605dc661364
> [2017-03-10 10:02:22.500865] W [MSGID: 101159] [inode.c:1214:__inode_unlink]
> 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/6e105aa3-a3fc-4aca-be50-78b7642c4072.8767:
> dentry not found in e228bb28-9602-4f8e-8323-7434d77849fc
> [2017-03-10 10:04:03.103839] W [MSGID: 101159] [inode.c:1214:__inode_unlink]
> 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/6e105aa3-a3fc-4aca-be50-78b7642c4072.9787:
> dentry not found in 6be71632-aa36-4975-b673-1357e0355027
> [2017-03-10 10:06:02.406385] I [MSGID: 109066]
> [dht-rename.c:1569:dht_rename] 0-slow1-dht: renaming
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/2a9c1c6a-f045-4dce-a47b-95a2267eef72/6f264695-0669-4b49-a2f6-e6c92482f2fb.meta.new
> (hash=slow1-replicate-0/cache=slow1-replicate-0) =>
> /1603cd90-92ef-4c03-922c-cecb282fd00e/images/2a9c1c6a-f045-4dce-a47b-95a2267eef72/6f264695-0669-4b49-a2f6-e6c92482f2fb.meta
> (hash=slow1-replicate-0/cache=slow1-replicate-0)
> ... no other record
>
>
> messages
> ========
>
> the following occurred several times:
>
> Mar 10 09:04:38 2kvm2 lvmetad: WARNING: Ignoring unsupported value for cmd.
> Mar 10 09:04:38 2kvm2 lvmetad: WARNING: Ignoring unsupported value for cmd.
> Mar 10 09:04:38 2kvm2 lvmetad: WARNING: Ignoring unsupported value for cmd.
> Mar 10 09:04:38 2kvm2 lvmetad: WARNING: Ignoring unsupported value for cmd.
> Mar 10 09:10:01 2kvm2 systemd: Started Session 274 of user root.
> Mar 10 09:10:01 2kvm2 systemd: Starting Session 274 of user root.
> Mar 10 09:20:02 2kvm2 systemd: Started Session 275 of user root.
> Mar 10 09:20:02 2kvm2 systemd: Starting Session 275 of user root.
> Mar 10 09:22:59 2kvm2 sanlock[1673]: 2017-03-10 09:22:59+0100 136031 [2576]:
> s3 delta_renew long write time 11 sec

Sanlock cannot write to storage
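
A quick way to confirm this on the host is to look at sanlock's own view of
the lockspaces and at the renewal latency it logs (a hedged example; the
default log path on EL7 is /var/log/sanlock.log, adjust if yours differs):

    # show lockspaces and leases currently held by sanlock
    sanlock client status
    # look for slow or failed delta lease renewals on the gluster domain
    grep -E 'delta_renew|renewal error' /var/log/sanlock.log | tail -n 20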

> Mar 10 09:24:03 2kvm2 kernel: kswapd1: page allocation failure: order:2,
> mode:0x104020

Kernel cannot allocate page?
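
This is an order:2 (four contiguous pages) atomic allocation failing in the
bnx2x NIC receive path, which usually points to memory fragmentation under
pressure rather than a true out-of-memory condition. A hedged way to inspect
and mitigate it (the min_free_kbytes value below is only an example, not a
recommendation for your hosts):

    # free blocks per order; few entries in the higher-order columns
    # means memory is fragmented
    cat /proc/buddyinfo
    # give the kernel more headroom for atomic, higher-order allocations
    sysctl vm.min_free_kbytes
    sysctl -w vm.min_free_kbytes=1048576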

> Mar 10 09:24:03 2kvm2 kernel: CPU: 42 PID: 265 Comm: kswapd1 Tainted: G
> I    ------------   3.10.0-514.10.2.el7.x86_64 #1
> Mar 10 09:24:03 2kvm2 kernel: Hardware name: Supermicro X10DRC/X10DRi-LN4+,
> BIOS 1.0a 08/29/2014
> Mar 10 09:24:03 2kvm2 kernel: 0000000000104020 00000000f7228dc9
> ffff88301f4839d8 ffffffff816864ef
> Mar 10 09:24:03 2kvm2 kernel: ffff88301f483a68 ffffffff81186ba0
> 000068fc00000000 0000000000000000
> Mar 10 09:24:03 2kvm2 kernel: fffffffffffffffc 0010402000000000
> ffff88301567ae80 00000000f7228dc9
> Mar 10 09:24:03 2kvm2 kernel: Call Trace:
> Mar 10 09:24:03 2kvm2 kernel: <IRQ>  [<ffffffff816864ef>]
> dump_stack+0x19/0x1b
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81186ba0>]
> warn_alloc_failed+0x110/0x180
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81682083>]
> __alloc_pages_slowpath+0x6b7/0x725
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff8118b155>]
> __alloc_pages_nodemask+0x405/0x420
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff811cf30a>]
> alloc_pages_current+0xaa/0x170
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81185a7e>] __get_free_pages+0xe/0x50
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff811dabae>]
> kmalloc_order_trace+0x2e/0xa0
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff811dd381>] __kmalloc+0x221/0x240
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffffa02f83fa>]
> bnx2x_frag_alloc.isra.62+0x2a/0x40 [bnx2x]
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffffa02f92f7>] bnx2x_rx_int+0x227/0x17b0
> [bnx2x]
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81033669>] ? sched_clock+0x9/0x10
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffffa02fc72d>] bnx2x_poll+0x1dd/0x260
> [bnx2x]
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff815705e0>] net_rx_action+0x170/0x380
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff8108f2cf>] __do_softirq+0xef/0x280
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff8169859c>] call_softirq+0x1c/0x30
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff8102d365>] do_softirq+0x65/0xa0
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff8108f665>] irq_exit+0x115/0x120
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81699138>] do_IRQ+0x58/0xf0
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff8168e2ad>]
> common_interrupt+0x6d/0x6d
> Mar 10 09:24:03 2kvm2 kernel: <EOI>  [<ffffffff81189a73>] ?
> free_hot_cold_page+0x103/0x160
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81189b16>]
> free_hot_cold_page_list+0x46/0xa0
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81195193>]
> shrink_page_list+0x543/0xb00
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81195dda>]
> shrink_inactive_list+0x1fa/0x630
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81196975>] shrink_lruvec+0x385/0x770
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81196dd6>] shrink_zone+0x76/0x1a0
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff8119807c>] balance_pgdat+0x48c/0x5e0
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81198343>] kswapd+0x173/0x450
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff810b17d0>] ?
> wake_up_atomic_t+0x30/0x30
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff811981d0>] ?
> balance_pgdat+0x5e0/0x5e0
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff810b0630>] ?
> kthread_create_on_node+0x140/0x140
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90
> Mar 10 09:24:03 2kvm2 kernel: [<ffffffff810b0630>] ?
> kthread_create_on_node+0x140/0x140
> Mar 10 09:24:03 2kvm2 kernel: kswapd1: page allocation failure: order:2,
> mode:0x104020
> Mar 10 09:24:03 2kvm2 kernel: CPU: 42 PID: 265 Comm: kswapd1 Tainted: G
> I    ------------   3.10.0-514.10.2.el7.x86_64 #1
> Mar 10 09:24:03 2kvm2 kernel: Hardware name: Supermicro X10DRC/X10DRi-LN4+,
> BIOS 1.0a 08/29/2014
> Mar 10 09:24:03 2kvm2 kernel: 0000000000104020 00000000f7228dc9
> ffff88301f4839d8 ffffffff816864ef
>
>
> and again at the critical time:
>
> Mar 10 10:37:53 2kvm2 sanlock[1673]: 2017-03-10 10:37:53+0100 140524 [1673]:
> s3 check_our_lease warning 73 last_success 140451
> Mar 10 10:37:54 2kvm2 sanlock[1673]: 2017-03-10 10:37:54+0100 140525 [1673]:
> s3 check_our_lease warning 74 last_success 140451

Sanlock could not renew the lease for 74 seconds
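
The warning counter is simply the number of seconds since the last successful
renewal: at timestamp 140525 with last_success 140451 that is 140525 - 140451
= 74 seconds, and it keeps counting up below while writes to the storage
domain are stuck.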

> Mar 10 10:37:54 2kvm2 wdmd[1732]: test warning now 140526 ping 140516 close
> 0 renewal 140451 expire 140531 client 1673
> sanlock_1603cd90-92ef-4c03-922c-cecb282fd00e:1
> Mar 10 10:37:54 2kvm2 kernel: watchdog watchdog0: watchdog did not stop!
> Mar 10 10:37:54 2kvm2 wdmd[1732]: /dev/watchdog0 closed unclean
> Mar 10 10:37:55 2kvm2 sanlock[1673]: 2017-03-10 10:37:55+0100 140526 [1673]:
> s3 check_our_lease warning 75 last_success 140451
> Mar 10 10:37:55 2kvm2 wdmd[1732]: test warning now 140527 ping 140516 close
> 140526 renewal 140451 expire 140531 client 1673
> sanlock_1603cd90-92ef-4c03-922c-cecb282fd00e:1
> Mar 10 10:37:56 2kvm2 sanlock[1673]: 2017-03-10 10:37:56+0100 140527 [1673]:
> s3 check_our_lease warning 76 last_success 140451
> Mar 10 10:37:56 2kvm2 wdmd[1732]: test warning now 140528 ping 140516 close
> 140526 renewal 140451 expire 140531 client 1673
> sanlock_1603cd90-92ef-4c03-922c-cecb282fd00e:1
> Mar 10 10:37:57 2kvm2 sanlock[1673]: 2017-03-10 10:37:57+0100 140528 [1673]:
> s3 check_our_lease warning 77 last_success 140451
> Mar 10 10:37:57 2kvm2 wdmd[1732]: test warning now 140529 ping 140516 close
> 140526 renewal 140451 expire 140531 client 1673
> sanlock_1603cd90-92ef-4c03-922c-cecb282fd00e:1
> Mar 10 10:37:58 2kvm2 sanlock[1673]: 2017-03-10 10:37:58+0100 140529 [1673]:
> s3 check_our_lease warning 78 last_success 140451
> Mar 10 10:37:58 2kvm2 wdmd[1732]: test warning now 140530 ping 140516 close
> 140526 renewal 140451 expire 140531 client 1673
> sanlock_1603cd90-92ef-4c03-922c-cecb282fd00e:1
> Mar 10 10:37:59 2kvm2 sanlock[1673]: 2017-03-10 10:37:59+0100 140530 [1673]:
> s3 check_our_lease warning 79 last_success 140451
> Mar 10 10:37:59 2kvm2 wdmd[1732]: test failed rem 55 now 140531 ping 140516
> close 140526 renewal 140451 expire 140531 client 1673
> sanlock_1603cd90-92ef-4c03-922c-cecb282fd00e:1
> Mar 10 10:38:00 2kvm2 sanlock[1673]: 2017-03-10 10:38:00+0100 140531 [1673]:
> s3 check_our_lease failed 80

Sanlock failed to renew the lease after 80 seconds - game over.
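
The numbers in the log make this explicit: the lease expires at 140531 and the
last successful renewal was at 140451, i.e. 140531 - 140451 = 80 seconds.
Assuming sanlock's default io_timeout of 10 seconds, the host lease expires
after 8 * io_timeout = 80 seconds of failed renewals, and at that point
sanlock has to kill the processes holding leases on that storage (and wdmd is
already arming the watchdog, as you can see above).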

> Mar 10 10:38:00 2kvm2 sanlock[1673]: 2017-03-10 10:38:00+0100 140531 [1673]:
> s3 all pids clear

If this host is the SPM, sanlock just killed vdsm; this explains why
your storage operations fail.
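
To confirm this, check which host was the SPM at 10:38 (Engine UI or
engine.log) and whether vdsm on that host was killed or restarted around that
time, for example (hedged commands, run on the suspected SPM host):

    # did sanlock kill the lease owner?
    grep -E 'sanlock|wdmd' /var/log/messages | grep -iE 'kill|fail|expire'
    # did vdsmd restart around 10:38?
    journalctl -u vdsmd --since "2017-03-10 10:30" --until "2017-03-10 10:45"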

> Mar 10 10:38:01 2kvm2 wdmd[1732]: /dev/watchdog0 reopen
> Mar 10 10:38:10 2kvm2 journal: Cannot start job (query, none) for domain
> TEST-LBS_EBSAPP; current job is (query, none) owned by (3284
> remoteDispatchConnectGetAllDomainStats, 0 <null>) for (62s, 0s)
> Mar 10 10:38:10 2kvm2 journal: Timed out during operation: cannot acquire
> state change lock (held by remoteDispatchConnectGetAllDomainStats)
> Mar 10 10:38:11 2kvm2 journal: vdsm vds.dispatcher ERROR SSL error receiving
> from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 40590, 0, 0) at
> 0x3acdd88>: unexpected eof
> Mar 10 10:38:40 2kvm2 journal: Cannot start job (query, none) for domain
> TEST1-LBS_ATRYA; current job is (query, none) owned by (3288
> remoteDispatchConnectGetAllDomainStats, 0 <null>) for (47s, 0s)
> Mar 10 10:38:40 2kvm2 journal: Timed out during operation: cannot acquire
> state change lock (held by remoteDispatchConnectGetAllDomainStats)
> Mar 10 10:38:41 2kvm2 journal: vdsm vds.dispatcher ERROR SSL error receiving
> from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 40592, 0, 0) at
> 0x3fd5b90>: unexpected eof
> Mar 10 10:39:10 2kvm2 journal: Cannot start job (query, none) for domain
> TEST-LBS_EBSAPP; current job is (query, none) owned by (3284
> remoteDispatchConnectGetAllDomainStats, 0 <null>) for (122s, 0s)
> Mar 10 10:39:10 2kvm2 journal: Timed out during operation: cannot acquire
> state change lock (held by remoteDispatchConnectGetAllDomainStats)
> Mar 10 10:39:10 2kvm2 journal: Cannot start job (query, none) for domain
> TEST1-LBS_ATRYA; current job is (query, none) owned by (3288
> remoteDispatchConnectGetAllDomainStats, 0 <null>) for (77s, 0s)
> Mar 10 10:39:10 2kvm2 journal: Timed out during operation: cannot acquire
> state change lock (held by remoteDispatchConnectGetAllDomainStats)
> Mar 10 10:39:11 2kvm2 journal: vdsm vds.dispatcher ERROR SSL error receiving
> from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 40594, 0, 0) at
> 0x2447290>: unexpected eof
> Mar 10 10:39:23 2kvm2 sanlock[1673]: 2017-03-10 10:39:23+0100 140615 [2576]:
> s3 delta_renew write time 140 error -202
> Mar 10 10:39:23 2kvm2 sanlock[1673]: 2017-03-10 10:39:23+0100 140615 [2576]:
> s3 renewal error -202 delta_length 144 last_success 140451
> Mar 10 10:39:40 2kvm2 journal: Cannot start job (query, none) for domain
> TEST-LBS_EBSAPP; current job is (query, none) owned by (3284
> remoteDispatchConnectGetAllDomainStats, 0 <null>) for (152s, 0s)
> Mar 10 10:39:40 2kvm2 journal: Timed out during operation: cannot acquire
> state change lock (held by remoteDispatchConnectGetAllDomainStats)
> Mar 10 10:39:40 2kvm2 journal: Cannot start job (query, none) for domain
> TEST1-LBS_ATRYA; current job is (query, none) owned by (3288
> remoteDispatchConnectGetAllDomainStats, 0 <null>) for (107s, 0s)
> Mar 10 10:39:40 2kvm2 journal: Timed out during operation: cannot acquire
> state change lock (held by remoteDispatchConnectGetAllDomainStats)
> Mar 10 10:39:41 2kvm2 journal: vdsm vds.dispatcher ERROR SSL error receiving
> from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 40596, 0, 0) at
> 0x2472ef0>: unexpected eof
> Mar 10 10:39:49 2kvm2 kernel: INFO: task qemu-img:42107 blocked for more
> than 120 seconds.

qemu-img is blocked for more than 120 seconds.
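
The quoted stack shows qemu-img stuck in uninterruptible sleep inside a direct
write on the fuse (gluster) mount, so the hang is under the gluster client,
not in qemu-img itself. If it happens again, dumping all blocked tasks shows
how many processes are stuck on the same mount (hedged example; sysrq must be
enabled via kernel.sysrq):

    # dump all blocked (D state) tasks to the kernel log
    echo w > /proc/sysrq-trigger
    # or list processes currently in uninterruptible sleep
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'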

> Mar 10 10:39:49 2kvm2 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar 10 10:39:49 2kvm2 kernel: qemu-img        D ffff88010dad3e30     0 42107
> 3631 0x00000080
> Mar 10 10:39:49 2kvm2 kernel: ffff88010dad3b30 0000000000000082
> ffff8814491f4e70 ffff88010dad3fd8
> Mar 10 10:39:49 2kvm2 kernel: ffff88010dad3fd8 ffff88010dad3fd8
> ffff8814491f4e70 ffff88301f096c40
> Mar 10 10:39:49 2kvm2 kernel: 0000000000000000 7fffffffffffffff
> ffff88181f186c00 ffff88010dad3e30
> Mar 10 10:39:49 2kvm2 kernel: Call Trace:
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff8168bbb9>] schedule+0x29/0x70
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff81689609>]
> schedule_timeout+0x239/0x2d0
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff8168b15e>]
> io_schedule_timeout+0xae/0x130
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff8168b1f8>] io_schedule+0x18/0x20
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff8124d9e5>]
> wait_on_sync_kiocb+0x35/0x80
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffffa0a36091>]
> fuse_direct_IO+0x231/0x380 [fuse]
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff812a6ddd>] ?
> cap_inode_need_killpriv+0x2d/0x40
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff812a8cb6>] ?
> security_inode_need_killpriv+0x16/0x20
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff81219e3f>] ?
> dentry_needs_remove_privs.part.13+0x1f/0x30
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff81182a2d>]
> generic_file_direct_write+0xcd/0x190
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffffa0a36905>]
> fuse_file_aio_write+0x185/0x340 [fuse]
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff811fdabd>] do_sync_write+0x8d/0xd0
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff811fe32d>] vfs_write+0xbd/0x1e0
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff811ff002>] SyS_pwrite64+0x92/0xc0
> Mar 10 10:39:49 2kvm2 kernel: [<ffffffff81696b09>]
> system_call_fastpath+0x16/0x1b
> Mar 10 10:39:49 2kvm2 kernel: INFO: task qemu-img:42111 blocked for more
> than 120 seconds.
> Mar 10 10:39:49 2kvm2 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar 10 10:39:49 2kvm2 kernel: qemu-img        D ffff8818a76e7e30     0 42111
> 3632 0x00000080
> Mar 10 10:39:49 2kvm2 kernel: ffff8818a76e7b30 0000000000000082
> ffff88188aaeaf10 ffff8818a76e7fd8
> Mar 10 10:39:49 2kvm2 kernel: ffff8818a76e7fd8 ffff8818a76e7fd8
> ffff88188aaeaf10 ffff88301f156c40
>
> memory
> =======
>
> # cat /proc/meminfo
> MemTotal:       197983472 kB
> MemFree:          834228 kB
> MemAvailable:   165541204 kB
> Buffers:           45548 kB
> Cached:         159596272 kB
> SwapCached:       119872 kB
> Active:         40803264 kB
> Inactive:       148022076 kB
> Active(anon):   26594112 kB
> Inactive(anon):  2626384 kB
> Active(file):   14209152 kB
> Inactive(file): 145395692 kB
> Unevictable:       50488 kB
> Mlocked:           50488 kB
> SwapTotal:       4194300 kB
> SwapFree:        3612188 kB
> Dirty:               624 kB
> Writeback:             0 kB
> AnonPages:      29185032 kB
> Mapped:            85176 kB
> Shmem:             25908 kB
> Slab:            6203384 kB
> SReclaimable:    5857240 kB
> SUnreclaim:       346144 kB
> KernelStack:       19184 kB
> PageTables:        86100 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    103186036 kB
> Committed_AS:   52300288 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:     1560580 kB
> VmallocChunk:   34257341440 kB
> HardwareCorrupted:     0 kB
> AnonHugePages:   5566464 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:      431292 kB
> DirectMap2M:    19382272 kB
> DirectMap1G:    183500800 kB
>
>
> Can anybody help me with this?
> I've got a small hint about a swap problem ( in messages ), but I'm not sure.
> A similar problem occurred in older gluster/ovirt versions during testing ( a
> freeze under huge workload - but not a fatal overload )

You have a storage issue; you should understand why
your storage is failing.

There is also a kernel failure to allocate pages; maybe this is
related to the storage failure.
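
A reasonable next step is to profile the gluster volume while reproducing the
load, to see whether the bricks or the fuse client are the bottleneck (volume
name taken from your logs; this is just a sketch):

    gluster volume profile slow1 start
    # ... reproduce the workload ...
    gluster volume profile slow1 info
    gluster volume profile slow1 stop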

Nir

