[ovirt-users] USER_CREATE_SNAPSHOT_FINISHED_FAILURE with Cinder storage stuck
Maor Lipchuk
mlipchuk at redhat.com
Wed Aug 23 07:49:43 UTC 2017
On Tue, Aug 22, 2017 at 1:08 PM, Matthias Leopold
<matthias.leopold at meduniwien.ac.at> wrote:
>
>
> Am 2017-08-22 um 09:33 schrieb Maor Lipchuk:
>>
>> On Mon, Aug 21, 2017 at 6:12 PM, Matthias Leopold
>> <matthias.leopold at meduniwien.ac.at> wrote:
>>>
>>> Hi,
>>>
>>> we're experimenting with Cinder/Ceph Storage on oVirt 4.1.3. When we
>>> tried
>>> to snapshot a VM (2 disks on Cinder storage domain) the task never
>>> finished
>>> and now seems to be in an uninterruptible loop. We tried to stop it in
>>> various (brute force) ways, but the below messages (one of the disks as
>>> an
>>> example) are cluttering engine.log every 10 seconds. We tried the
>>> following:
>>>
>>> - deleting the VM
>>> - restarting ovirt-engine service
>>> - vdsClient -s 0 getAllTasksStatuses on SPM host (no result)
>>> - restarting vdsmd service on SPM host
>>> - /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -u engine -d
>>> engine
>>> -c c841c979-70ea-4e06-b9c4-9c5ce014d76d
>>>
>>> None of this helped. How do we get rid of this failed transaction?
>>>
>>> thx
>>> matthias
>>>
>>> 2017-08-21 16:40:44,798+02 INFO
>>> [org.ovirt.engine.core.utils.transaction.TransactionSupport]
>>> (DefaultQuartzScheduler7) [080af640-bac3-4990-8bf4-6829551b538d]
>>> transaction
>>> rolled back
>>> 2017-08-21 16:40:44,799+02 ERROR
>>> [org.ovirt.engine.core.bll.job.ExecutionHandler]
>>> (DefaultQuartzScheduler7)
>>> [080af640-bac3-4990-8bf4-6829551b538d] Exception:
>>> org.springframework.dao.DataIntegrityViolationException:
>>> CallableStatementCallback; SQL [{call insertstep(?, ?, ?, ?, ?, ?, ?, ?,
>>> ?,
>>> ?, ?, ?, ?, ?)}]; ERROR: insert or update on table "step" violates
>>> foreign
>>> key constraint "fk_step_job"
>>> 2017-08-21 16:40:44,805+02 ERROR
>>> [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand]
>>> (DefaultQuartzScheduler7) [080af640-bac3-4990-8bf4-6829551b538d] Ending
>>> command
>>> 'org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand'
>>> with
>>> failure.
>>> 2017-08-21 16:40:44,807+02 WARN
>>> [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand]
>>> (DefaultQuartzScheduler7) [080af640-bac3-4990-8bf4-6829551b538d] No
>>> snapshot
>>> was created for VM 'c0235316-81c4-48be-9521-b86b338c7d20' which is in
>>> LOCKED
>>> status
>>> 2017-08-21 16:40:44,810+02 INFO
>>> [org.ovirt.engine.core.utils.transaction.TransactionSupport]
>>> (DefaultQuartzScheduler7) [080af640-bac3-4990-8bf4-6829551b538d]
>>> transaction
>>> rolled back
>>> 2017-08-21 16:40:44,810+02 WARN
>>> [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>>> (DefaultQuartzScheduler7) [080af640-bac3-4990-8bf4-6829551b538d] Trying
>>> to
>>> release exclusive lock which does not exist, lock key:
>>> 'c0235316-81c4-48be-9521-b86b338c7d20VM'
>>> 2017-08-21 16:40:44,810+02 INFO
>>> [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand]
>>> (DefaultQuartzScheduler7) [080af640-bac3-4990-8bf4-6829551b538d] Lock
>>> freed
>>> to object
>>> 'EngineLock:{exclusiveLocks='[c0235316-81c4-48be-9521-b86b338c7d20=VM]',
>>> sharedLocks=''}'
>>> 2017-08-21 16:40:44,829+02 ERROR
>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> (DefaultQuartzScheduler7) [080af640-bac3-4990-8bf4-6829551b538d]
>>> EVENT_ID:
>>> USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Correlation ID:
>>> 080af640-bac3-4990-8bf4-6829551b538d, Job ID:
>>> a3be8af1-8d33-4d35-9672-215ac7c9959f, Call Stack: null, Custom Event ID:
>>> -1,
>>> Message: Failed to complete snapshot 'test' creation for VM '<UNKNOWN>'.
>>> 2017-08-21 16:40:44,829+02 ERROR
>>> [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller]
>>> (DefaultQuartzScheduler7) [080af640-bac3-4990-8bf4-6829551b538d] Failed
>>> invoking callback end method 'onFailed' for command
>>> 'c841c979-70ea-4e06-b9c4-9c5ce014d76d' with exception 'null', the
>>> callback
>>> is marked for end method retries
>>>
>>>
>
>>
>>
>> Hi Matthias,
>>
>> Can you please attach the full engine log contains the first error
>> occurred so we can trace its origin and fix it?
>> Does it reproduced constantly?
>>
>> The engine does not use VDSM tasks to manage Cinder, the engine use
>> Cinder as an external provider using the COCO infrastructure for async
>> tasks.
>> The COCO tasks are managed in the database using the command_entities
>> table, basically if you will remove all references of the command id
>> from the command_entities and restart engine you should not see it any
>> more.
>>
>> Regards,
>> Maor
>>
>
> Hi Maor,
>
> thanks very much for replying. First i tried to clean the command_entities
> table plus restarting engine as you suggested. This didn't work entirely,
> these two entries
Can you please try first to stop the engine and only then clean the
command_entities.
>
> engine=# select command_id, command_type, root_command_id, status from
> command_entities;
> command_id | command_type | root_command_id
> | status
> --------------------------------------+--------------+--------------------------------------+--------
> c841c979-70ea-4e06-b9c4-9c5ce014d76d | 206 |
> c841c979-70ea-4e06-b9c4-9c5ce014d76d | FAILED
> 65fa094e-1609-47ea-bf0d-611e3d5b9358 | 206 |
> 65fa094e-1609-47ea-bf0d-611e3d5b9358 | FAILED
>
>
> keep appearing and still cause messages in engine.log like
>
> 2017-08-22 11:54:57,109+02 WARN
> [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand]
> (DefaultQuartzScheduler8) [080af640-bac3-4990-8bf4-6829551b538d] No snapshot
> was created for VM 'c0235316-81c4-48be-9521-b86b338c7d20' which is in LOCKED
> status
> 2017-08-22 11:54:57,112+02 INFO
> [org.ovirt.engine.core.utils.transaction.TransactionSupport]
> (DefaultQuartzScheduler8) [080af640-bac3-4990-8bf4-6829551b538d] transaction
> rolled back
> 2017-08-22 11:54:57,112+02 WARN
> [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> (DefaultQuartzScheduler8) [080af640-bac3-4990-8bf4-6829551b538d] Trying to
> release exclusive lock which does not exist, lock key:
> 'c0235316-81c4-48be-9521-b86b338c7d20VM'
> 2017-08-22 11:54:57,112+02 INFO
> [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand]
> (DefaultQuartzScheduler8) [080af640-bac3-4990-8bf4-6829551b538d] Lock freed
> to object
> 'EngineLock:{exclusiveLocks='[c0235316-81c4-48be-9521-b86b338c7d20=VM]',
> sharedLocks=''}'
> 2017-08-22 11:54:57,131+02 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler8) [080af640-bac3-4990-8bf4-6829551b538d] EVENT_ID:
> USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Correlation ID:
> 080af640-bac3-4990-8bf4-6829551b538d, Call Stack: null, Custom Event ID: -1,
> Message: Failed to complete snapshot 'test' creation for VM '<UNKNOWN>'.
> 2017-08-22 11:54:57,132+02 ERROR
> [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller]
> (DefaultQuartzScheduler8) [080af640-bac3-4990-8bf4-6829551b538d] Failed
> invoking callback end method 'onFailed' for command
> 'c841c979-70ea-4e06-b9c4-9c5ce014d76d' with exception 'null', the callback
> is marked for end method retries
> 2017-08-22 11:54:57,138+02 ERROR
> [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand]
> (DefaultQuartzScheduler8) [54caec87-ba59-4e9f-acf3-7fb5d94658e4] Ending
> command
> 'org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand' with
> failure.
>
> The SQL error messages like
>
> ERROR: insert or update on table "step" violates foreign key constraint
> "fk_step_job"
>
> are gone now. Please give me a hint how to get rid of these tasks.
>
> To trace the original error i'm attaching a minute of engine logs when the
> error first occurred (i hope this works when sending to the list). We have
> not tried to reproduce it yet, but we are planning to test cinder/ceph
> storage more extensively, so we'll be looking at this again. At first i
> would feel a lot better when i could clean up the last error completely.
>
> Thanks a lot again
> Best Regards
> Matthias
Thanks for the logs.
If you can add the full engine log maybe I can also look at the reason
of the consistent failing.
More information about the Users
mailing list