[ovirt-users] VMs stuck in migrating state
nicolas at devels.es
Mon Mar 5 08:43:32 UTC 2018
On 2018-03-02 15:34, Milan Zamazal wrote:
> nicolas at devels.es writes:
>
>> On 2018-03-02 14:10, Milan Zamazal wrote:
>>> nicolas at devels.es writes:
>>>
>>>> We're running 4.1.9 and during the weekend we had a storage issue
>>>> that seemed to leave some hosts in a strange state. One of the hosts
>>>> has almost all of its VMs migrating (although it doesn't actually
>>>> seem to be migrating them) and the migration cannot be cancelled.
>>>>
>>>> When clicking on one of those machines and selecting 'Cancel
>>>> migration', I see the following in the ovirt-engine log:
>>>>
>>>> 2018-02-26 08:52:07,588Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (org.ovirt.thread.pool-6-thread-36) [887dfbf9-dece-4f7b-90a8-dac02b849b7f] HostName = host2.domain.com
>>>> 2018-02-26 08:52:07,588Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (org.ovirt.thread.pool-6-thread-36) [887dfbf9-dece-4f7b-90a8-dac02b849b7f] Command 'CancelMigrateVDSCommand(HostName = host2.domain.com, CancelMigrationVDSParameters:{runAsync='true', hostId='e63b9146-10c4-47ad-bd6c-f053a8c5b4eb', vmId='26d37e43-32e2-4e55-9c1e-1438518d5021'})' execution failed: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Migration process cancelled, code = 82
>>>>
>>>> On the vdsm side I see:
>>>>
>>>> 2018-02-26 08:56:19,396+0000 INFO (jsonrpc/0) [vdsm.api] START migrateCancel() from=::ffff:10.X.X.X,54654, flow_id=874d36d7-63f5-4b71-8a4d-6d9f3ec65858 (api:46)
>>>> 2018-02-26 08:56:19,398+0000 INFO (jsonrpc/0) [vdsm.api] FINISH migrateCancel return={'status': {'message': 'Migration process cancelled', 'code': 82}, 'progress': 0} from=::ffff:10.X.X.X,54654, flow_id=874d36d7-63f5-4b71-8a4d-6d9f3ec65858 (api:52)
>>>>
>>>> So there's no error in the vdsm log.
>>>
>>> Interesting. The messages above indicate that an attempt was made to
>>> migrate the VM, but the migration was temporarily rejected on the
>>> destination due to the number of already running incoming migrations
>>> (the limit is 2 incoming migrations by default). Later, Vdsm was
>>> asked to cancel the outgoing migration and it successfully set a
>>> migration canceling flag. However, the action was reported to Engine
>>> as an error, due to hitting the incoming migration limit on the
>>> destination. That may be a bug (I'm not sure) and results in minor
>>> confusion. Normally it shouldn't matter: the migration should be
>>> canceled shortly afterwards anyway and Engine should be informed
>>> about that.
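>>>
>>> Roughly, the canceling flag behaves like this (a simplified sketch,
>>> not Vdsm's actual code; all names below are made up for
>>> illustration):
>>>
>>>   import threading
>>>   import time
>>>
>>>   class MigrationSourceThread(threading.Thread):
>>>       def __init__(self, destination_free_slots=0):
>>>           super().__init__()
>>>           self._canceled = threading.Event()
>>>           self._slots = destination_free_slots
>>>
>>>       def cancel(self):
>>>           # migrateCancel sets the flag; this "succeeds" even when
>>>           # the previous attempt was just rejected by the
>>>           # destination, which is why Engine can see an error while
>>>           # the cancel itself still takes effect.
>>>           self._canceled.set()
>>>
>>>       def run(self):
>>>           # The destination rejects new migrations while its
>>>           # incoming limit (2 by default) is exhausted, so the
>>>           # source keeps retrying.
>>>           while not self._canceled.is_set():
>>>               if self._slots > 0:
>>>                   print("migration starts")
>>>                   return
>>>               time.sleep(1)  # retry later
>>>           print("migration canceled before it ever started")
>>>
>>>   t = MigrationSourceThread()  # destination has no free slots
>>>   t.start()
>>>   t.cancel()
>>>   t.join()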
>>>
>>> However, the migration apparently wasn't canceled here. I can't say
>>> what happened without the complete Vdsm log. One possible reason is
>>> that the migration had been waiting on completion of another
>>> migration outgoing from the source (only one outgoing migration at a
>>> time is allowed by default). In any case it seems the migration
>>> either wasn't actually started at all, or it just started being set
>>> up and the setup never finished.
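>>>
>>> Both limits behave like counted slots; schematically (again with
>>> made-up names, not the real implementation):
>>>
>>>   import threading
>>>
>>>   OUTGOING_LIMIT = threading.Semaphore(1)  # per source host
>>>   INCOMING_LIMIT = threading.Semaphore(2)  # per destination host
>>>
>>>   def try_start_migration():
>>>       # A migration queued behind these semaphores already looks
>>>       # "migrating" to Engine although it is still waiting for a
>>>       # slot.
>>>       if not OUTGOING_LIMIT.acquire(blocking=False):
>>>           return "queued: another outgoing migration is running"
>>>       if not INCOMING_LIMIT.acquire(blocking=False):
>>>           OUTGOING_LIMIT.release()
>>>           return "rejected: destination incoming limit reached"
>>>       return "migration may proceed"
>>>
>>>   print(try_start_migration())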
>>>
>>
>> I'm attaching the log. Basically, the storage backend was restarted
>> by fencing and then this issue happened. This was on 26/02 at about
>> 08:52 (log time).
>
> Thank you for the log, but VMs are already “migrating” at its
> beginning; there must have been some problem earlier.
>
>>>> I already tried restarting ovirt-engine but it didn't work.
>>>
>>> Here the problem is clearly on the Vdsm side.
>>>
>>>> Could someone shed some light on how to cancel the migration status
>>>> for these machines? All of them seem to be running on the same host.
>>>
>>> Did the VMs get unblocked in the meantime? I can't know the
>>
>> No, they didn't. They're still in a "Migrating" state.
>>
>>> actual state of the given VMs without seeing the complete Vdsm log,
>>> so it's difficult to give good advice. I think that a Vdsm restart on
>>> the given host would help, BUT it's generally not a very good idea to
>>> restart Vdsm if any real migration, outgoing or incoming, is running
>>> on the host. VMs that aren't actually being migrated at all (despite
>>> being reported as migrating) should simply return to Up state after
>>> the restart, but VMs with a real migration action pending might
>>> return to Up state without proper cleanup, resulting in a different
>>> kind of mess or maybe something even worse (things should improve in
>>> oVirt 4.2, but it's still good to avoid Vdsm restarts with migrations
>>> running).
>>>
>>
>> I assume this is not a real migration, as it has been in this state
>> for several days. Would you advise restarting vdsm in this case,
>> then?
>
> I'd say try it. Since nothing has changed for several days, restarting
> Vdsm looks like the appropriate action at this point. Just don't make
> a habit of it :-).
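>
> If you want to double-check first that no real migration is in flight
> on that host, something like this sketch with the libvirt Python
> bindings should do (run it on the affected host; note that on oVirt
> hosts libvirt access is usually SASL-protected, so the connection may
> need credentials):
>
>   import libvirt
>
>   conn = libvirt.open('qemu:///system')
>   for dom in conn.listAllDomains():
>       if not dom.isActive():
>           continue
>       # jobInfo()[0] is the job type; VIR_DOMAIN_JOB_NONE means no
>       # migration (or other job) is really running for this VM.
>       if dom.jobInfo()[0] != libvirt.VIR_DOMAIN_JOB_NONE:
>           print('%s has an active job, do not restart vdsmd yet'
>                 % dom.name())
>   conn.close()
>
> If nothing shows up, restarting Vdsm ("systemctl restart vdsmd")
> should just return the stuck VMs to Up state.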
>
Thanks, that did it.
Regards.
> Regards,
> Milan