[ovirt-users] VMs stuck in migrating state
nicolas at devels.es
Mon Mar 5 08:43:32 UTC 2018
On 2018-03-02 15:34, Milan Zamazal wrote:
> nicolas at devels.es writes:
>
>> On 2018-03-02 14:10, Milan Zamazal wrote:
>>> nicolas at devels.es writes:
>>>
>>>> We're running 4.1.9 and during the weekend we had a storage issue
>>>> that seemed to leave some hosts in a strange state. One of the hosts
>>>> has almost all of its VMs migrating (although it doesn't actually
>>>> seem to be migrating them) and the migration cannot be cancelled.
>>>>
>>>> When clicking on one of those machines and selecting 'Cancel
>>>> migration', I see the following in the ovirt-engine log:
>>>>
>>>> 2018-02-26 08:52:07,588Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (org.ovirt.thread.pool-6-thread-36) [887dfbf9-dece-4f7b-90a8-dac02b849b7f] HostName = host2.domain.com
>>>> 2018-02-26 08:52:07,588Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (org.ovirt.thread.pool-6-thread-36) [887dfbf9-dece-4f7b-90a8-dac02b849b7f] Command 'CancelMigrateVDSCommand(HostName = host2.domain.com, CancelMigrationVDSParameters:{runAsync='true', hostId='e63b9146-10c4-47ad-bd6c-f053a8c5b4eb', vmId='26d37e43-32e2-4e55-9c1e-1438518d5021'})' execution failed: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Migration process cancelled, code = 82
>>>>
>>>> On the vdsm side I see:
>>>>
>>>> 2018-02-26 08:56:19,396+0000 INFO (jsonrpc/0) [vdsm.api] START migrateCancel() from=::ffff:10.X.X.X,54654, flow_id=874d36d7-63f5-4b71-8a4d-6d9f3ec65858 (api:46)
>>>> 2018-02-26 08:56:19,398+0000 INFO (jsonrpc/0) [vdsm.api] FINISH migrateCancel return={'status': {'message': 'Migration process cancelled', 'code': 82}, 'progress': 0} from=::ffff:10.X.X.X,54654, flow_id=874d36d7-63f5-4b71-8a4d-6d9f3ec65858 (api:52)
>>>>
>>>> So there's no error in the vdsm log.
>>>
>>> Interesting. The messages above indicate that an attempt was made to
>>> migrate the VM, but the migration was temporarily rejected on the
>>> destination due to the number of already running incoming migrations
>>> (the limit is 2 incoming migrations by default). Later, Vdsm was
>>> asked to cancel the outgoing migration and it successfully set a
>>> migration canceling flag. However, the action was reported to Engine
>>> as an error, due to hitting the incoming migration limit on the
>>> destination. That may be a bug (I'm not sure) and results in minor
>>> confusion. Normally it shouldn't matter: the migration should be
>>> canceled shortly afterwards anyway and Engine should be informed
>>> about that.
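>>>
>>> Roughly, the canceling flag behaves like this (a simplified sketch,
>>> not Vdsm's actual code; all names below are made up for
>>> illustration):
>>>
>>>   import threading
>>>   import time
>>>
>>>   class MigrationSourceThread(threading.Thread):
>>>       def __init__(self, destination_free_slots=0):
>>>           super().__init__()
>>>           self._canceled = threading.Event()
>>>           self._slots = destination_free_slots
>>>
>>>       def cancel(self):
>>>           # migrateCancel sets the flag; this "succeeds" even when
>>>           # the previous attempt was just rejected by the
>>>           # destination, which is why Engine can see an error while
>>>           # the cancel itself still takes effect.
>>>           self._canceled.set()
>>>
>>>       def run(self):
>>>           # The destination rejects new migrations while its
>>>           # incoming limit (2 by default) is exhausted, so the
>>>           # source keeps retrying.
>>>           while not self._canceled.is_set():
>>>               if self._slots > 0:
>>>                   print("migration starts")
>>>                   return
>>>               time.sleep(1)  # retry later
>>>           print("migration canceled before it ever started")
>>>
>>>   t = MigrationSourceThread()  # destination has no free slots
>>>   t.start()
>>>   t.cancel()
>>>   t.join()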
>>>
>>> However, the migration apparently wasn't canceled here. I can't say
>>> what happened without the complete Vdsm log. One possible reason is
>>> that the migration had been waiting on completion of another
>>> migration outgoing from the source (only one outgoing migration at a
>>> time is allowed by default). In any case it seems the migration
>>> either wasn't actually started at all, or it just started being set
>>> up and the setup never finished.
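>>>
>>> Both limits behave like counted slots; schematically (again with
>>> made-up names, not the real implementation):
>>>
>>>   import threading
>>>
>>>   OUTGOING_LIMIT = threading.Semaphore(1)  # per source host
>>>   INCOMING_LIMIT = threading.Semaphore(2)  # per destination host
>>>
>>>   def try_start_migration():
>>>       # A migration queued behind these semaphores already looks
>>>       # "migrating" to Engine although it is still waiting for a
>>>       # slot.
>>>       if not OUTGOING_LIMIT.acquire(blocking=False):
>>>           return "queued: another outgoing migration is running"
>>>       if not INCOMING_LIMIT.acquire(blocking=False):
>>>           OUTGOING_LIMIT.release()
>>>           return "rejected: destination incoming limit reached"
>>>       return "migration may proceed"
>>>
>>>   print(try_start_migration())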
>>>
>>
>> I'm attaching the log. Basically, the storage backend was restarted
>> by fencing and then this issue happened. This was on 26/02 at about
>> 08:52 (log time).
>
> Thank you for the log, but VMs are already “migrating” at its
> beginning; there must have been some problem earlier.
>
>>>> I already tried restarting ovirt-engine but it didn't work.
>>>
>>> Here the problem is clearly on the Vdsm side.
>>>
>>>> Could someone shed some light on how to cancel the migration status
>>>> for these machines? All of them seem to be running on the same host.
>>>
>>> Did the VMs get unblocked in the meantime? I can't know the
>>
>> No, they didn't. They're still in a "Migrating" state.
>>
>>> actual state of the given VMs without seeing the complete Vdsm log,
>>> so it's difficult to give good advice. I think that a Vdsm restart on
>>> the given host would help, BUT it's generally not a very good idea to
>>> restart Vdsm if any real migration, outgoing or incoming, is running
>>> on the host. VMs that aren't actually being migrated at all (despite
>>> being reported as migrating) should simply return to Up state after
>>> the restart, but VMs with a real migration action pending might
>>> return to Up state without proper cleanup, resulting in a different
>>> kind of mess or maybe something even worse (things should improve in
>>> oVirt 4.2, but it's still good to avoid Vdsm restarts with migrations
>>> running).
>>>
>>
>> I assume this is not a real migration, as it has been in this state
>> for several days. Would you advise restarting vdsm in this case,
>> then?
>
> I'd say try it. Since nothing has changed for several days, restarting
> Vdsm looks like the appropriate action at this point. Just don't make
> a habit of it :-).
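>
> If you want to double-check first that no real migration is in flight
> on that host, something like this sketch with the libvirt Python
> bindings should do (run it on the affected host; note that on oVirt
> hosts libvirt access is usually SASL-protected, so the connection may
> need credentials):
>
>   import libvirt
>
>   conn = libvirt.open('qemu:///system')
>   for dom in conn.listAllDomains():
>       if not dom.isActive():
>           continue
>       # jobInfo()[0] is the job type; VIR_DOMAIN_JOB_NONE means no
>       # migration (or other job) is really running for this VM.
>       if dom.jobInfo()[0] != libvirt.VIR_DOMAIN_JOB_NONE:
>           print('%s has an active job, do not restart vdsmd yet'
>                 % dom.name())
>   conn.close()
>
> If nothing shows up, restarting Vdsm ("systemctl restart vdsmd")
> should just return the stuck VMs to Up state.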
>
Thanks, that did it.
Regards.
> Regards,
> Milan