[ovirt-users] VMs stuck in migrating state

Fri Mar 2 14:25:17 UTC 2018

Hi Milan,

On 2018-03-02 14:10, Milan Zamazal wrote:
> nicolas at devels.es writes:
> 
>> We're running 4.1.9 and during the weekend we had a storage issue that
>> seemed to leave some hosts in a strange state. One of the hosts has
>> almost all of its VMs migrating (although it does not seem to actually
>> be migrating them) and the migration state cannot be cancelled.
>> 
>> When clicking on one of those machines and selecting 'Cancel
>> migration', in the ovirt-engine log I see:
>> 
>> 2018-02-26 08:52:07,588Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (org.ovirt.thread.pool-6-thread-36) [887dfbf9-dece-4f7b-90a8-dac02b849b7f] HostName = host2.domain.com
>> 2018-02-26 08:52:07,588Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand] (org.ovirt.thread.pool-6-thread-36) [887dfbf9-dece-4f7b-90a8-dac02b849b7f] Command 'CancelMigrateVDSCommand(HostName = host2.domain.com, CancelMigrationVDSParameters:{runAsync='true', hostId='e63b9146-10c4-47ad-bd6c-f053a8c5b4eb', vmId='26d37e43-32e2-4e55-9c1e-1438518d5021'})' execution failed: VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS, error = Migration process cancelled, code = 82
>> 
>> On the vdsm side I see:
>> 
>> 2018-02-26 08:56:19,396+0000 INFO  (jsonrpc/0) [vdsm.api] START migrateCancel() from=::ffff:10.X.X.X,54654, flow_id=874d36d7-63f5-4b71-8a4d-6d9f3ec65858 (api:46)
>> 2018-02-26 08:56:19,398+0000 INFO  (jsonrpc/0) [vdsm.api] FINISH migrateCancel return={'status': {'message': 'Migration process cancelled', 'code': 82}, 'progress': 0} from=::ffff:10.X.X.X,54654, flow_id=874d36d7-63f5-4b71-8a4d-6d9f3ec65858 (api:52)
>> 
>> So there is no error in the log on the vdsm side.
> 
> Interesting.  The messages above indicate that a migration of the VM was
> attempted, but it got temporarily rejected on the destination due to the
> number of already running incoming migrations (the limit is 2 incoming
> migrations by default).  Later, Vdsm was asked to cancel the outgoing
> migration and it successfully set a migration canceling flag.  However,
> the action was reported as an error to Engine, due to hitting the
> incoming migration limit on the destination.  Maybe it's a bug, I'm not
> sure, resulting in minor confusion.  Normally it shouldn't matter, the
> migration should be canceled shortly after anyway and Engine should be
> informed about that.
> 
> However the migration apparently wasn't canceled here.  I can't say what
> happened without the complete Vdsm log.  One possible reason is that the
> migration has been waiting for the completion of another migration
> outgoing from the source (only one outgoing migration at a time is
> allowed by default).  In any case it seems the migration either wasn't
> actually started at all or it just started being set up and the setup
> never completely finished.
> 

I'm attaching the log. Basically the storage backend was restarted by 
fencing and then this issue happened. This was on 26/02 at about 08:52 
(log time).
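
In case it helps anyone going through a log like this one, here is a
minimal Python sketch for pulling the migrateCancel results out of
vdsm.log. It assumes that each record sits on a single line in the log
file itself (the mail client wrapped the lines quoted above) and that the
records look like the FINISH line quoted above. The function name and the
keys of the returned dict are made up for illustration, they are not part
of Vdsm.

```python
import ast
import re

# Assumed layout of a "FINISH migrateCancel" record, based only on the
# line quoted in this thread; other Vdsm versions may format it differently.
FINISH_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \w+\s+\(\S+\) \[vdsm\.api\] FINISH migrateCancel "
    r"return=(?P<ret>\{.*\}) from=(?P<src>\S+), flow_id=(?P<flow>\S+)"
)


def parse_migrate_cancel(line):
    """Return timestamp, flow id, status code and message, or None if the
    line is not a FINISH migrateCancel record."""
    match = FINISH_RE.match(line)
    if match is None:
        return None
    # The return value is logged as a Python dict literal, so it can be
    # parsed safely with ast.literal_eval (no code is executed).
    result = ast.literal_eval(match.group("ret"))
    status = result["status"]
    return {
        "timestamp": match.group("ts"),
        "flow_id": match.group("flow"),
        "code": status["code"],
        "message": status["message"],
    }


def scan_log(path):
    """Yield one parsed entry per FINISH migrateCancel record in the file."""
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            entry = parse_migrate_cancel(line.rstrip("\n"))
            if entry is not None:
                yield entry
```

For the FINISH record quoted above, parse_migrate_cancel() gives code 82
and the message 'Migration process cancelled'. The attached log is
xz-compressed, so it would have to be unpacked first (or opened with
lzma.open instead of open).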

>> I already tried restarting ovirt-engine but it didn't work.
> 
> Here the problem is clearly on the Vdsm side.
> 
>> Could someone shed some light on how to cancel the migration status for
>> these machines? All of them seem to be running on the same host.
> 
> Did the VMs get unblocked in the meantime?  I can't know the

No, they didn't. They're still in a "Migrating" state.

> actual state of the given VMs without seeing the complete Vdsm log, so
> it's difficult to give good advice.  I think that a Vdsm restart on the
> given host would help BUT it's generally not a very good idea to restart
> Vdsm if any real migration, outgoing or incoming, is running on the
> host.  VMs that aren't actually being migrated at all (despite being
> reported as migrating) should simply return to the Up state after the
> restart, but VMs with any real migration action pending might get
> returned to the Up state without proper cleanup, resulting in a different
> kind of mess or maybe something even worse (things should improve in
> oVirt 4.2, but it's still good to avoid Vdsm restarts with migrations
> running).
> 

I assume this is not a real migration as it has been in this state for
several days. Would you advise restarting vdsm in this case then?
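
One way to check whether libvirt itself still sees a migration job for a
VM before touching Vdsm could be something like the Python sketch below.
It assumes that virsh is available on the host, that a read-only
connection (-r) is enough for domjobinfo, and that the output consists of
plain "Key: value" lines. The helper names are hypothetical, and this only
shows libvirt's view, not Vdsm's internal state.

```python
import subprocess


def parse_job_info(text):
    """Turn 'Key: value' lines (as assumed for `virsh domjobinfo`) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def has_active_job(info):
    """True if libvirt reports a job type other than 'None'.

    A missing 'Job type' field is treated as 'no job', which is an
    assumption of this sketch."""
    return info.get("Job type", "None") != "None"


def domain_job_info(domain):
    """Run `virsh -r domjobinfo <domain>` and parse its output."""
    proc = subprocess.run(
        ["virsh", "-r", "domjobinfo", domain],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_job_info(proc.stdout)


def domains_with_jobs(domains):
    """Return the subset of the given domain names that have an active job."""
    return [name for name in domains if has_active_job(domain_job_info(name))]
```

If domains_with_jobs() comes back empty for the VMs shown as migrating,
that would at least be consistent with no real migration running from
libvirt's point of view.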

Thank you.

> Regards,
> Milan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vdsm.log.20.xz
Type: application/x-xz
Size: 963208 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20180302/cd436252/attachment.xz>