nicolas(a)devels.es writes:
> El 2018-03-02 14:10, Milan Zamazal escribió:
>> nicolas(a)devels.es writes:
>>
>>> We're running 4.1.9 and during the weekend we had a storage issue that
>>> seemed to leave some hosts in a strange state. One of the hosts has
>>> almost all VMs migrating (although it seems it's not actually migrating
>>> them) and the migration state cannot be cancelled.
>>>
>>> When I click on one of those machines and select 'Cancel migration', I
>>> see the following in the ovirt-engine log:
>>>
>>> 2018-02-26 08:52:07,588Z INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand]
>>> (org.ovirt.thread.pool-6-thread-36)
>>> [887dfbf9-dece-4f7b-90a8-dac02b849b7f]
>>> HostName = host2.domain.com
>>> 2018-02-26 08:52:07,588Z ERROR
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.CancelMigrateVDSCommand]
>>> (org.ovirt.thread.pool-6-thread-36)
>>> [887dfbf9-dece-4f7b-90a8-dac02b849b7f]
>>> Command 'CancelMigrateVDSCommand(HostName = host2.domain.com,
>>> CancelMigrationVDSParameters:{runAsync='true',
>>> hostId='e63b9146-10c4-47ad-bd6c-f053a8c5b4eb',
>>> vmId='26d37e43-32e2-4e55-9c1e-1438518d5021'})' execution failed:
>>> VDSGenericException: VDSErrorException: Failed to CancelMigrateVDS,
>>> error =
>>> Migration process cancelled, code = 82
>>>
>>> On the vdsm side I see:
>>>
>>> 2018-02-26 08:56:19,396+0000 INFO (jsonrpc/0) [vdsm.api] START
>>> migrateCancel()
>>> from=::ffff:10.X.X.X,54654,
>>> flow_id=874d36d7-63f5-4b71-8a4d-6d9f3ec65858
>>> (api:46)
>>> 2018-02-26 08:56:19,398+0000 INFO (jsonrpc/0) [vdsm.api] FINISH
>>> migrateCancel
>>> return={'status': {'message': 'Migration process cancelled', 'code': 82},
>>> 'progress': 0} from=::ffff:10.X.X.X,54654,
>>> flow_id=874d36d7-63f5-4b71-8a4d-6d9f3ec65858 (api:52)
>>>
>>> So there's no error in the vdsm log.
>>
>> Interesting. The messages above indicate that an attempt was made to
>> migrate the VM, but the migration was temporarily rejected on the
>> destination because of the number of incoming migrations already running
>> there (the limit is 2 incoming migrations by default). Later, Vdsm was
>> asked to cancel the outgoing migration and it successfully set a
>> migration-cancelling flag. However, the action was reported to Engine as
>> an error, due to hitting the incoming migration limit on the
>> destination. Maybe it's a bug, I'm not sure; it results in some minor
>> confusion. Normally it shouldn't matter: the migration should be
>> cancelled shortly afterwards anyway and Engine should be informed about
>> that.
>>
>> However, the migration apparently wasn't cancelled here. I can't say
>> what happened without the complete Vdsm log. One possible reason is that
>> the migration has been waiting for the completion of another migration
>> outgoing from the source (only one outgoing migration at a time is
>> allowed by default). In any case it seems the migration either wasn't
>> actually started at all, or its setup started but was never completely
>> finished.
>>
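An aside on the limits mentioned above: conceptually they behave like
bounded semaphores, one on the destination for incoming migrations and one
on the source for outgoing ones. A purely illustrative Python sketch, not
vdsm's actual code, using the default of 2 incoming migrations quoted
above:

  import threading

  # The destination admits a bounded number of concurrent incoming
  # migrations (2 by default, per the explanation above); requests over
  # the limit are temporarily rejected.
  incoming_limit = threading.BoundedSemaphore(2)

  def try_accept_incoming_migration():
      # Non-blocking acquire: if the limit is already reached, the
      # request is refused and the source has to try again later.
      if not incoming_limit.acquire(False):
          return False  # temporary rejection: too many incoming migrations
      return True       # the caller must release() when the migration ends

The same idea applies on the source side, just with a limit of 1 outgoing
migration by default.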
>
> I'm attaching the log. Basically the storage backend was restarted by
> fencing
> and then this issue happened. This was on 26/02 at about 08:52 (log
> time).
Thank you for the log, but the VMs are already “migrating” at its
beginning, so there must have been some problem even earlier.
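If you want to see what Vdsm itself currently thinks about those VMs,
independently of Engine, you can query it directly on the host. A small
sketch, assuming vdsm-client (shipped with 4.1) is available and that the
stats expose 'status' and, for a real migration, 'migrationProgress';
treat the exact field names as assumptions and adjust to whatever your
host actually returns:

  import json
  import subprocess

  def vdsm(namespace, method, **kwargs):
      # vdsm-client prints its result as JSON on stdout.
      cmd = ['vdsm-client', namespace, method]
      cmd += ['%s=%s' % (k, v) for k, v in kwargs.items()]
      return json.loads(subprocess.check_output(cmd))

  # Print the status (and migration progress, if any) of every VM that
  # Vdsm knows about on this host.
  for vm_id in vdsm('Host', 'getVMList'):
      stats = vdsm('VM', 'getStats', vmID=vm_id)
      if isinstance(stats, list):  # getStats may return a one-element list
          stats = stats[0]
      print('%s  status=%s  migrationProgress=%s'
            % (vm_id, stats.get('status'), stats.get('migrationProgress')))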
>>> I already tried restarting ovirt-engine but it didn't work.
>>
>> Here the problem is clearly on the Vdsm side.
>>
>>> Could someone shed some light on how to cancel the migration status
>>> for
>>> these
>>> machines? All of them seem to be running on the same host.
>>
>> Did the VMs get unblocked in the meantime? I can't know what the
>
> No, they didn't. They're still in a "Migrating" state.
>
>> actual state of the given VMs is without seeing the complete Vdsm log,
>> so it's difficult to give good advice. I think that restarting Vdsm on
>> the given host would help, BUT it's generally not a very good idea to
>> restart Vdsm if any real migration, outgoing or incoming, is running on
>> the host. VMs that aren't actually being migrated at all (despite being
>> reported as migrating) should simply return to Up state after the
>> restart, but VMs with any real migration action pending might be
>> returned to Up state without proper cleanup, resulting in a different
>> kind of mess or maybe something even worse (things should improve in
>> oVirt 4.2, but it's still good to avoid Vdsm restarts while migrations
>> are running).
>>
>
> I assume this is not a real migration as it has been in this state for
> several days. Would you advise restarting vdsm in this case then?
I'd say try it. Since nothing has changed for several days, restarting
Vdsm looks like an appropriate action at this point. Just don't make a
habit of it :-).
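Once Vdsm has been restarted, you can quickly confirm from the Engine side
that the VMs have left the Migrating state, for example with the Python
SDK. A minimal sketch, where the engine URL, credentials, CA file and host
name are placeholders for your environment:

  import ovirtsdk4 as sdk
  import ovirtsdk4.types as types

  # Placeholders: adjust to your engine and credentials.
  connection = sdk.Connection(
      url='https://engine.example.com/ovirt-engine/api',
      username='admin@internal',
      password='password',
      ca_file='/etc/pki/ovirt-engine/ca.pem',
  )

  vms_service = connection.system_service().vms_service()
  # Engine search syntax: running VMs on the given host.
  for vm in vms_service.list(search='host=host2.domain.com'):
      if vm.status == types.VmStatus.MIGRATING:
          print('%s is still reported as migrating' % vm.name)

  connection.close()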