[ovirt-devel] Migration failed

Yaniv Kaul ykaul at redhat.com
Tue Dec 19 14:13:56 UTC 2017


On Tue, Dec 19, 2017 at 1:36 PM, Michal Skrivanek <
michal.skrivanek at redhat.com> wrote:

>
> On 19 Dec 2017, at 10:14, Arik Hadas <ahadas at redhat.com> wrote:
>
>
>
> On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek <
> michal.skrivanek at redhat.com> wrote:
>
>>
>> > On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal at redhat.com> wrote:
>> >
>> > Yedidyah Bar David <didi at redhat.com> writes:
>> >
>> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit at ovirt.org>
>> wrote:
>> >>> Jenkins CI posted comments on this change.
>> >>>
>> >>
>> >>> View Change
>> >>>
>> >>> Patch set 3:Continuous-Integration -1
>> >>>
>> >>> Build Failed
>> >>>
>> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/
>> >>> : FAILURE
>> >>
>> >> Console output of above job says:
>> >>
>> >> 08:13:34   # migrate_vm:
>> >> 08:16:37     * Collect artifacts:
>> >> 08:16:40     * Collect artifacts: Success (in 0:00:03)
>> >> 08:16:40   # migrate_vm: Success (in 0:03:06)
>> >> 08:16:40   # Results located at
>> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml
>> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)
>> >> 08:16:40 Error occured, aborting
>> >>
>> >> The file 006_migrations.py.junit.xml [1] says:
>> >>
>> >> <failure type="exceptions.AssertionError" message="False != True after
>> >> 180 seconds">
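>> >>
>> >> The "after 180 seconds" wording suggests the suite asserts a predicate
>> >> with a poll-until-timeout helper. A minimal sketch of that pattern, with
>> >> illustrative names (not necessarily the actual OST helper):
>> >>
>> >>     import time
>> >>
>> >>     def assert_true_within(predicate, timeout=180, interval=3):
>> >>         # Poll `predicate` until it returns True or `timeout` elapses.
>> >>         deadline = time.time() + timeout
>> >>         while time.time() < deadline:
>> >>             if predicate():
>> >>                 return
>> >>             time.sleep(interval)
>> >>         # Produces the "False != True after N seconds" failure seen above.
>> >>         raise AssertionError('False != True after %d seconds' % timeout)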
>> >
>> > Reading the logs, I can see the VM migrates normally and seems to be
>> > reported to Engine correctly.  When Engine receives the end-of-migration
>> > event, it sends Destroy to the source (which is correct), calls dumpxmls
>> > on the destination in the meantime (looks fine to me), and then calls
>>
>> It looks like a race: the getallvmstats response reporting the VM as Down
>> (statusTime: 4296271980) is being processed while a Down/MigrationSucceeded
>> event (notify_time 4296273170) arrives at about the same time.
>> Unfortunately vdsm.log is not at DEBUG level, so there is very little
>> information about why, and what exactly it sent out.
>> @infra - can you enable debug log level for vdsm by default?
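>> For reference, a minimal sketch of what that change could look like,
>> assuming the stock Python-logging style config vdsm ships in
>> /etc/vdsm/logger.conf (section names can differ between versions, so treat
>> this as an illustration rather than the exact file), followed by a vdsmd
>> restart:
>>
>>     [logger_root]
>>     level=DEBUG
>>     # remaining keys (handlers, etc.) left as shipped
>>
>>     [logger_vds]
>>     level=DEBUG
>>     qualname=vds
>>     # remaining keys left as shipped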
>
>
>> It does look like a race to me - does it reproduce?
>
>
>> > Destroy on the destination, which is weird and I don't understand why
>> > the Destroy is invoked.
>> >
>> > Arik, would you like to take a look?  Maybe I overlooked something or
>> > maybe there's a bug.  The logs are at
>> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/
>> > and the interesting things happen around 2017-12-18 03:13:43,758-05.
>>
>
> So it looks like this:
> 1. the engine polls the VMs from the source host
> 2. right after #1 we get the Down event with the proper exit reason (=
> migration succeeded), but the engine doesn't process it since the VM is
> locked by the monitoring as part of processing that poll (to prevent two
> analyses of the same VM running simultaneously).
> 3. the result of the polling is a VM in status Down and most probably
> exit_status=Normal
> 4. the engine decides to abort the migration and thus the monitoring
> thread of the source host destroys the VM on the destination host.
>
> Unfortunately we don't have the exit_reason that is returned by the
> polling.
> However, the only option I can think of is that it is different from
> MigrationSucceeded, because otherwise we would have handed the VM over to
> the destination host rather than aborting the migration [1].
> That part of the code recently changed as part of [2] - we used to
> hand over the VM when we got from the source host:
> status = Down + exit_status = Normal
> And in the database: previous_status = MigrationFrom
> But after that change we require:
> status = Down + exit_status = Normal **+ exit_reason = MigrationSucceeded**
> And in the database: previous_status = MigrationFrom
>
> Long story short, is it possible that VDSM had set the status of the VM to
> Down and exit_status to Normal but the exit_reason was not updated (yet?)
> to MigrationSucceeded?
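>
> To make the suspected decision point concrete, here is a rough Python
> rendition of the condition described above. It is illustrative only - the
> real logic lives in VmAnalyzer.java [1], and the value names are simplified:
>
>     def should_hand_over(status, exit_status, exit_reason):
>         # Old behaviour: Down + Normal coming from a VM previously in
>         # MigrationFrom was enough to hand the VM over to the destination.
>         # After [2], the exit reason must also confirm the migration:
>         return (status == 'Down'
>                 and exit_status == 'Normal'
>                 and exit_reason == 'MigrationSucceeded')
>
> If the polled stats carry status=Down and exit_status=Normal but a missing
> or stale exit_reason, this evaluates to False and the engine aborts the
> migration, destroying the VM on the destination - which matches the logs.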
>
>
> OK, so there might be a plausible explanation:
> the guest drive mapping introduced a significant delay into the
> VM.getStats call, since it tries to update the mapping when it detects a
> change, which is likely to happen on lifecycle changes. In the OST case it
> took 1.2s to finish the whole call, and in the meantime the migration had
> finished. The getStats() call is not written with a possible state change
> in mind, so if the state moves from anything to Down in the middle of it,
> it returns a Down state without exitCode and exitReason, which confuses the
> engine. We started using the exitReason code to differentiate the various
> flavors of Down in the engine in ~4.1, and in this case it results in the
> misleading “VM powered off by admin” case.
>
> we need to fix VM.getStats() to handle VM state changes in the middle of
> the call
> we need to fix the guest drive mapping updates to cleanly handle
> situations when the VM is either not ready yet or already gone
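>
> As a rough illustration of the first fix (not actual vdsm code - the Vm
> class and the collect_guest_drive_mapping() helper below are made up for
> the example), the idea is to re-check the state after the slow part and
> always return the exit information together with Down:
>
>     class Vm(object):
>         # Minimal stand-in for the real VM object.
>         status = 'Up'
>         exit_code = None
>         exit_reason = None
>
>     def collect_guest_drive_mapping(vm):
>         # Placeholder for the slow guest drive mapping refresh.
>         return {'disks': {}}
>
>     def get_stats(vm):
>         stats = {'status': vm.status}
>         if vm.status == 'Down':
>             # Down must always carry its exit information.
>             stats['exitCode'] = vm.exit_code
>             stats['exitReason'] = vm.exit_reason
>             return stats
>
>         # Potentially slow (>1s): the migration may finish underneath us.
>         stats.update(collect_guest_drive_mapping(vm))
>
>         if vm.status == 'Down':
>             # The VM went Down mid-call; report the final state with its
>             # exit information instead of a bare Down that confuses engine.
>             return {'status': 'Down',
>                     'exitCode': vm.exit_code,
>                     'exitReason': vm.exit_reason}
>         return stats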
>

Has anyone filed BZs for both? Can we ensure we do? The bottom line result
sounds severe to me.
Y.


> Thanks,
> michal
>
>
> [1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java#L291
> [2] https://gerrit.ovirt.org/#/c/84387/
>
>
>> >
>> >> Can someone please have a look? Thanks.
>> >>
>> >> As a side note, if this is indeed the cause of the job's failure,
>> >> it's misleading to say "migrate_vm: Success".
>> >>
>> >> [1]
>> >> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrations.py.junit.xml
>> >>
>> >>>
>> >>>
>> >>> Gerrit-Project: ovirt-system-tests
>> >>> Gerrit-Branch: master
>> >>> Gerrit-MessageType: comment
>> >>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5
>> >>> Gerrit-Change-Number: 85177
>> >>> Gerrit-PatchSet: 3
>> >>> Gerrit-Owner: Yedidyah Bar David <didi at redhat.com>
>> >>> Gerrit-Reviewer: Dafna Ron <dron at redhat.com>
>> >>> Gerrit-Reviewer: Eyal Edri <eedri at redhat.com>
>> >>> Gerrit-Reviewer: Jenkins CI
>> >>> Gerrit-Reviewer: Sandro Bonazzola <sbonazzo at redhat.com>
>> >>> Gerrit-Reviewer: Yedidyah Bar David <didi at redhat.com>
>> >>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000
>> >>> Gerrit-HasComments: No
>> >
>> >
>>
>>
>
>
>