[ovirt-devel] Migration failed

Michal Skrivanek michal.skrivanek at redhat.com
Tue Dec 19 11:36:52 UTC 2017


> On 19 Dec 2017, at 10:14, Arik Hadas <ahadas at redhat.com> wrote:
> 
> 
> 
> On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek <michal.skrivanek at redhat.com> wrote:
> 
> > On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal at redhat.com> wrote:
> >
> > Yedidyah Bar David <didi at redhat.com> writes:
> >
> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit at ovirt.org> wrote:
> >>> Jenkins CI posted comments on this change.
> >>>
> >>
> >>> View Change
> >>>
> >>> Patch set 3:Continuous-Integration -1
> >>>
> >>> Build Failed
> >>>
> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/
> >>> : FAILURE
> >>
> >> Console output of above job says:
> >>
> >> 08:13:34   # migrate_vm:
> >> 08:16:37     * Collect artifacts:
> >> 08:16:40     * Collect artifacts: Success (in 0:00:03)
> >> 08:16:40   # migrate_vm: Success (in 0:03:06)
> >> 08:16:40   # Results located at
> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml
> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)
> >> 08:16:40 Error occured, aborting
> >>
> >> The file 006_migrations.py.junit.xml [1] says:
> >>
> >> <failure type="exceptions.AssertionError" message="False != True after
> >> 180 seconds">
> >
> > Reading the logs, I can see the VM migrates normally and seems to be
> > reported to Engine correctly.  When Engine receives end-of-migration
> > event, it sends Destroy to the source (which is correct), calls dumpxmls
> > on the destination in the meantime (looks fine to me) and then calls
> 
> looks like a race between a getallvmstats response reporting the VM as Down (statusTime: 4296271980) being processed while a Down/MigrationSucceeded event (notify_time: 4296273170) arrives at about the same time
> Unfortunately the vdsm.log is not at DEBUG level, so there’s very little information as to why, and what exactly it sent out.
> @infra - can you enable debug log level for vdsm by default? 
> 
> It does look like a race to me - does it reproduce? 
> 
> > Destroy on the destination, which is weird and I don't understand why
> > the Destroy is invoked.
> >
> > Arik, would you like to take a look?  Maybe I overlooked something or
> > maybe there's a bug.  The logs are at
> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/
> > and the interesting things happen around 2017-12-18 03:13:43,758-05.
> 
> So it looks like that:
> 1. the engine polls the VMs from the source host
> 2. right after #1 we get the down event with the proper exit reason (= migration succeeded), but the engine doesn't process it since the VM is locked by the monitoring as part of processing that polling (to prevent two analyses of the same VM running simultaneously).
> 3. the result of the polling is a VM in status Down and most probably exit_status=Normal
> 4. the engine decides to abort the migration and thus the monitoring thread of the source host destroys the VM on the destination host.
> 
> Unfortunately we don't have the exit_reason that is returned by the polling.
> However, the only option I can think of is that it is different from MigrationSucceeded, because otherwise we would have handed the VM over to the destination host rather than aborting the migration [1].
> That part of the code recently changed as part of [2] - we used to hand over the VM when we got from the source host:
> status = Down + exit_status = Normal 
> And in the database: previous_status = MigrationFrom
> But after that change we require:
> status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded **
> And in the database: previous_status = MigrationFrom
> 
> Long story short, is it possible that VDSM had set the status of the VM to Down and exit_status to Normal but the exit_reason was not updated (yet?) to MigrationSucceeded?

ok, so there might be a plausible explanation
the guest drive mapping introduced a significant delay into the VM.getStats call, since it tries to update the mapping whenever it detects a change, which is likely to happen on lifecycle changes. In the OST case the whole call took 1.2s to finish, and in the meantime the migration finished. The getStats() call is not written with a possible state change in mind, so if the state moves from anything to Down in the middle of it, it returns a Down state without exitCode and exitReason, which confuses the engine. Since ~4.1 the engine uses the exitReason code to differentiate the various flavors of Down, and in this case the missing reason results in a misleading “VM powered off by admin”.
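The race can be sketched roughly like this (a simplified illustrative model, not actual VDSM code - the `Vm` class, field names, and the hardcoded 1.2s delay are all assumptions standing in for the real guest-drive-mapping update):

```python
import threading
import time

class Vm:
    """Toy model of the getStats race: the status is sampled before a
    slow guest-drive-mapping update, so a migration can finish mid-call."""

    def __init__(self):
        self.state = 'MigrationSource'
        self.exit_code = None
        self.exit_reason = None
        self._lock = threading.Lock()

    def finish_migration(self):
        # Migration thread: all three lifecycle fields are set together.
        with self._lock:
            self.state = 'Down'
            self.exit_code = 0      # Normal
            self.exit_reason = 4    # MigrationSucceeded (illustrative value)

    def get_stats(self):
        state = self.state          # sampled early...
        time.sleep(1.2)             # ...then the slow drive-mapping update
        if self.state == 'Down' and state != 'Down':
            # State changed mid-call: Down is reported, but exitCode and
            # exitReason were never read, so the engine sees a bare Down.
            return {'status': 'Down'}
        stats = {'status': state}
        if state == 'Down':
            stats['exitCode'] = self.exit_code
            stats['exitReason'] = self.exit_reason
        return stats

vm = Vm()
t = threading.Timer(0.5, vm.finish_migration)  # migration finishes mid-call
t.start()
stats = vm.get_stats()
t.join()
# stats lacks exitCode/exitReason even though the migration succeeded
```

The engine then has no way to tell this Down apart from a powered-off VM, matching the misleading “VM powered off by admin” interpretation described above.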

we need to fix VM.getStats() to handle VM state changes in the middle of the call
we need to fix the guest drive mapping updates to cleanly handle situations where the VM is either not ready yet or already gone
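One possible shape for the first fix, sketched on the same toy model as above (again an assumption about the approach, not the actual patch): take a consistent snapshot of status, exitCode, and exitReason under a lock before any slow work, so a mid-call transition can no longer tear the reported state:

```python
import threading

class VmFixed:
    """Sketch of a fix: snapshot all lifecycle fields atomically first,
    so a Down status is always reported with its exitCode/exitReason."""

    def __init__(self):
        self.state = 'MigrationSource'
        self.exit_code = None
        self.exit_reason = None
        self._lock = threading.Lock()

    def finish_migration(self):
        with self._lock:
            self.state = 'Down'
            self.exit_code = 0      # Normal
            self.exit_reason = 4    # MigrationSucceeded (illustrative value)

    def get_stats(self):
        # Consistent snapshot under the lock, before any slow work.
        with self._lock:
            snapshot = {
                'status': self.state,
                'exitCode': self.exit_code,
                'exitReason': self.exit_reason,
            }
        # The slow guest-drive-mapping update would go here; it can no
        # longer affect what is reported.
        if snapshot['status'] != 'Down':
            del snapshot['exitCode']
            del snapshot['exitReason']
        return snapshot

vm = VmFixed()
vm.finish_migration()
stats = vm.get_stats()
# Down now always carries exitCode and exitReason together
```

The second fix (tolerating a VM that is not ready yet or already gone during the mapping update) would be a separate change on the drive-mapping side.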

Thanks,
michal

> 
> [1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java#L291
> [2] https://gerrit.ovirt.org/#/c/84387/
>  
> >
> >> Can someone please have a look? Thanks.
> >>
> >> As a side note, if indeed this is the cause for the failure for the
> >> job, it's misleading to say "migrate_vm: Success".
> >>
> >> [1]
> >> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrations.py.junit.xml
> >>
> >>>
> >>> To view, visit change 85177. To unsubscribe, visit settings.
> >>>
> >>> Gerrit-Project: ovirt-system-tests
> >>> Gerrit-Branch: master
> >>> Gerrit-MessageType: comment
> >>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5
> >>> Gerrit-Change-Number: 85177
> >>> Gerrit-PatchSet: 3
> >>> Gerrit-Owner: Yedidyah Bar David <didi at redhat.com>
> >>> Gerrit-Reviewer: Dafna Ron <dron at redhat.com>
> >>> Gerrit-Reviewer: Eyal Edri <eedri at redhat.com>
> >>> Gerrit-Reviewer: Jenkins CI
> >>> Gerrit-Reviewer: Sandro Bonazzola <sbonazzo at redhat.com>
> >>> Gerrit-Reviewer: Yedidyah Bar David <didi at redhat.com>
> >>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000
> >>> Gerrit-HasComments: No
> > _______________________________________________
> > Devel mailing list
> > Devel at ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/devel
> >
> >
> 
> 
