[ovirt-devel] [ OST Failure Report ] [ oVirt master ] [ 03-07-2017 ] [ 006_migrations.migrate_vm ]

Michal Skrivanek michal.skrivanek at redhat.com
Tue Jul 4 11:25:43 UTC 2017


> On 4 Jul 2017, at 13:00, Eyal Edri <eedri at redhat.com> wrote:
> 
> I was able to reproduce the error [1] on a manual run with only new vdsm from [2],
> and also to verify that w/o this change, while using latest tested run [3] it works.
> 
> So I think this proves quite clearly the problem is one of the latest VDSM patches.

There is only a single patch between vdsms [1] and [3]
https://gerrit.ovirt.org/#/c/78536

> 
> I'm running again the test with the suspected bad VDSM and hopefully will be able to extract the env to tar.gz file
> which anyone can import using the lago demo tool.
> 
> 
> 
> [1] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/748/ <http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/748/>
> [2] http://jenkins.ovirt.org/job/vdsm_master_build-artifacts-el7-x86_64/2694/ <http://jenkins.ovirt.org/job/vdsm_master_build-artifacts-el7-x86_64/2694/>
> [3] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/747/ <http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/747/>
> 
> 
> 
> On Tue, Jul 4, 2017 at 1:30 PM, Nadav Goldin <ngoldin at redhat.com <mailto:ngoldin at redhat.com>> wrote:
> Hi, sorry for posting late, I had a brief look at this yesterday:
> 1. I couldn't replicate it locally - which means it is most likely a
> recent change.
> 2. I looked at the libvirt XMLs Lago generatd for the hosts, as a new
> version is used this week(0.40) - and they seem OK - specifically
> memroy and vcpus(which was my initial suspect).
> 3. I saw two Engine patches, a bit prior to the time it started to
> fail, which *might* in my common sense be related, but it is out of my
> scope to tell(CC'ed patch owners):
> 
> core: Make VmAnalyzer to treat a migrated Paused VM as success -
> https://gerrit.ovirt.org/78305 <https://gerrit.ovirt.org/78305>
> 
> fix custom fencing default config setting
> https://gerrit.ovirt.org/78720 <https://gerrit.ovirt.org/78720>
> 
> Shot in the wild - Could it be that the 'CPUOverload' filter was not
> active before for some reason?
> 
> Also, there are some exceptions in host0 vdsm log[1], failing to get
> VM stats, though I can't tell if they are specific to this failure.
> 
> Of course this is not a complete analysis, I hope it helps.
> 
> 
> [1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log <http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log>
> 
> 
> Nadav.
> 
> 
> 
> 
> 
> On Tue, Jul 4, 2017 at 12:46 PM, Eyal Edri <eedri at redhat.com <mailto:eedri at redhat.com>> wrote:
> >
> >
> > On Tue, Jul 4, 2017 at 12:18 PM, Michal Skrivanek
> > <michal.skrivanek at redhat.com <mailto:michal.skrivanek at redhat.com>> wrote:
> >>
> >>
> >> On 3 Jul 2017, at 15:35, Shlomo Ben David <sbendavi at redhat.com <mailto:sbendavi at redhat.com>> wrote:
> >>
> >> Hi,
> >>
> >> Test failed: [ 006_migrations.migrate_vm ]
> >> Link to suspected patches: N/A
> >> Link to Job:
> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/ <http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/>
> >> Link to all logs:
> >> Error snippet from the log:
> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/ <http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/>
> >>
> >> <error>
> >>
> >>  "Fault reason is "Operation Failed". Fault detail is "[Cannot migrate VM.
> >> There is no host that satisfies current scheduling constraints. See below
> >> for details:, The host lago-basic-suite-master-host0 did not satisfy
> >> internal filter CPUOverloaded because its CPU is too loaded.]"
> >>
> >> </error>
> >>
> >> <engine log>
> >>
> >> 2017-07-02 16:43:22,829-04 INFO
> >> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
> >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock Acquired to object
> >> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
> >> sharedLocks=''}'
> >> 2017-07-02 16:43:22,833-04 DEBUG
> >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
> >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
> >> procedure. Call string is [{call getdiskvmelementspluggedtovm(?)}]
> >> 2017-07-02 16:43:22,833-04 DEBUG
> >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
> >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
> >> procedure [GetDiskVmElementsPluggedToVm] compiled
> >> 2017-07-02 16:43:22,843-04 DEBUG
> >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
> >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
> >> procedure. Call string is [{call getattacheddisksnapshotstovm(?, ?)}]
> >> 2017-07-02 16:43:22,843-04 DEBUG
> >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
> >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
> >> procedure [GetAttachedDiskSnapshotsToVm] compiled
> >> 2017-07-02 16:43:22,919-04 INFO
> >> [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-27)
> >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Candidate host
> >> 'lago-basic-suite-master-host0' ('46bdc63d-98f5-4eee-81aa-2fb88b8f7cbe') was
> >> filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'CPUOverloaded'
> >> (correlation id: null)
> >> 2017-07-02 16:43:22,920-04 WARN
> >> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
> >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Validation of action
> >> 'MigrateVmToServer' failed for user admin at internal-authz. Reasons:
> >> VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
> >> lago-basic-suite-master-host0,$filterName
> >> CPUOverloaded,VAR__DETAIL__CPU_OVERLOADED,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
> >>
> >>
> >>
> >> This has nothing to do with migration
> >> The CPUOverload is a scheduling policy, unless there was any change in
> >> that area the obvious explanation would be that the host has a CPU overload
> >> condition.
> >> I briefly looked at logs and see ""cpuUser": "83.40", "cpuSys": "16.59",
> >> "cpuIdle": “0.08”” which indeed suggests an overload, from the same sample I
> >> can see it’s vdsm ("cpuUserVdsmd": “77.38”, cpuSysVdsmd": “18.44"
> >>
> >> Since similar values are consistently being reported for some time, and
> >> there is a setupNetworks and storage rescan prior to the the failure, and
> >> there is no other indication of anything wrong, I’d just say the environment
> >> or the order of tests or timing has changed, but nothing wrong with the
> >> oVirt code
> >> Did any of that changed recently? Does it reproduce locally?
> >
> >
> > AFAIK, no significant environment changes or tests were done.
> > We will try to reproduce it locally and also on the manual job,  but from
> > what it looks it is very consistent (unlike other race failures we've seen
> > lately ) and continues to fails on the same tests, so its either a change in
> > oVirt or something else that we're not thinking on.
> >
> >>
> >>
> >> Thanks,
> >> michal
> >>
> >> 2017-07-02 16:43:22,920-04 INFO
> >> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
> >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock freed to object
> >> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
> >> sharedLocks=''}'
> >> 2017-07-02 16:43:22,929-04 DEBUG
> >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
> >> (DefaultQuartzScheduler7) [] Rescheduling
> >> DEFAULT.org.ovirt.engine.core.bll.ColdRebootAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775733
> >> as there is no unfired trigger.
> >> 2017-07-02 16:43:22,932-04 ERROR
> >> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
> >> task-27) [] Operation Failed: [Cannot migrate VM. There is no host that
> >> satisfies current scheduling constraints. See below for details:, The host
> >> lago-basic-suite-master-host0 did not satisfy internal filter CPUOverloaded
> >> because its CPU is too loaded.]
> >> 2017-07-02 16:43:23,331-04 DEBUG
> >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
> >> (DefaultQuartzScheduler2) [] Rescheduling
> >> DEFAULT.org.ovirt.engine.core.bll.HaAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775793
> >> as there is no unfired trigger.
> >> 2017-07-02 16:43:23,332-04 DEBUG
> >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
> >> (DefaultQuartzScheduler2) [] Rescheduling
> >> DEFAULT.org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods#-9223372036854775783
> >> as there is no unfired trigger.
> >>
> >> <engine log>
> >>
> >>
> >>
> >> Best Regards,
> >>
> >> Shlomi Ben-David | Software Engineer | Red Hat ISRAEL
> >> RHCSA | RHCVA | RHCE
> >> IRC: shlomibendavid (on #rhev-integ, #rhev-dev, #rhev-ci)
> >>
> >> OPEN SOURCE - 1 4 011 && 011 4 1
> >>
> >> _______________________________________________
> >> Devel mailing list
> >> Devel at ovirt.org <mailto:Devel at ovirt.org>
> >> http://lists.ovirt.org/mailman/listinfo/devel <http://lists.ovirt.org/mailman/listinfo/devel>
> >>
> >>
> >>
> >> _______________________________________________
> >> Devel mailing list
> >> Devel at ovirt.org <mailto:Devel at ovirt.org>
> >> http://lists.ovirt.org/mailman/listinfo/devel <http://lists.ovirt.org/mailman/listinfo/devel>
> >
> >
> >
> >
> > --
> >
> > Eyal edri
> >
> >
> > ASSOCIATE MANAGER
> >
> > RHV DevOps
> >
> > EMEA VIRTUALIZATION R&D
> >
> >
> > Red Hat EMEA
> >
> > TRIED. TESTED. TRUSTED.
> > phone: +972-9-7692018 <tel:%2B972-9-7692018>
> > irc: eedri (on #tlv #rhev-dev #rhev-integ)
> >
> > _______________________________________________
> > Devel mailing list
> > Devel at ovirt.org <mailto:Devel at ovirt.org>
> > http://lists.ovirt.org/mailman/listinfo/devel <http://lists.ovirt.org/mailman/listinfo/devel>
> 
> 
> 
> -- 
> EYAL EDRI
> 
> ASSOCIATE MANAGER
> RHV DEVOPS
> EMEA VIRTUALIZATION R&D
> 
> Red Hat EMEA <https://www.redhat.com/>
>  <https://red.ht/sig>	TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
> phone: +972-9-7692018 <tel:+972%209-769-2018>
> irc: eedri (on #tlv #rhev-dev #rhev-integ)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ovirt.org/pipermail/devel/attachments/20170704/25c5e48e/attachment-0001.html>


More information about the Devel mailing list