[ovirt-devel] [ OST Failure Report ] [ oVirt master ] [ 03-07-2017 ] [ 006_migrations.migrate_vm ]

Eyal Edri eedri at redhat.com
Tue Jul 4 11:00:04 UTC 2017


I was able to reproduce the error [1] on a manual run with only new vdsm
from [2],
and also to verify that w/o this change, while using latest tested run [3]
it works.

So I think this proves quite clearly the problem is one of the latest VDSM
patches.

I'm running again the test with the suspected bad VDSM and hopefully will
be able to extract the env to tar.gz file
which anyone can import using the lago demo tool.



[1] http://jenkins.ovirt.org/view/oVirt%20system%20tests/
job/ovirt-system-tests_manual/748/
[2] http://jenkins.ovirt.org/job/vdsm_master_build-
artifacts-el7-x86_64/2694/
[3] http://jenkins.ovirt.org/view/oVirt%20system%20tests/
job/ovirt-system-tests_manual/747/



On Tue, Jul 4, 2017 at 1:30 PM, Nadav Goldin <ngoldin at redhat.com> wrote:

> Hi, sorry for posting late, I had a brief look at this yesterday:
> 1. I couldn't replicate it locally - which means it is most likely a
> recent change.
> 2. I looked at the libvirt XMLs Lago generatd for the hosts, as a new
> version is used this week(0.40) - and they seem OK - specifically
> memroy and vcpus(which was my initial suspect).
> 3. I saw two Engine patches, a bit prior to the time it started to
> fail, which *might* in my common sense be related, but it is out of my
> scope to tell(CC'ed patch owners):
>
> core: Make VmAnalyzer to treat a migrated Paused VM as success -
> https://gerrit.ovirt.org/78305
>
> fix custom fencing default config setting
> https://gerrit.ovirt.org/78720
>
> Shot in the wild - Could it be that the 'CPUOverload' filter was not
> active before for some reason?
>
> Also, there are some exceptions in host0 vdsm log[1], failing to get
> VM stats, though I can't tell if they are specific to this failure.
>
> Of course this is not a complete analysis, I hope it helps.
>
>
> [1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_ma
> ster/7431/artifact/exported-artifacts/basic-suit-master-el7/
> test_logs/basic-suite-master/post-006_migrations.py/lago-
> basic-suite-master-host0/_var_log/vdsm/vdsm.log
>
>
> Nadav.
>
>
>
>
>
> On Tue, Jul 4, 2017 at 12:46 PM, Eyal Edri <eedri at redhat.com> wrote:
> >
> >
> > On Tue, Jul 4, 2017 at 12:18 PM, Michal Skrivanek
> > <michal.skrivanek at redhat.com> wrote:
> >>
> >>
> >> On 3 Jul 2017, at 15:35, Shlomo Ben David <sbendavi at redhat.com> wrote:
> >>
> >> Hi,
> >>
> >> Test failed: [ 006_migrations.migrate_vm ]
> >> Link to suspected patches: N/A
> >> Link to Job:
> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/
> >> Link to all logs:
> >> Error snippet from the log:
> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_ma
> ster/7431/artifact/exported-artifacts/basic-suit-master-el7/
> test_logs/basic-suite-master/post-006_migrations.py/
> >>
> >> <error>
> >>
> >>  "Fault reason is "Operation Failed". Fault detail is "[Cannot migrate
> VM.
> >> There is no host that satisfies current scheduling constraints. See
> below
> >> for details:, The host lago-basic-suite-master-host0 did not satisfy
> >> internal filter CPUOverloaded because its CPU is too loaded.]"
> >>
> >> </error>
> >>
> >> <engine log>
> >>
> >> 2017-07-02 16:43:22,829-04 INFO
> >> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
> >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock Acquired to object
> >> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e847
> 3eb5d9=VM]',
> >> sharedLocks=''}'
> >> 2017-07-02 16:43:22,833-04 DEBUG
> >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$
> PostgresSimpleJdbcCall]
> >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled
> stored
> >> procedure. Call string is [{call getdiskvmelementspluggedtovm(?)}]
> >> 2017-07-02 16:43:22,833-04 DEBUG
> >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$
> PostgresSimpleJdbcCall]
> >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
> >> procedure [GetDiskVmElementsPluggedToVm] compiled
> >> 2017-07-02 16:43:22,843-04 DEBUG
> >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$
> PostgresSimpleJdbcCall]
> >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled
> stored
> >> procedure. Call string is [{call getattacheddisksnapshotstovm(?, ?)}]
> >> 2017-07-02 16:43:22,843-04 DEBUG
> >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$
> PostgresSimpleJdbcCall]
> >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
> >> procedure [GetAttachedDiskSnapshotsToVm] compiled
> >> 2017-07-02 16:43:22,919-04 INFO
> >> [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default
> task-27)
> >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Candidate host
> >> 'lago-basic-suite-master-host0' ('46bdc63d-98f5-4eee-81aa-2fb88b8f7cbe')
> was
> >> filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'CPUOverloaded'
> >> (correlation id: null)
> >> 2017-07-02 16:43:22,920-04 WARN
> >> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
> >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Validation of action
> >> 'MigrateVmToServer' failed for user admin at internal-authz. Reasons:
> >> VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILT
> ERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
> >> lago-basic-suite-master-host0,$filterName
> >> CPUOverloaded,VAR__DETAIL__CPU_OVERLOADED,SCHEDULING_HOST_
> FILTERED_REASON_WITH_DETAIL
> >>
> >>
> >>
> >> This has nothing to do with migration
> >> The CPUOverload is a scheduling policy, unless there was any change in
> >> that area the obvious explanation would be that the host has a CPU
> overload
> >> condition.
> >> I briefly looked at logs and see ""cpuUser": "83.40", "cpuSys": "16.59",
> >> "cpuIdle": “0.08”” which indeed suggests an overload, from the same
> sample I
> >> can see it’s vdsm ("cpuUserVdsmd": “77.38”, cpuSysVdsmd": “18.44"
> >>
> >> Since similar values are consistently being reported for some time, and
> >> there is a setupNetworks and storage rescan prior to the the failure,
> and
> >> there is no other indication of anything wrong, I’d just say the
> environment
> >> or the order of tests or timing has changed, but nothing wrong with the
> >> oVirt code
> >> Did any of that changed recently? Does it reproduce locally?
> >
> >
> > AFAIK, no significant environment changes or tests were done.
> > We will try to reproduce it locally and also on the manual job,  but from
> > what it looks it is very consistent (unlike other race failures we've
> seen
> > lately ) and continues to fails on the same tests, so its either a
> change in
> > oVirt or something else that we're not thinking on.
> >
> >>
> >>
> >> Thanks,
> >> michal
> >>
> >> 2017-07-02 16:43:22,920-04 INFO
> >> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
> >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock freed to object
> >> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e847
> 3eb5d9=VM]',
> >> sharedLocks=''}'
> >> 2017-07-02 16:43:22,929-04 DEBUG
> >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
> >> (DefaultQuartzScheduler7) [] Rescheduling
> >> DEFAULT.org.ovirt.engine.core.bll.ColdRebootAutoStartVmsRunn
> er.startFailedAutoStartVms#-9223372036854775733
> >> as there is no unfired trigger.
> >> 2017-07-02 16:43:22,932-04 ERROR
> >> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
> (default
> >> task-27) [] Operation Failed: [Cannot migrate VM. There is no host that
> >> satisfies current scheduling constraints. See below for details:, The
> host
> >> lago-basic-suite-master-host0 did not satisfy internal filter
> CPUOverloaded
> >> because its CPU is too loaded.]
> >> 2017-07-02 16:43:23,331-04 DEBUG
> >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
> >> (DefaultQuartzScheduler2) [] Rescheduling
> >> DEFAULT.org.ovirt.engine.core.bll.HaAutoStartVmsRunner.start
> FailedAutoStartVms#-9223372036854775793
> >> as there is no unfired trigger.
> >> 2017-07-02 16:43:23,332-04 DEBUG
> >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
> >> (DefaultQuartzScheduler2) [] Rescheduling
> >> DEFAULT.org.ovirt.engine.core.bll.tasks.CommandCallbacksPoll
> er.invokeCallbackMethods#-9223372036854775783
> >> as there is no unfired trigger.
> >>
> >> <engine log>
> >>
> >>
> >>
> >> Best Regards,
> >>
> >> Shlomi Ben-David | Software Engineer | Red Hat ISRAEL
> >> RHCSA | RHCVA | RHCE
> >> IRC: shlomibendavid (on #rhev-integ, #rhev-dev, #rhev-ci)
> >>
> >> OPEN SOURCE - 1 4 011 && 011 4 1
> >>
> >> _______________________________________________
> >> Devel mailing list
> >> Devel at ovirt.org
> >> http://lists.ovirt.org/mailman/listinfo/devel
> >>
> >>
> >>
> >> _______________________________________________
> >> Devel mailing list
> >> Devel at ovirt.org
> >> http://lists.ovirt.org/mailman/listinfo/devel
> >
> >
> >
> >
> > --
> >
> > Eyal edri
> >
> >
> > ASSOCIATE MANAGER
> >
> > RHV DevOps
> >
> > EMEA VIRTUALIZATION R&D
> >
> >
> > Red Hat EMEA
> >
> > TRIED. TESTED. TRUSTED.
> > phone: +972-9-7692018
> > irc: eedri (on #tlv #rhev-dev #rhev-integ)
> >
> > _______________________________________________
> > Devel mailing list
> > Devel at ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/devel
>



-- 

Eyal edri


ASSOCIATE MANAGER

RHV DevOps

EMEA VIRTUALIZATION R&D


Red Hat EMEA <https://www.redhat.com/>
<https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
phone: +972-9-7692018 <+972%209-769-2018>
irc: eedri (on #tlv #rhev-dev #rhev-integ)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ovirt.org/pipermail/devel/attachments/20170704/051a037c/attachment-0001.html>


More information about the Devel mailing list