[ovirt-devel] [ OST Failure Report ] [ oVirt master ] [ 03-07-2017 ] [ 006_migrations.migrate_vm ]

Nadav Goldin ngoldin at redhat.com
Tue Jul 4 10:30:42 UTC 2017


Hi, sorry for posting late, I had a brief look at this yesterday:
1. I couldn't replicate it locally - which means it is most likely a
recent change.
2. I looked at the libvirt XMLs Lago generated for the hosts, as a new
version (0.40) is used this week - and they seem OK - specifically
memory and vcpus (which were my initial suspects).
3. I saw two Engine patches from shortly before it started to
fail, which *might* be related, but judging that is out of my
scope (CC'ed patch owners):

core: Make VmAnalyzer to treat a migrated Paused VM as success -
https://gerrit.ovirt.org/78305

fix custom fencing default config setting
https://gerrit.ovirt.org/78720

Shot in the dark - could it be that the 'CPUOverloaded' filter was not
active before for some reason?
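
One way to check that would be to count how often the scheduler reports
hosts as filtered out in engine logs from earlier, passing runs. A minimal
sketch, assuming each scheduler message sits on a single line in engine.log
(it is only wrapped in this mail); the function name is made up for
illustration, and the sample line is copied from the engine log quoted below:

```python
import re
from collections import Counter

# Matches the SchedulingManager message seen in the quoted engine log:
#   Candidate host '<name>' ('<id>') was filtered out by '<type>' filter '<filter>'
FILTERED_RE = re.compile(
    r"Candidate host '(?P<host>[^']+)' \('[^']+'\) was filtered out by "
    r"'[^']+' filter '(?P<filter>[^']+)'"
)


def count_filtered(log_text):
    """Count (host, filter) pairs reported as filtered out in log_text."""
    return Counter(
        (m.group("host"), m.group("filter"))
        for m in FILTERED_RE.finditer(log_text)
    )


# Sample line taken from the failing run, joined back into one line.
sample = (
    "2017-07-02 16:43:22,919-04 INFO "
    "[org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-27) "
    "[87508047-fdc5-4a2f-9692-c83f7b55bbc2] Candidate host "
    "'lago-basic-suite-master-host0' ('46bdc63d-98f5-4eee-81aa-2fb88b8f7cbe') was "
    "filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'CPUOverloaded' "
    "(correlation id: null)"
)

for (host, flt), n in sorted(count_filtered(sample).items()):
    print(n, host, flt)
```

Running the same function over the full engine.log of an older job would
show whether 'CPUOverloaded' ever rejected a host there.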

Also, there are some exceptions in host0 vdsm log[1], failing to get
VM stats, though I can't tell if they are specific to this failure.
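
As for the host CPU figures Michal quotes below, they can be sanity-checked
with something like the following. This is only a sketch: the key names are
the ones from the quoted sample, the 80% threshold is a placeholder I made up
for illustration (the value the engine really applies comes from the
scheduling policy), and I am assuming the vdsmd values are percentages on the
same scale as the host ones:

```python
import json

# Hypothetical threshold, for illustration only; not taken from the engine.
ASSUMED_HIGH_UTILIZATION = 80.0


def busy_percent(stats, user_key, sys_key):
    """Return user + system CPU percent for the given pair of keys."""
    return float(stats[user_key]) + float(stats[sys_key])


# Values copied from the stats sample quoted in this thread.
sample = json.loads(
    '{"cpuUser": "83.40", "cpuSys": "16.59", "cpuIdle": "0.08",'
    ' "cpuUserVdsmd": "77.38", "cpuSysVdsmd": "18.44"}'
)

host_busy = busy_percent(sample, "cpuUser", "cpuSys")
vdsmd_busy = busy_percent(sample, "cpuUserVdsmd", "cpuSysVdsmd")

print(f"host busy: {host_busy:.2f}%  vdsmd: {vdsmd_busy:.2f}%")
print("over assumed threshold:", host_busy > ASSUMED_HIGH_UTILIZATION)
```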

Of course this is not a complete analysis, I hope it helps.


[1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log


Nadav.





On Tue, Jul 4, 2017 at 12:46 PM, Eyal Edri <eedri at redhat.com> wrote:
>
>
> On Tue, Jul 4, 2017 at 12:18 PM, Michal Skrivanek
> <michal.skrivanek at redhat.com> wrote:
>>
>>
>> On 3 Jul 2017, at 15:35, Shlomo Ben David <sbendavi at redhat.com> wrote:
>>
>> Hi,
>>
>> Test failed: [ 006_migrations.migrate_vm ]
>> Link to suspected patches: N/A
>> Link to Job:
>> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/
>> Link to all logs:
>> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/
>> Error snippet from the log:
>>
>> <error>
>>
>>  "Fault reason is "Operation Failed". Fault detail is "[Cannot migrate VM.
>> There is no host that satisfies current scheduling constraints. See below
>> for details:, The host lago-basic-suite-master-host0 did not satisfy
>> internal filter CPUOverloaded because its CPU is too loaded.]"
>>
>> </error>
>>
>> <engine log>
>>
>> 2017-07-02 16:43:22,829-04 INFO
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock Acquired to object
>> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
>> sharedLocks=''}'
>> 2017-07-02 16:43:22,833-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
>> procedure. Call string is [{call getdiskvmelementspluggedtovm(?)}]
>> 2017-07-02 16:43:22,833-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
>> procedure [GetDiskVmElementsPluggedToVm] compiled
>> 2017-07-02 16:43:22,843-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
>> procedure. Call string is [{call getattacheddisksnapshotstovm(?, ?)}]
>> 2017-07-02 16:43:22,843-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
>> procedure [GetAttachedDiskSnapshotsToVm] compiled
>> 2017-07-02 16:43:22,919-04 INFO
>> [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Candidate host
>> 'lago-basic-suite-master-host0' ('46bdc63d-98f5-4eee-81aa-2fb88b8f7cbe') was
>> filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'CPUOverloaded'
>> (correlation id: null)
>> 2017-07-02 16:43:22,920-04 WARN
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Validation of action
>> 'MigrateVmToServer' failed for user admin@internal-authz. Reasons:
>> VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
>> lago-basic-suite-master-host0,$filterName
>> CPUOverloaded,VAR__DETAIL__CPU_OVERLOADED,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
>>
>>
>>
>> This has nothing to do with migration.
>> CPUOverloaded is a scheduling filter; unless there was any change in
>> that area, the obvious explanation would be that the host has a CPU overload
>> condition.
>> I briefly looked at the logs and see "cpuUser": "83.40", "cpuSys": "16.59",
>> "cpuIdle": "0.08", which indeed suggests an overload; from the same sample I
>> can see it's vdsm ("cpuUserVdsmd": "77.38", "cpuSysVdsmd": "18.44").
>>
>> Since similar values have been reported consistently for some time,
>> there is a setupNetworks and a storage rescan prior to the failure, and
>> there is no other indication of anything wrong, I'd just say the environment
>> or the order of tests or the timing has changed, but nothing is wrong with the
>> oVirt code.
>> Did any of that change recently? Does it reproduce locally?
>
>
> AFAIK, no significant environment or test changes were made.
> We will try to reproduce it locally and also on the manual job, but from
> what it looks like it is very consistent (unlike other race failures we've seen
> lately) and continues to fail on the same tests, so it's either a change in
> oVirt or something else that we're not thinking of.
>
>>
>>
>> Thanks,
>> michal
>>
>> 2017-07-02 16:43:22,920-04 INFO
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock freed to object
>> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
>> sharedLocks=''}'
>> 2017-07-02 16:43:22,929-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler7) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.ColdRebootAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775733
>> as there is no unfired trigger.
>> 2017-07-02 16:43:22,932-04 ERROR
>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>> task-27) [] Operation Failed: [Cannot migrate VM. There is no host that
>> satisfies current scheduling constraints. See below for details:, The host
>> lago-basic-suite-master-host0 did not satisfy internal filter CPUOverloaded
>> because its CPU is too loaded.]
>> 2017-07-02 16:43:23,331-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler2) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.HaAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775793
>> as there is no unfired trigger.
>> 2017-07-02 16:43:23,332-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler2) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods#-9223372036854775783
>> as there is no unfired trigger.
>>
>> </engine log>
>>
>>
>>
>> Best Regards,
>>
>> Shlomi Ben-David | Software Engineer | Red Hat ISRAEL
>> RHCSA | RHCVA | RHCE
>> IRC: shlomibendavid (on #rhev-integ, #rhev-dev, #rhev-ci)
>>
>> OPEN SOURCE - 1 4 011 && 011 4 1
>>
>> _______________________________________________
>> Devel mailing list
>> Devel at ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
>
>
>
>
> --
>
> Eyal Edri
>
>
> ASSOCIATE MANAGER
>
> RHV DevOps
>
> EMEA VIRTUALIZATION R&D
>
>
> Red Hat EMEA
>
> TRIED. TESTED. TRUSTED.
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>

