I was able to reproduce the error [1] on a manual run with only new vdsm from [2],
and also to verify that w/o this change, while using latest tested run [3] it works.

So I think this proves quite clearly the problem is one of the latest VDSM patches.

I'm running again the test with the suspected bad VDSM and hopefully will be able to extract the env to tar.gz file
which anyone can import using the lago demo tool.



[1] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/748/
[2] http://jenkins.ovirt.org/job/vdsm_master_build-artifacts-el7-x86_64/2694/
[3] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/747/



On Tue, Jul 4, 2017 at 1:30 PM, Nadav Goldin <ngoldin@redhat.com> wrote:
Hi, sorry for posting late, I had a brief look at this yesterday:
1. I couldn't replicate it locally - which means it is most likely a
recent change.
2. I looked at the libvirt XMLs Lago generatd for the hosts, as a new
version is used this week(0.40) - and they seem OK - specifically
memroy and vcpus(which was my initial suspect).
3. I saw two Engine patches, a bit prior to the time it started to
fail, which *might* in my common sense be related, but it is out of my
scope to tell(CC'ed patch owners):

core: Make VmAnalyzer to treat a migrated Paused VM as success -
https://gerrit.ovirt.org/78305

fix custom fencing default config setting
https://gerrit.ovirt.org/78720

Shot in the wild - Could it be that the 'CPUOverload' filter was not
active before for some reason?

Also, there are some exceptions in host0 vdsm log[1], failing to get
VM stats, though I can't tell if they are specific to this failure.

Of course this is not a complete analysis, I hope it helps.


[1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log


Nadav.





On Tue, Jul 4, 2017 at 12:46 PM, Eyal Edri <eedri@redhat.com> wrote:
>
>
> On Tue, Jul 4, 2017 at 12:18 PM, Michal Skrivanek
> <michal.skrivanek@redhat.com> wrote:
>>
>>
>> On 3 Jul 2017, at 15:35, Shlomo Ben David <sbendavi@redhat.com> wrote:
>>
>> Hi,
>>
>> Test failed: [ 006_migrations.migrate_vm ]
>> Link to suspected patches: N/A
>> Link to Job:
>> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/
>> Link to all logs:
>> Error snippet from the log:
>> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/
>>
>> <error>
>>
>>  "Fault reason is "Operation Failed". Fault detail is "[Cannot migrate VM.
>> There is no host that satisfies current scheduling constraints. See below
>> for details:, The host lago-basic-suite-master-host0 did not satisfy
>> internal filter CPUOverloaded because its CPU is too loaded.]"
>>
>> </error>
>>
>> <engine log>
>>
>> 2017-07-02 16:43:22,829-04 INFO
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock Acquired to object
>> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
>> sharedLocks=''}'
>> 2017-07-02 16:43:22,833-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
>> procedure. Call string is [{call getdiskvmelementspluggedtovm(?)}]
>> 2017-07-02 16:43:22,833-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
>> procedure [GetDiskVmElementsPluggedToVm] compiled
>> 2017-07-02 16:43:22,843-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
>> procedure. Call string is [{call getattacheddisksnapshotstovm(?, ?)}]
>> 2017-07-02 16:43:22,843-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
>> procedure [GetAttachedDiskSnapshotsToVm] compiled
>> 2017-07-02 16:43:22,919-04 INFO
>> [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Candidate host
>> 'lago-basic-suite-master-host0' ('46bdc63d-98f5-4eee-81aa-2fb88b8f7cbe') was
>> filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'CPUOverloaded'
>> (correlation id: null)
>> 2017-07-02 16:43:22,920-04 WARN
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Validation of action
>> 'MigrateVmToServer' failed for user admin@internal-authz. Reasons:
>> VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
>> lago-basic-suite-master-host0,$filterName
>> CPUOverloaded,VAR__DETAIL__CPU_OVERLOADED,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
>>
>>
>>
>> This has nothing to do with migration
>> The CPUOverload is a scheduling policy, unless there was any change in
>> that area the obvious explanation would be that the host has a CPU overload
>> condition.
>> I briefly looked at logs and see ""cpuUser": "83.40", "cpuSys": "16.59",
>> "cpuIdle": “0.08”” which indeed suggests an overload, from the same sample I
>> can see it’s vdsm ("cpuUserVdsmd": “77.38”, cpuSysVdsmd": “18.44"
>>
>> Since similar values are consistently being reported for some time, and
>> there is a setupNetworks and storage rescan prior to the the failure, and
>> there is no other indication of anything wrong, I’d just say the environment
>> or the order of tests or timing has changed, but nothing wrong with the
>> oVirt code
>> Did any of that changed recently? Does it reproduce locally?
>
>
> AFAIK, no significant environment changes or tests were done.
> We will try to reproduce it locally and also on the manual job,  but from
> what it looks it is very consistent (unlike other race failures we've seen
> lately ) and continues to fails on the same tests, so its either a change in
> oVirt or something else that we're not thinking on.
>
>>
>>
>> Thanks,
>> michal
>>
>> 2017-07-02 16:43:22,920-04 INFO
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock freed to object
>> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
>> sharedLocks=''}'
>> 2017-07-02 16:43:22,929-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler7) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.ColdRebootAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775733
>> as there is no unfired trigger.
>> 2017-07-02 16:43:22,932-04 ERROR
>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>> task-27) [] Operation Failed: [Cannot migrate VM. There is no host that
>> satisfies current scheduling constraints. See below for details:, The host
>> lago-basic-suite-master-host0 did not satisfy internal filter CPUOverloaded
>> because its CPU is too loaded.]
>> 2017-07-02 16:43:23,331-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler2) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.HaAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775793
>> as there is no unfired trigger.
>> 2017-07-02 16:43:23,332-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler2) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods#-9223372036854775783
>> as there is no unfired trigger.
>>
>> <engine log>
>>
>>
>>
>> Best Regards,
>>
>> Shlomi Ben-David | Software Engineer | Red Hat ISRAEL
>> RHCSA | RHCVA | RHCE
>> IRC: shlomibendavid (on #rhev-integ, #rhev-dev, #rhev-ci)
>>
>> OPEN SOURCE - 1 4 011 && 011 4 1
>>
>> _______________________________________________
>> Devel mailing list
>> Devel@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
>>
>>
>>
>> _______________________________________________
>> Devel mailing list
>> Devel@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
>
>
>
>
> --
>
> Eyal edri
>
>
> ASSOCIATE MANAGER
>
> RHV DevOps
>
> EMEA VIRTUALIZATION R&D
>
>
> Red Hat EMEA
>
> TRIED. TESTED. TRUSTED.
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>
> _______________________________________________
> Devel mailing list
> Devel@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel



--

Eyal edri


ASSOCIATE MANAGER

RHV DevOps

EMEA VIRTUALIZATION R&D


Red Hat EMEA

TRIED. TESTED. TRUSTED.
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)