On 4 Jul 2017, at 13:00, Eyal Edri <eedri@redhat.com> wrote:

I was able to reproduce the error [1] on a manual run with only the new vdsm from [2], and also to verify that without this change, using the latest tested run [3], it works. So I think this proves quite clearly that the problem is in one of the latest VDSM patches. There is only a single patch between vdsms [1] and [3].

I'm running the test again with the suspected bad VDSM and hopefully will be able to extract the env to a tar.gz file, which anyone can import using the lago demo tool.

On Tue, Jul 4, 2017 at 1:30 PM, Nadav Goldin <ngoldin@redhat.com> wrote:

Hi, sorry for posting late, I had a brief look at this yesterday:
1. I couldn't replicate it locally - which means it is most likely a
recent change.
2. I looked at the libvirt XMLs Lago generated for the hosts, as a new
version (0.40) is used this week - and they seem OK, specifically the
memory and vcpus (my initial suspects); see the small check sketched
after this list.
3. I saw two Engine patches, merged a bit before the failures started,
which *might*, as far as I can tell, be related, but it is beyond my
scope to tell (CC'ed the patch owners):
core: Make VmAnalyzer to treat a migrated Paused VM as success -
https://gerrit.ovirt.org/78305
fix custom fencing default config setting
https://gerrit.ovirt.org/78720
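
Regarding point 2, a minimal sketch of checking those values from a saved
domain XML (element names follow the standard libvirt domain schema; the
input file is whatever virsh dumpxml or Lago wrote out):

    import sys
    import xml.etree.ElementTree as ET

    # Print the memory and vCPU settings from a saved libvirt domain XML,
    # e.g. the output of "virsh dumpxml <domain>".
    root = ET.parse(sys.argv[1]).getroot()
    memory = root.find("memory")  # the 'unit' attribute defaults to KiB
    vcpu = root.find("vcpu")
    print("memory: %s %s" % (memory.text, memory.get("unit", "KiB")))
    print("vcpu: %s" % vcpu.text)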
Shot in the dark - could it be that the 'CPUOverloaded' filter was not
active before for some reason?
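
For context, the filter's decision amounts to a threshold check on the
host's recent CPU load. A rough Python sketch of that logic (the real
implementation is Java code in the engine's scheduling layer; the
threshold and window here are illustrative, not the actual policy
parameters):

    # Illustrative sketch of a CPU-overload host filter; NOT the engine's
    # actual code. A host is filtered out when its average CPU busy
    # percentage over the recent sampling window exceeds the threshold.
    HIGH_UTILIZATION = 80  # percent; stands in for the high-utilization parameter

    def cpu_overloaded(busy_samples, threshold=HIGH_UTILIZATION):
        """busy_samples: recent host CPU busy percentages (user + sys)."""
        if not busy_samples:
            return False  # no data: give the host the benefit of the doubt
        return sum(busy_samples) / len(busy_samples) > threshold

    # With the values seen in this failure (cpuUser 83.40 + cpuSys 16.59,
    # i.e. ~100% busy), any such filter would reject the host:
    print(cpu_overloaded([99.99, 99.2, 98.7]))  # True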
Also, there are some exceptions in the host0 vdsm log [1] about failing
to get VM stats, though I can't tell whether they are specific to this failure.
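
A quick way to surface those exceptions from a downloaded log is a scan
like this sketch (the patterns are guesses at the usual vdsm log wording,
not an exhaustive match):

    import re
    import sys

    # Scan a vdsm.log for tracebacks and VM-stats-related errors.
    # Usage: python scan_vdsm_log.py vdsm.log
    PATTERN = re.compile(r"Traceback|ERROR|getAllVmStats|getVmStats")

    with open(sys.argv[1]) as log:
        for lineno, line in enumerate(log, 1):
            if PATTERN.search(line):
                print("%6d: %s" % (lineno, line.rstrip()))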
Of course this is not a complete analysis, I hope it helps.
[1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log
Nadav.
On Tue, Jul 4, 2017 at 12:46 PM, Eyal Edri <eedri@redhat.com> wrote:
>
>
> On Tue, Jul 4, 2017 at 12:18 PM, Michal Skrivanek
> <michal.skrivanek@redhat.com> wrote:
>>
>>
>> On 3 Jul 2017, at 15:35, Shlomo Ben David <sbendavi@redhat.com> wrote:
>>
>> Hi,
>>
>> Test failed: [ 006_migrations.migrate_vm ]
>> Link to suspected patches: N/A
>> Link to Job:
>> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/
>> Link to all logs:
>> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/
>>
>> Error snippet from the log:
>>
>> <error>
>>
>> "Fault reason is "Operation Failed". Fault detail is "[Cannot migrate VM.
>> There is no host that satisfies current scheduling constraints. See below
>> for details:, The host lago-basic-suite-master-host0 did not satisfy
>> internal filter CPUOverloaded because its CPU is too loaded.]"
>>
>> </error>
>>
>> <engine log>
>>
>> 2017-07-02 16:43:22,829-04 INFO
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock Acquired to object
>> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
>> sharedLocks=''}'
>> 2017-07-02 16:43:22,833-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
>> procedure. Call string is [{call getdiskvmelementspluggedtovm(?)}]
>> 2017-07-02 16:43:22,833-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
>> procedure [GetDiskVmElementsPluggedToVm] compiled
>> 2017-07-02 16:43:22,843-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
>> procedure. Call string is [{call getattacheddisksnapshotstovm(?, ?)}]
>> 2017-07-02 16:43:22,843-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
>> procedure [GetAttachedDiskSnapshotsToVm] compiled
>> 2017-07-02 16:43:22,919-04 INFO
>> [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Candidate host
>> 'lago-basic-suite-master-host0' ('46bdc63d-98f5-4eee-81aa-2fb88b8f7cbe') was
>> filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'CPUOverloaded'
>> (correlation id: null)
>> 2017-07-02 16:43:22,920-04 WARN
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Validation of action
>> 'MigrateVmToServer' failed for user admin@internal-authz. Reasons:
>> VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
>> lago-basic-suite-master-host0,$filterName
>> CPUOverloaded,VAR__DETAIL__CPU_OVERLOADED,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
>>
>>
>>
>> This has nothing to do with migration.
>> CPUOverloaded is a scheduling filter; unless there was a change in that
>> area, the obvious explanation would be that the host really has a CPU
>> overload condition.
>> I briefly looked at the logs and see "cpuUser": "83.40", "cpuSys": "16.59",
>> "cpuIdle": "0.08", which indeed suggests an overload; from the same sample I
>> can see it's vdsm itself ("cpuUserVdsmd": "77.38", "cpuSysVdsmd": "18.44").
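>>
>> To make that reading concrete, here is the arithmetic on that sample as a
>> small sketch (assuming the Vdsmd fields are on the same host-wide
>> percentage scale as cpuUser/cpuSys):
>>
>>     # The sample quoted above, as percentages of total host CPU.
>>     sample = {"cpuUser": 83.40, "cpuSys": 16.59, "cpuIdle": 0.08,
>>               "cpuUserVdsmd": 77.38, "cpuSysVdsmd": 18.44}
>>
>>     busy = sample["cpuUser"] + sample["cpuSys"]             # ~99.99: host fully busy
>>     vdsmd = sample["cpuUserVdsmd"] + sample["cpuSysVdsmd"]  # ~95.82 of it is vdsmd
>>     print("host busy: %.2f%%, vdsmd share: %.2f%%" % (busy, vdsmd))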
>>
>> Since similar values have been consistently reported for some time, there
>> is a setupNetworks and a storage rescan just prior to the failure, and
>> there is no other indication of anything wrong, I'd just say the environment,
>> the order of tests, or the timing has changed, but nothing is wrong with the
>> oVirt code.
>> Did any of that change recently? Does it reproduce locally?
>
>
> AFAIK, no significant environment or test changes were made.
> We will try to reproduce it locally and also on the manual job, but from
> what it looks like it is very consistent (unlike other race failures we've
> seen lately) and continues to fail on the same tests, so it's either a change
> in oVirt or something else that we're not thinking of.
>
>>
>>
>> Thanks,
>> michal
>>
>> 2017-07-02 16:43:22,920-04 INFO
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock freed to object
>> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
>> sharedLocks=''}'
>> 2017-07-02 16:43:22,929-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler7) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.ColdRebootAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775733
>> as there is no unfired trigger.
>> 2017-07-02 16:43:22,932-04 ERROR
>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>> task-27) [] Operation Failed: [Cannot migrate VM. There is no host that
>> satisfies current scheduling constraints. See below for details:, The host
>> lago-basic-suite-master-host0 did not satisfy internal filter CPUOverloaded
>> because its CPU is too loaded.]
>> 2017-07-02 16:43:23,331-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler2) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.HaAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775793
>> as there is no unfired trigger.
>> 2017-07-02 16:43:23,332-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler2) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods#-9223372036854775783
>> as there is no unfired trigger.
>>
>> </engine log>
>>
>>
>>
>> Best Regards,
>>
>> Shlomi Ben-David | Software Engineer | Red Hat ISRAEL
>> RHCSA | RHCVA | RHCE
>> IRC: shlomibendavid (on #rhev-integ, #rhev-dev, #rhev-ci)
>>
>> OPEN SOURCE - 1 4 011 && 011 4 1
>>
>> _______________________________________________
>> Devel mailing list
>> Devel@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
>>
>
>
>
>
> --
>
> Eyal edri
>
>
> ASSOCIATE MANAGER
>
> RHV DevOps
>
> EMEA VIRTUALIZATION R&D
>
>
> Red Hat EMEA
>
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>
> _______________________________________________
> Devel mailing list
> Devel@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel
--
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)