
Hi, sorry for posting late, I had a brief look at this yesterday: 1. I couldn't replicate it locally - which means it is most likely a recent change. 2. I looked at the libvirt XMLs Lago generatd for the hosts, as a new version is used this week(0.40) - and they seem OK - specifically memroy and vcpus(which was my initial suspect). 3. I saw two Engine patches, a bit prior to the time it started to fail, which *might* in my common sense be related, but it is out of my scope to tell(CC'ed patch owners): core: Make VmAnalyzer to treat a migrated Paused VM as success - https://gerrit.ovirt.org/78305 fix custom fencing default config setting https://gerrit.ovirt.org/78720 Shot in the wild - Could it be that the 'CPUOverload' filter was not active before for some reason? Also, there are some exceptions in host0 vdsm log[1], failing to get VM stats, though I can't tell if they are specific to this failure. Of course this is not a complete analysis, I hope it helps. [1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifa... Nadav. On Tue, Jul 4, 2017 at 12:46 PM, Eyal Edri <eedri@redhat.com> wrote:
On Tue, Jul 4, 2017 at 12:18 PM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 3 Jul 2017, at 15:35, Shlomo Ben David <sbendavi@redhat.com> wrote:
Hi,
Test failed: [ 006_migrations.migrate_vm ] Link to suspected patches: N/A Link to Job: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/ Link to all logs: Error snippet from the log: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifa...
<error>
"Fault reason is "Operation Failed". Fault detail is "[Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:, The host lago-basic-suite-master-host0 did not satisfy internal filter CPUOverloaded because its CPU is too loaded.]"
</error>
<engine log>
2017-07-02 16:43:22,829-04 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock Acquired to object 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]', sharedLocks=''}' 2017-07-02 16:43:22,833-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored procedure. Call string is [{call getdiskvmelementspluggedtovm(?)}] 2017-07-02 16:43:22,833-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for procedure [GetDiskVmElementsPluggedToVm] compiled 2017-07-02 16:43:22,843-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored procedure. Call string is [{call getattacheddisksnapshotstovm(?, ?)}] 2017-07-02 16:43:22,843-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for procedure [GetAttachedDiskSnapshotsToVm] compiled 2017-07-02 16:43:22,919-04 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Candidate host 'lago-basic-suite-master-host0' ('46bdc63d-98f5-4eee-81aa-2fb88b8f7cbe') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'CPUOverloaded' (correlation id: null) 2017-07-02 16:43:22,920-04 WARN [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Validation of action 'MigrateVmToServer' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName lago-basic-suite-master-host0,$filterName CPUOverloaded,VAR__DETAIL__CPU_OVERLOADED,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
This has nothing to do with migration The CPUOverload is a scheduling policy, unless there was any change in that area the obvious explanation would be that the host has a CPU overload condition. I briefly looked at logs and see ""cpuUser": "83.40", "cpuSys": "16.59", "cpuIdle": “0.08”” which indeed suggests an overload, from the same sample I can see it’s vdsm ("cpuUserVdsmd": “77.38”, cpuSysVdsmd": “18.44"
Since similar values are consistently being reported for some time, and there is a setupNetworks and storage rescan prior to the the failure, and there is no other indication of anything wrong, I’d just say the environment or the order of tests or timing has changed, but nothing wrong with the oVirt code Did any of that changed recently? Does it reproduce locally?
AFAIK, no significant environment changes or tests were done. We will try to reproduce it locally and also on the manual job, but from what it looks it is very consistent (unlike other race failures we've seen lately ) and continues to fails on the same tests, so its either a change in oVirt or something else that we're not thinking on.
Thanks, michal
2017-07-02 16:43:22,920-04 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock freed to object 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]', sharedLocks=''}' 2017-07-02 16:43:22,929-04 DEBUG [org.ovirt.engine.core.utils.timer.FixedDelayJobListener] (DefaultQuartzScheduler7) [] Rescheduling DEFAULT.org.ovirt.engine.core.bll.ColdRebootAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775733 as there is no unfired trigger. 2017-07-02 16:43:22,932-04 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-27) [] Operation Failed: [Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:, The host lago-basic-suite-master-host0 did not satisfy internal filter CPUOverloaded because its CPU is too loaded.] 2017-07-02 16:43:23,331-04 DEBUG [org.ovirt.engine.core.utils.timer.FixedDelayJobListener] (DefaultQuartzScheduler2) [] Rescheduling DEFAULT.org.ovirt.engine.core.bll.HaAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775793 as there is no unfired trigger. 2017-07-02 16:43:23,332-04 DEBUG [org.ovirt.engine.core.utils.timer.FixedDelayJobListener] (DefaultQuartzScheduler2) [] Rescheduling DEFAULT.org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods#-9223372036854775783 as there is no unfired trigger.
<engine log>
Best Regards,
Shlomi Ben-David | Software Engineer | Red Hat ISRAEL RHCSA | RHCVA | RHCE IRC: shlomibendavid (on #rhev-integ, #rhev-dev, #rhev-ci)
OPEN SOURCE - 1 4 011 && 011 4 1
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
--
Eyal edri
ASSOCIATE MANAGER
RHV DevOps
EMEA VIRTUALIZATION R&D
Red Hat EMEA
TRIED. TESTED. TRUSTED. phone: +972-9-7692018 irc: eedri (on #tlv #rhev-dev #rhev-integ)
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel