
On Wed, Jan 11, 2017 at 12:42 PM, Francesco Romani <fromani@redhat.com> wrote:
On 01/11/2017 09:29 AM, Yaniv Kaul wrote:
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the cause is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me of the Jenkins link? I see something suspicious in the Vdsm log.
Please use my live system: ssh mini@ykaul-mini.tlv.redhat.com (redacted), then open a console to the VM: lagocli --prefix-path /dev/shm/run/current shell engine
(or 'host0' for the host)
Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
It's not.
I fooled myself. We have two hosts here. host0 was holding the VM up until the suspend. Then Engine decided to resume the VM on the other one, host1. While the VM was resuming, host0 experienced network issues, which led to soft-fencing. That explains the mess on host0, even though it is still unclear why host0 had network issues and exceeded the heartbeat in the first place.
On host1 the waters are even darker. The VM was resumed at ~02:36:
2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] FINISH, CreateVmVDSCommand, return: RestoringState, log id: 378da701
2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.bll.RunVmCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-01-11 02:36:04,704-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Correlation ID: 72c41f12-649b-4833-8485-44e8d20d2b49, Job ID: a93b571e-aed1-40e7-8d71-831f646255fb, Call Stack: null, Custom Event ID: -1, Message: VM vm0 was started by admin@internal-authz (Host: lago-basic-suite-master-host1).
While well within the timeout limit (600s), the VM was still restoring its state:
This is an issue I (wrongly, only privately) complained about to Michal - resume takes far longer than it used to - the 'lzop' process decompressing the memory takes ages to complete. This is reproducible, regardless of the other issues. I did not open a bug about it :( Y.
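To put a number on the lzop slowness, one option is to time the decompression of the saved memory image by itself on the host; a minimal sketch (the memory volume path below is hypothetical, it depends on the storage layout of the lago run):

    import os
    import subprocess
    import time

    # Hypothetical path to the lzop-compressed memory image of the suspended VM;
    # the real volume lives on the storage domain and depends on the run.
    MEMORY_DUMP = "/path/to/memory/volume"

    start = time.time()
    with open(os.devnull, "wb") as devnull:
        # Decompress to /dev/null purely to measure lzop throughput; nothing is restored.
        subprocess.check_call(["lzop", "-dc", MEMORY_DUMP], stdout=devnull)
    print("lzop decompression took %.1f seconds" % (time.time() - start))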
2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [5582058d] START, GetAllVmStatsVDSCommand(HostName = lago-basic-suite-master-host1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='0336661b-721f-4c55-9327-8fd2fd3a0542'}), log id: 1803c51b
2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (DefaultQuartzScheduler8) [5582058d] SEND destination:jms.topic.vdsm_requests reply-to:jms.topic.vdsm_responses content-length:103
{"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"f2997c1d-49cc-4d2e-937c-e4910fbb75df"}^@
2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (DefaultQuartzScheduler8) [5582058d] Message sent: SEND destination:jms.topic.vdsm_requests content-length:103 reply-to:jms.topic.vdsm_responses
<JsonRpcRequest id: "f2997c1d-49cc-4d2e-937c-e4910fbb75df", method: Host.getAllVmStats, params: {}>
2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (SSL Stomp Reactor) [5e453618] MESSAGE content-length:829 destination:jms.topic.vdsm_responses content-type:application/json subscription:fe930de2-aa67-4bc4-a34c-be22edd1623e
{"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}^@
2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}
Then comes a migration attempt, which fails:
2017-01-11 02:46:07,826-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock Acquired to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
2017-01-11 02:46:07,827-05 WARN [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Validation of action 'MigrateVmToServer' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__HOST_STATUS__UP,ACTION_TYPE_FAILED_VDS_STATUS_ILLEGAL
2017-01-11 02:46:07,827-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
which makes sense, since host0 was being reinstalled (!?) in the meantime.
After that I see no more evidence of 'vm0' in the logs. I couldn't yet cross-correlate with the tests; perhaps the test failed in the meantime?
TL;DR: we need a reproducer to keep investigating this; so far I've seen quite a lot of suspicious activity but nothing really concrete.
PS: the last few lago failures I've seen have been triggered by completely unrelated patches, so far I think they are sporadic failures.
--
Francesco Romani
Red Hat Engineering Virtualization R & D
IRC: fromani