On 01/11/2017 09:29 AM, Yaniv Kaul wrote:
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
> Adding Tomas from Virt.
>
> On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski
> <piotr.kliczewski@gmail.com> wrote:
>
> On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky
> <dbelenky@redhat.com> wrote:
> > Hi all,
> >
> > test-repo_ovirt_experimental_master (link to Jenkins) job failed on
> > basic_sanity scenario.
> > The job was triggered by https://gerrit.ovirt.org/#/c/69845/
> >
> > From looking at the logs, it seems that the reason is VDSM.
> >
> > In the VDSM log, I see the following error:
> >
> > 2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL
> > error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1',
> > 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
>
Daniel, could you please remind me the Jenkins link? I see
something suspicious in the Vdsm log.
Please use my live system:
ssh mini@ykaul-mini.tlv.redhat.com (redacted)
then run a console to the VM:
lagocli --prefix-path /dev/shm/run/current shell engine
(or 'host0' for the host)
Most notably, Vdsm received SIGTERM. Is this expected and part of
the test?
It's not.
I fooled myself. We have two hosts here. host0 was holding the VM up
until the suspend. Then Engine decided to resume the VM on the other
one, host1.
While the VM was resuming, host0 experienced network issues, which led
to soft-fencing. That explains the mess on host0, even though it is still
unclear why host0 had network issues and exceeded the heartbeat in the first
place.
On host1 the waters are even darker. The VM was resumed at ~02:36:
2017-01-11 02:36:04,700-05 INFO
[org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (default task-17)
[72c41f12-649b-4833-8485-44e8d20d2b49] FINISH, CreateVmVDSCommand,
return: RestoringState, log id: 378da701
2017-01-11 02:36:04,700-05 INFO
[org.ovirt.engine.core.bll.RunVmCommand] (default task-17)
[72c41f12-649b-4833-8485-44e8d20d2b49] Lock freed to object
'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM,
ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-01-11 02:36:04,704-05 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Correlation ID:
72c41f12-649b-4833-8485-44e8d20d2b49, Job ID: a93b571e-aed1-40e7-8d71-831f646255fb,
Call Stack: null, Custom Event ID: -1, Message: VM vm0 was started by
admin@internal-authz (Host: lago-basic-suite-master-host1).
While well within the timeout limit (600s), the VM was still restoring
its state:
2017-01-11 02:37:31,059-05 DEBUG
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand]
(DefaultQuartzScheduler8) [5582058d] START,
GetAllVmStatsVDSCommand(HostName = lago-basic-suite-master-host1,
VdsIdVDSCommandParametersBase:{runAsync='true',
hostId='0336661b-721f-4c55-9327-8fd2fd3a0542'}), log id: 1803c51b
2017-01-11 02:37:31,059-05 DEBUG
[org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message]
(DefaultQuartzScheduler8) [5582058d] SEND
destination:jms.topic.vdsm_requests
reply-to:jms.topic.vdsm_responses
content-length:103
{"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"f2997c1d-49cc-4d2e-937c-e4910fbb75df"}^@
2017-01-11 02:37:31,059-05 DEBUG
[org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient]
(DefaultQuartzScheduler8) [5582058d] Message sent: SEND
destination:jms.topic.vdsm_requests
content-length:103
reply-to:jms.topic.vdsm_responses
<JsonRpcRequest id: "f2997c1d-49cc-4d2e-937c-e4910fbb75df", method:
Host.getAllVmStats, params: {}>
2017-01-11 02:37:31,062-05 DEBUG
[org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (SSL Stomp
Reactor) [5e453618] MESSAGE
content-length:829
destination:jms.topic.vdsm_responses
content-type:application/json
subscription:fe930de2-aa67-4bc4-a34c-be22edd1623e
{"jsonrpc": "2.0", "id":
"f2997c1d-49cc-4d2e-937c-e4910fbb75df",
"result": [{"username": "Unknown", "displayInfo":
[{"tlsPort": "-1",
"ipAddress": "192.168.201.4", "type": "vnc",
"port": "-1"}, {"tlsPort":
"-1", "ipAddress": "192.168.201.4", "type":
"spice", "port": "-1"}],
"hash": "-4942054084956770103", "acpiEnable":
"true", "displayIp":
"192.168.201.4", "guestFQDN": "", "vmId":
"42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0",
"timeOffset": "0",
"session": "Unknown", "displaySecurePort": "-1",
"displayPort": "-1",
"memUsage": "0", "guestIPs": "",
"kvmEnable": "true", "displayType":
"vnc", "cpuUser": "0.00", "monitorResponse":
"0", "elapsedTime": "389",
"vmType": "kvm", "cpuSys": "0.00",
"status": "Restoring state",
"guestCPUCount": -1, "appsList": [], "clientIp":
"", "statusTime":
"4295562220", "vmName": "vm0"}]}^@
2017-01-11 02:37:31,062-05 DEBUG
[org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker)
[] Message received:
{"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result":
[{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4",
"type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4",
"type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true",
"displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083",
"pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1",
"displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true",
"displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389",
"vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1,
"appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}
Then there is a migration attempt, which fails validation:
2017-01-11 02:46:07,826-05 INFO
[org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4)
[27938262-f5fe-41ff-a662-5177a3a25ada] Lock Acquired to object
'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM,
ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]',
sharedLocks='null'}'
2017-01-11 02:46:07,827-05 WARN
[org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4)
[27938262-f5fe-41ff-a662-5177a3a25ada] Validation of action
'MigrateVmToServer' failed for user admin@internal-authz. Reasons:
VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__HOST_STATUS__UP,ACTION_TYPE_FAILED_VDS_STATUS_ILLEGAL
2017-01-11 02:46:07,827-05 INFO
[org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4)
[27938262-f5fe-41ff-a662-5177a3a25ada] Lock freed to object
'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM,
ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]',
sharedLocks='null'}'
which makes sense, since host0 was being reinstalled (!?) in the meantime.
After that I see no more evidence of 'vm0' in the logs. I couldn't yet
cross-correlate this with the tests; perhaps the test failed in the meantime?
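
If it helps the next person to hit this, below is the kind of throwaway
script I'd use to cross-correlate engine.log and vdsm.log for the VM id.
It is a hypothetical helper, the function names and invocation are mine,
and it assumes both logs use the same timezone and timestamp format:

import re
import sys

# Matches the leading "YYYY-MM-DD HH:MM:SS,mmm" timestamp on each log line
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

def matching_lines(path, needle, tag):
    # Yield (timestamp, source tag, line) for every line mentioning the needle
    with open(path, errors="replace") as log:
        for line in log:
            if needle in line:
                stamp = TS.match(line)
                if stamp:
                    yield stamp.group(1), tag, line.rstrip()

def correlate(engine_log, vdsm_log, needle):
    # Interleave matching lines from both logs in timestamp order
    events = list(matching_lines(engine_log, needle, "engine"))
    events += list(matching_lines(vdsm_log, needle, "vdsm"))
    for stamp, tag, line in sorted(events):
        print("%s [%s] %s" % (stamp, tag, line))

if __name__ == "__main__":
    # e.g.: python correlate.py engine.log vdsm.log 42860011-acc3-44d6-9ddf-dea642caf083
    correlate(sys.argv[1], sys.argv[2], sys.argv[3])
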
TL;DR: we need a reproducer to keep investigating this; so far I've seen
quite a bit of suspicious activity but nothing really concrete.
PS: the last few Lago failures I've seen have been triggered by completely
unrelated patches, so far I think they are sporadic failures.
--
Francesco Romani
Red Hat Engineering Virtualization R & D
IRC: fromani