<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 11, 2017 at 12:42 PM, Francesco Romani <span dir="ltr"><<a href="mailto:fromani@redhat.com" target="_blank">fromani@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span class="">
<p><br>
</p>
<br>
<div class="m_8059215392567603721moz-cite-prefix">On 01/11/2017 09:29 AM, Yaniv Kaul
wrote:<br>
</div>
</span><blockquote type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote"><span class="">On Wed, Jan 11, 2017 at 10:26 AM,
Francesco Romani <span dir="ltr"><<a href="mailto:fromani@redhat.com" target="_blank">fromani@redhat.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Hi all<br>
</p>
<span class="m_8059215392567603721gmail-"> <br>
<div class="m_8059215392567603721gmail-m_-5785517619226444427moz-cite-prefix">On
01/11/2017 08:52 AM, Eyal Edri wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Adding Tomas from Virt.</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Tue, Jan 10, 2017 at
10:54 AM, Piotr Kliczewski <span dir="ltr"><<a href="mailto:piotr.kliczewski@gmail.com" target="_blank">piotr.kliczewski@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span>On
Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky
<<a href="mailto:dbelenky@redhat.com" target="_blank">dbelenky@redhat.com</a>>
wrote:<br>
> Hi all,<br>
><br>
> test-repo_ovirt_experimental_m<wbr>aster
(link to Jenkins) job failed on<br>
> basic_sanity scenario.<br>
> The job was triggered by <a href="https://gerrit.ovirt.org/#/c/69845/" rel="noreferrer" target="_blank">https://gerrit.ovirt.org/#/c/6<wbr>9845/</a><br>
><br>
> From looking at the logs, it seems that
the reason is VDSM.<br>
><br>
> In the VDSM log, i see the following
error:<br>
><br>
> 2017-01-09 16:47:41,331 ERROR (JsonRpc
(StompReactor)) [vds.dispatcher] SSL<br>
> error receiving from
<yajsonrpc.betterAsyncore.Disp<wbr>atcher
connected ('::1',<br>
> 34942, 0, 0) at 0x36b95f0>:
unexpected eof (betterAsyncore:119)<br>
</span></blockquote>
</div>
</div>
</blockquote>
<br>
</span> Daniel, could you please remind me the jenkins
link? I see something suspicious on the Vdsm log.<br>
</div>
</blockquote>
>>
>> Please use my live system:
>>
>>     ssh mini@ykaul-mini.tlv.redhat.com (redacted)
>>
>> then run a console to the VM:
>>
>>     lagocli --prefix-path /dev/shm/run/current shell engine
>>
>> (or 'host0' for the host)
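>>
>> To watch the Vdsm side while it happens, something along these lines should
>> work from the same box (a rough sketch; the log path is just Vdsm's default
>> location):
>>
>>     lagocli --prefix-path /dev/shm/run/current shell host0
>>     tail -f /var/log/vdsm/vdsm.log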
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> Most notably, Vdsm received
SIGTERM. Is this expected and part of the test?</div>
</blockquote>
<div><br>
</div>
<div>It's not.</div>
</span></div>
</div>
</div>
</blockquote>
>
> I fooled myself. We have two hosts here. host0 was holding the VM up until
> the suspend. Then Engine decided to resume the VM on the other one, host1.
> While the VM was resuming, host0 experienced network issues, which led to
> soft-fencing. That explains the mess on host0, even though it is still
> unclear why host0 had network issues and exceeded the heartbeat in the
> first place.
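>
> A quick way to hunt for those connectivity errors is something like the
> following (the log paths are the defaults; the grep patterns are only a
> guess at the relevant messages):
>
>     # on host0
>     grep -iE 'heartbeat|unexpected eof|connection reset' /var/log/vdsm/vdsm.log
>     # on the engine VM
>     grep -i 'heartbeat exceeded' /var/log/ovirt-engine/engine.log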
>
> On host1 the waters are even darker. The VM was resumed at ~02:36:
>
> 2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] FINISH, CreateVmVDSCommand, return: RestoringState, log id: 378da701
> 2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.bll.RunVmCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
> 2017-01-11 02:36:04,704-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Correlation ID: 72c41f12-649b-4833-8485-44e8d20d2b49, Job ID: a93b571e-aed1-40e7-8d71-831f646255fb, Call Stack: null, Custom Event ID: -1, Message: VM vm0 was started by admin@internal-authz (Host: lago-basic-suite-master-host1).
>
> While well within the timeout limit (600s), the VM was still restoring its
> state:

This is an issue I (wrongfully, privately) complained about to Michal: resume
now takes far longer than it used to, and the 'lzop' process decompressing
the memory is taking ages to complete.
This is reproducible, regardless of the other issues.

I did not open a bug about it :(
Y.
>
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [5582058d] START, GetAllVmStatsVDSCommand(HostName = lago-basic-suite-master-host1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='0336661b-721f-4c55-9327-8fd2fd3a0542'}), log id: 1803c51b
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (DefaultQuartzScheduler8) [5582058d] SEND
> destination:jms.topic.vdsm_requests
> reply-to:jms.topic.vdsm_responses
> content-length:103
>
> {"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"f2997c1d-49cc-4d2e-937c-e4910fbb75df"}^@
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (DefaultQuartzScheduler8) [5582058d] Message sent: SEND
> destination:jms.topic.vdsm_requests
> content-length:103
> reply-to:jms.topic.vdsm_responses
>
> <JsonRpcRequest id: "f2997c1d-49cc-4d2e-937c-e4910fbb75df", method: Host.getAllVmStats, params: {}>
> 2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (SSL Stomp Reactor) [5e453618] MESSAGE
> content-length:829
> destination:jms.topic.vdsm_responses
> content-type:application/json
> subscription:fe930de2-aa67-4bc4-a34c-be22edd1623e
>
> {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}^@
> 2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}
>
> Then there is a migration attempt, which fails:
>
> 2017-01-11 02:46:07,826-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock Acquired to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
> 2017-01-11 02:46:07,827-05 WARN [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Validation of action 'MigrateVmToServer' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__HOST_STATUS__UP,ACTION_TYPE_FAILED_VDS_STATUS_ILLEGAL
> 2017-01-11 02:46:07,827-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
>
> which makes sense, since host0 was being reinstalled (!?) in the meantime.
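>
> To double-check the host statuses at that point, the engine REST API can be
> queried; a rough sketch, with the engine FQDN and password to be filled in:
>
>     curl -sk -u 'admin@internal:PASSWORD' \
>         'https://ENGINE_FQDN/ovirt-engine/api/hosts' | grep -E '<name>|<status>'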
>
> After that I see no more evidence of 'vm0' in the logs. I couldn't yet
> cross-correlate with the tests; perhaps the test failed in the meantime?
>
> TL;DR: we need a reproducer to keep investigating this. So far I've seen
> quite a bit of suspicious activity, but nothing really concrete.
>
> PS: the last few Lago failures I've seen have been triggered by completely
> unrelated patches, so I think they are sporadic failures.
>
> --
> Francesco Romani
> Red Hat Engineering Virtualization R & D
> IRC: fromani