<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 11, 2017 at 12:42 PM, Francesco Romani <span dir="ltr"><<a href="mailto:fromani@redhat.com" target="_blank">fromani@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span class="">
<p><br>
</p>
<br>
<div class="m_8059215392567603721moz-cite-prefix">On 01/11/2017 09:29 AM, Yaniv Kaul
wrote:<br>
</div>
</span><blockquote type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote"><span class="">On Wed, Jan 11, 2017 at 10:26 AM,
Francesco Romani <span dir="ltr"><<a href="mailto:fromani@redhat.com" target="_blank">fromani@redhat.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Hi all<br>
</p>
<span class="m_8059215392567603721gmail-"> <br>
<div class="m_8059215392567603721gmail-m_-5785517619226444427moz-cite-prefix">On
01/11/2017 08:52 AM, Eyal Edri wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Adding Tomas from Virt.</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Tue, Jan 10, 2017 at
10:54 AM, Piotr Kliczewski <span dir="ltr"><<a href="mailto:piotr.kliczewski@gmail.com" target="_blank">piotr.kliczewski@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span>On
Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky
<<a href="mailto:dbelenky@redhat.com" target="_blank">dbelenky@redhat.com</a>>
wrote:<br>
> Hi all,<br>
><br>
> test-repo_ovirt_experimental_m<wbr>aster
(link to Jenkins) job failed on<br>
> basic_sanity scenario.<br>
> The job was triggered by <a href="https://gerrit.ovirt.org/#/c/69845/" rel="noreferrer" target="_blank">https://gerrit.ovirt.org/#/c/6<wbr>9845/</a><br>
><br>
> From looking at the logs, it seems that
the reason is VDSM.<br>
><br>
> In the VDSM log, i see the following
error:<br>
><br>
> 2017-01-09 16:47:41,331 ERROR (JsonRpc
(StompReactor)) [vds.dispatcher] SSL<br>
> error receiving from
<yajsonrpc.betterAsyncore.Disp<wbr>atcher
connected ('::1',<br>
> 34942, 0, 0) at 0x36b95f0>:
unexpected eof (betterAsyncore:119)<br>
</span></blockquote>
</div>
</div>
</blockquote>
<br>
</span> Daniel, could you please remind me the jenkins
link? I see something suspicious on the Vdsm log.<br>
</div>
</blockquote>
>>
>> Please use my live system:
>>
>>     ssh mini@ykaul-mini.tlv.redhat.com (redacted)
>>
>> then run a console to the VM:
>>
>>     lagocli --prefix-path /dev/shm/run/current shell engine
>>
>> (or 'host0' for the host)
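>>
>> To watch the Vdsm side while it happens, something along these lines should
>> work from the same box (a rough sketch; the log path is just Vdsm's default
>> location):
>>
>>     lagocli --prefix-path /dev/shm/run/current shell host0
>>     tail -f /var/log/vdsm/vdsm.log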
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> Most notably, Vdsm received
SIGTERM. Is this expected and part of the test?</div>
</blockquote>
<div><br>
</div>
<div>It's not.</div>
</span></div>
</div>
</div>
</blockquote>
>
> I fooled myself. We have two hosts here. host0 was holding the VM up until
> the suspend. Then Engine decided to resume the VM on the other one, host1.
> While the VM was resuming, host0 experienced network issues, which led to
> soft-fencing. That explains the mess on host0, even though it is still
> unclear why host0 had network issues and exceeded the heartbeat in the
> first place.
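>
> A quick way to hunt for those connectivity errors is something like the
> following (the log paths are the defaults; the grep patterns are only a
> guess at the relevant messages):
>
>     # on host0
>     grep -iE 'heartbeat|unexpected eof|connection reset' /var/log/vdsm/vdsm.log
>     # on the engine VM
>     grep -i 'heartbeat exceeded' /var/log/ovirt-engine/engine.log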
>
> On host1 the waters are even darker. The VM was resumed at ~02:36:
>
> 2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] FINISH, CreateVmVDSCommand, return: RestoringState, log id: 378da701
> 2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.bll.RunVmCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
> 2017-01-11 02:36:04,704-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Correlation ID: 72c41f12-649b-4833-8485-44e8d20d2b49, Job ID: a93b571e-aed1-40e7-8d71-831f646255fb, Call Stack: null, Custom Event ID: -1, Message: VM vm0 was started by admin@internal-authz (Host: lago-basic-suite-master-host1).
>
> While well within the timeout limit (600s), the VM was still restoring its
> state:

This is an issue I (wrongfully, privately) complained about to Michal: resume
now takes far longer than it used to, and the 'lzop' process decompressing
the memory is taking ages to complete.
This is reproducible, regardless of the other issues.

I did not open a bug about it :(
Y.
>
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [5582058d] START, GetAllVmStatsVDSCommand(HostName = lago-basic-suite-master-host1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='0336661b-721f-4c55-9327-8fd2fd3a0542'}), log id: 1803c51b
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (DefaultQuartzScheduler8) [5582058d] SEND
> destination:jms.topic.vdsm_requests
> reply-to:jms.topic.vdsm_responses
> content-length:103
>
> {"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"f2997c1d-49cc-4d2e-937c-e4910fbb75df"}^@
> 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (DefaultQuartzScheduler8) [5582058d] Message sent: SEND
> destination:jms.topic.vdsm_requests
> content-length:103
> reply-to:jms.topic.vdsm_responses
>
> <JsonRpcRequest id: "f2997c1d-49cc-4d2e-937c-e4910fbb75df", method: Host.getAllVmStats, params: {}>
> 2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (SSL Stomp Reactor) [5e453618] MESSAGE
> content-length:829
> destination:jms.topic.vdsm_responses
> content-type:application/json
> subscription:fe930de2-aa67-4bc4-a34c-be22edd1623e
>
> {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}^@
> 2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}
>
> Then there is a migration attempt, which fails:
>
> 2017-01-11 02:46:07,826-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock Acquired to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
> 2017-01-11 02:46:07,827-05 WARN [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Validation of action 'MigrateVmToServer' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__HOST_STATUS__UP,ACTION_TYPE_FAILED_VDS_STATUS_ILLEGAL
> 2017-01-11 02:46:07,827-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
>
> which makes sense, since host0 was being reinstalled (!?) in the meantime.
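>
> To double-check the host statuses at that point, the engine REST API can be
> queried; a rough sketch, with the engine FQDN and password to be filled in:
>
>     curl -sk -u 'admin@internal:PASSWORD' \
>         'https://ENGINE_FQDN/ovirt-engine/api/hosts' | grep -E '<name>|<status>'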
>
> After that I see no more evidence of 'vm0' in the logs. I couldn't yet
> cross-correlate with the tests; perhaps the test failed in the meantime?
>
> TL;DR: we need a reproducer to keep investigating this. So far I've seen
> quite a bit of suspicious activity, but nothing really concrete.
>
> PS: the last few Lago failures I've seen have been triggered by completely
> unrelated patches, so I think they are sporadic failures.
>
> --
> Francesco Romani
> Red Hat Engineering Virtualization R & D
> IRC: fromani