<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Dec 11, 2017 at 6:44 PM, Milan Zamazal <span dir="ltr">&lt;<a href="mailto:mzamazal@redhat.com" target="_blank">mzamazal@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Michal Skrivanek &lt;<a href="mailto:michal.skrivanek@redhat.com">michal.skrivanek@redhat.com</a>&gt; writes:<br>
<br>
&gt; Milan,<br>
&gt;<br>
&gt; log [1], VM b3962e5c-08e3-444e-910e-ea675fa1a5c7<br>
&gt; migration away finished at 2017-12-05 06:26:24,515-0500<br>
&gt; incoming migration of the same VM back, at 2017-12-05 06:26:46,614-0500<br>
&gt;<br>
&gt; seems to me the migration away didn’t really clean up the VM properly. Milan,<br>
&gt; can you check whether the logs match? Perhaps an orphaned libvirt XML?<br>
<br>
</span>It seems that, for unknown reasons, Engine didn’t call Destroy on the<br>
source after the first migration.  So while the VM was no longer running<br>
there, it was still present in Vdsm (in Down status), and thus the<br>
subsequent create call was rejected by the API.<br>
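To illustrate the mechanism, here is a minimal sketch of the create-time guard described above. The names (`VmContainer`, `VMExists`) are hypothetical stand-ins, not the real Vdsm classes; the point is only that a VM left in Down status still occupies its vmId slot, so a second create/migrationCreate for the same vmId is refused until Destroy removes the entry:<br>

```python
class VMExists(Exception):
    """Raised when a create call targets an already-registered vmId."""


class VmContainer:
    """Hypothetical sketch of a per-host VM registry."""

    def __init__(self):
        self._vms = {}  # vmId -> status string, e.g. 'Up' or 'Down'

    def create(self, vm_id, status='Up'):
        # Even a VM in Down status blocks re-creation -- matching the
        # "Virtual machine already exists" (code 4) rejection in the log.
        if vm_id in self._vms:
            raise VMExists('Virtual machine already exists')
        self._vms[vm_id] = status

    def destroy(self, vm_id):
        # Engine's Destroy call is what frees the slot; per the analysis
        # above, it was never issued here, hence the stale Down entry.
        self._vms.pop(vm_id, None)
```

So under this sketch the incoming migration fails exactly as observed: the source migration finishes, the VM flips to Down but is never destroyed, and the next create for the same vmId raises `VMExists` until destroy runs.<br>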
<br>
There are weird messages in engine.log during the first migration such<br>
as the following ones, I don’t know whether they are harmless or not:<br>
<br>
  2017-12-05 06:26:19,799-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [] VM &#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;(vm0) was unexpectedly detected as &#39;MigratingTo&#39; on VDS &#39;56bf5bdd-8a95-4b7a-9c72-7206d6b59e38&#39;(lago-basic-suite-master-host-1) (expected on &#39;104bd4da-24bb-4368-a5ca-21a465aca70e&#39;)<br>
  2017-12-05 06:26:19,799-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [] VM &#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39; is migrating to VDS &#39;56bf5bdd-8a95-4b7a-9c72-7206d6b59e38&#39;(lago-basic-suite-master-host-1) ignoring it in the refresh until migration is done<br></blockquote><div><br></div><div>These logs are completely harmless - they only indicate that the VM&#39;s run_on_vds still points to the source host and the hand-over was not done yet.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
The migration completion event was successfully obtained by Engine:<br>
<br>
  2017-12-05 06:26:24,542-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: vm0, Source: lago-basic-suite-master-host-0, Destination: lago-basic-suite-master-host-1, Duration: 6 seconds, Total: 6 seconds, Actual downtime: 73ms)<br>
<span class=""><br>
&gt; Otherwise it would need reproduction in CI and some more logs….<br>
&gt;<br>
&gt; [1]<br>
&gt; <a href="http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log" rel="noreferrer" target="_blank">http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log</a><br>
&gt;<br>
&gt;&gt; On 5 Dec 2017, at 13:26, Dafna Ron &lt;<a href="mailto:dron@redhat.com">dron@redhat.com</a>&gt; wrote:<br>
&gt;&gt;<br>
&gt;&gt; Hi,<br>
&gt;&gt; We had a failure for test 006_migrations.migrate_vm on master.<br>
&gt;&gt;<br>
&gt;&gt; There was a libvirt disruption on the migration source, and after that Vdsm reported the migration as failed because the VM already exists, which makes me suspect a split-brain case.<br>
&gt;&gt; The patch reported has nothing to do with this issue; I think we simply stumbled on a race condition which can cause a split brain.<br>
&gt;&gt; Please note that there are several metrics related issues reported in vdsm logs as well.<br>
&gt;&gt;<br>
&gt;&gt; Link and headline of suspected patches:<br>
&gt;&gt;<br>
&gt;&gt; Not related<br>
&gt;&gt;<br>
&gt;&gt; Link to Job:<br>
&gt;&gt;<br>
&gt;&gt;<br>
</span>&gt;&gt; <a href="http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/" rel="noreferrer" target="_blank">http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/</a><br>
&gt;&gt;<br>
&gt;&gt; Link to all logs:<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; <a href="http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/" rel="noreferrer" target="_blank">http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/</a><br>
<div class="HOEnZb"><div class="h5">&gt;&gt;<br>
&gt;&gt; (Relevant) error snippet from the log:<br>
&gt;&gt; &lt;error&gt;<br>
&gt;&gt;<br>
&gt;&gt; Engine:<br>
&gt;&gt; 2017-12-05 06:26:48,546-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-385) [] EVENT_ID: VM_MIGRATION_TO_SERVER_FAILED(120), Migration failed  (VM: vm0, Source: lago-basic-suite-master-host-1, Destination: lago-basic-suite-master-host-0).<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; dst:<br>
&gt;&gt;<br>
&gt;&gt; 2017-12-05 06:26:46,615-0500 WARN  (jsonrpc/6) [vds] vm b3962e5c-08e3-444e-910e-ea675fa1a5c7 already exists (API:179)<br>
&gt;&gt; 2017-12-05 06:26:46,615-0500 ERROR (jsonrpc/6) [api] FINISH create error=Virtual machine already exists (api:124)<br>
&gt;&gt; Traceback (most recent call last):<br>
&gt;&gt;   File &quot;/usr/lib/python2.7/site-packages/vdsm/common/api.py&quot;, line 117, in method<br>
&gt;&gt;     ret = func(*args, **kwargs)<br>
&gt;&gt;   File &quot;/usr/lib/python2.7/site-packages/vdsm/API.py&quot;, line 180, in create<br>
&gt;&gt;     raise exception.VMExists()<br>
&gt;&gt; VMExists: Virtual machine already exists<br>
&gt;&gt; 2017-12-05 06:26:46,620-0500 INFO  (jsonrpc/6) [api.virt] FINISH create return={&#39;status&#39;: {&#39;message&#39;: &#39;Virtual machine already exists&#39;, &#39;code&#39;: 4}} from=::ffff:192.168.201.3,50394 (api:52)<br>
&gt;&gt; 2017-12-05 06:26:46,620-0500 INFO  (jsonrpc/6) [api.virt] FINISH migrationCreate return={&#39;status&#39;: {&#39;message&#39;: &#39;Virtual machine already exists&#39;, &#39;code&#39;: 4}} from=::ffff:192.168.201.3,50394 (api:52)<br>
&gt;&gt; 2017-12-05 06:26:46,620-0500 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.migrationCreate failed (error 4) in 0.03 seconds (__init__:573)<br>
&gt;&gt; 2017-12-05 06:26:46,624-0500 INFO  (jsonrpc/2) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:192.168.201.3,50394 (api:46)<br>
&gt;&gt; 2017-12-05 06:26:46,624-0500 INFO  (jsonrpc/2) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) Release VM resources (vm:4967)<br>
&gt;&gt; 2017-12-05 06:26:46,625-0500 WARN  (jsonrpc/2) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) trying to set state to Powering down when already Down (vm:575)<br>
&gt;&gt; 2017-12-05 06:26:46,625-0500 INFO  (jsonrpc/2) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) Stopping connection (guestagent:435)<br>
&gt;&gt; 2017-12-05 06:26:46,625-0500 INFO  (jsonrpc/2) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) _destroyVmGraceful attempt #0 (vm:5004)<br>
&gt;&gt; 2017-12-05 06:26:46,626-0500 WARN  (jsonrpc/2) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) VM &#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39; couldn&#39;t be destroyed in libvirt: Requested operation is not valid: domain is not running (vm:5025)<br>
&gt;&gt; 2017-12-05 06:26:46,627-0500 INFO  (jsonrpc/2) [vdsm.api] START teardownImage(sdUUID=&#39;952bb427-b88c-4fbe-99ef-49970d3aaf70&#39;, spUUID=&#39;9dcfeaaf-96b7-4e26-a327-5570e0e39261&#39;, imgUUID=&#39;e6eadbae-ec7a-48f4-8832-64a622a12bef&#39;, volUUID=None) from=::ffff:192.168.201.3,50394, task_id=2da93725-5533-4354-a369-751eb44f9ef2 (api:46)<br>
&gt;&gt;<br>
&gt;&gt; src:<br>
&gt;&gt;<br>
&gt;&gt; 2017-12-05 06:26:46,623-0500 ERROR (migsrc/b3962e5c) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) migration destination error: Virtual machine already exists (migration:290)<br>
&gt;&gt;<br>
&gt;&gt; disruption on dst:<br>
&gt;&gt;<br>
&gt;&gt; 2017-12-05 06:20:04,662-0500 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.shutdown succeeded in 0.00 seconds (__init__:573)<br>
&gt;&gt; 2017-12-05 06:20:04,676-0500 ERROR (Thread-1) [root] Shutdown by QEMU Guest Agent failed (vm:5097)<br>
&gt;&gt; Traceback (most recent call last):<br>
&gt;&gt;   File &quot;/usr/lib/python2.7/site-packages/vdsm/virt/vm.py&quot;, line 5088, in qemuGuestAgentShutdown<br>
&gt;&gt;     self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_GUEST_AGENT)<br>
&gt;&gt;   File &quot;/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py&quot;, line 98, in f<br>
&gt;&gt;     ret = attr(*args, **kwargs)<br>
&gt;&gt;   File &quot;/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py&quot;, line 126, in wrapper<br>
&gt;&gt;     ret = f(*args, **kwargs)<br>
&gt;&gt;   File &quot;/usr/lib/python2.7/site-packages/vdsm/utils.py&quot;, line 512, in wrapper<br>
&gt;&gt;     return func(inst, *args, **kwargs)<br>
&gt;&gt;   File &quot;/usr/lib64/python2.7/site-packages/libvirt.py&quot;, line 2403, in shutdownFlags<br>
&gt;&gt;     if ret == -1: raise libvirtError (&#39;virDomainShutdownFlags() failed&#39;, dom=self)<br>
&gt;&gt; libvirtError: Guest agent is not responding: QEMU guest agent is not connected<br>
&gt;&gt; 2017-12-05 06:20:04,697-0500 INFO  (libvirt/events) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) block threshold 1 exceeded on &#39;vda&#39; (/rhev/data-center/mnt/blockSD/952bb427-b88c-4fbe-99ef-49970d3aaf70/images/e6eadbae-ec7a-48f4-8832-64a622a12bef/1f063c73-f5cd-44e8-a84f-7810857f82df) (drivemonitor:162)<br>
&gt;&gt; 2017-12-05 06:20:04,698-0500 INFO  (libvirt/events) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) drive &#39;vda&#39; threshold exceeded (storage:872)<br>
&gt;&gt; 2017-12-05 06:20:05,889-0500 INFO  (jsonrpc/7) [api.host] START getAllVmStats() from=::1,41118 (api:46)<br>
&gt;&gt; 2017-12-05 06:20:05,891-0500 INFO  (jsonrpc/7) [api.host] FINISH getAllVmStats return={&#39;status&#39;: {&#39;message&#39;: &#39;Done&#39;, &#39;code&#39;: 0}, &#39;statsList&#39;: (suppressed)} from=::1,41118 (api:52)<br>
&gt;&gt; 2017-12-05 06:20:05,892-0500 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:573)<br>
&gt;&gt; 2017-12-05 06:20:07,466-0500 INFO  (libvirt/events) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) underlying process disconnected (vm:1024)<br>
&gt;&gt; 2017-12-05 06:20:07,469-0500 INFO  (libvirt/events) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) Release VM resources (vm:4967)<br>
&gt;&gt; 2017-12-05 06:20:07,469-0500 INFO  (libvirt/events) [virt.vm] (vmId=&#39;b3962e5c-08e3-444e-910e-ea675fa1a5c7&#39;) Stopping connection (guestagent:435)<br>
&gt;&gt; 2017-12-05 06:20:07,469-0500 INFO  (libvirt/events) [vdsm.api] START teardownImage(sdUUID=&#39;952bb427-b88c-4fbe-99ef-49970d3aaf70&#39;, spUUID=&#39;9dcfeaaf-96b7-4e26-a327-5570e0e39261&#39;, imgUUID=&#39;e6eadbae-ec7a-48f4-8832-64a622a12bef&#39;, volUUID=None) from=internal, task_id=9efedf46-d3be-4e41-b7f9-a074ed6344f6 (api:46)<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; &lt;/error&gt;<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; _______________________________________________<br>
&gt;&gt; Devel mailing list<br>
&gt;&gt; <a href="mailto:Devel@ovirt.org">Devel@ovirt.org</a><br>
&gt;&gt; <a href="http://lists.ovirt.org/mailman/listinfo/devel" rel="noreferrer" target="_blank">http://lists.ovirt.org/mailman/listinfo/devel</a><br>
_______________________________________________<br>
Devel mailing list<br>
<a href="mailto:Devel@ovirt.org">Devel@ovirt.org</a><br>
<a href="http://lists.ovirt.org/mailman/listinfo/devel" rel="noreferrer" target="_blank">http://lists.ovirt.org/mailman/listinfo/devel</a></div></div></blockquote></div><br></div></div>