Re: [ovirt-users] upgrade from 3.5 to 3.6 causing problems with migration

9 Nov 2015


Hi Shmuel,

Thanks very much for looking into my problem!

I installed 3.6 on the engine.  I rebooted the engine.
The 3 hosts were still running vdsm from 3.5.  I checked back in the yum 
log, and it was 4.16.26-0.el7.
On the first host upgrade (virt1), I made a mistake.  After bringing in 
the 3.6 repo, I upgraded the packages with just "yum update". However, I 
know that I should have put the host into maintenance mode first.  After 
the updates installed, I put the host into maintenance mode, and it 
migrated the VMs off, during which I saw more than one failed VM migration.
I'm willing to accept the failures there because I should have put the 
host into maintenance mode first.  Live and learn!
I had two other hosts on which to do this right.  For virt2 and virt3, I 
put the hosts into maintenance mode first.  However, the same problem 
occurred with failed migrations.  I proceeded anyway, brought the failed 
VMs back up elsewhere, applied the updates, and rebooted the hosts.
So now, 3.6 is installed on the engine and the 3 hosts, and they are all 
rebooted.
I tried another migration, and again, there were failures, so the 
problem isn't limited to the mixed 3.5/3.6 state during the upgrade.
By the way, I'm using ovirtmgmt for migrations.  virt1, virt2, and virt3 
have a dedicated 10G link via Intel X540 to a 10G switch. engine is on 
that network as well, but it's a 1G link.
I was able to run iperf tests between the nodes, and saw nearly 10G 
speed.  During the failed migrations, I also don't have any problem with 
ovirtmgmt, so I don't think the network is an issue...

I found this bug in bugzilla over the weekend:

https://bugzilla.redhat.com/show_bug.cgi?id=1142776

I was nearly positive that this had something to do with the failed 
migrations.  As a final test, I decided to migrate the VMs from one host 
to another, one at a time.  I was nearly done migrating all the VMs from 
virt3 to virt1.   I had migrated 5 VMs all successfully, one at a time, 
without any failures.  When I migrated the 6th, boom - it didn't 
migrate, and the VM was down.  It was a pretty basic VM as well, with 
very little traffic.

I added a link on the bug report above to the engine, virt1, virt2, and 
virt3 logs for Saturday, when I was doing this experimentation, because 
a couple more failures are recorded there.  I'll include that link here:

http://www.eecs.yorku.ca/~jas/ovirt-debug/11072015

The last VM that I attempted to transfer one at a time was "webapp".  It 
was transferred from virt3 to virt1.

I'm really puzzled that more people haven't experienced this issue.   
I've disabled the load balancing feature because I'm really concerned 
that if it load balances my VMs, then they might not come back up!  I 
don't *think* this was happening when I was all purely 3.5, but I can't 
remember doing big migrations.  I most certainly was able to put a host 
into maintenance mode without having VMs go down!

In another email, Dan Kenigsberg says that "It seems that 3.6's 
vdsm-4.17.10.1 cannot consume a Random Number Generator device that was 
created on 3.5."  Thanks also to Dan for looking into that as well!   
I'm still waiting for more details though before opening additional bug 
reports because this puzzles me... ALL of the VMs were created on 3.5, 
and ALL with a random number generator device, so if this were the 
case, all of them would fail migration, but they don't.   I have a 
feeling that there are a few issues at play here.
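
One way to test the RNG theory would be to compare the VMs that failed 
with the ones that migrated cleanly.  What follows is only a minimal 
sketch, not something from this thread: it assumes each VM's libvirt 
domain XML has been saved to a string (for example from 
"virsh -r dumpxml <vm>" on the host), and the function name 
has_rng_device is made up for illustration.

```python
import xml.etree.ElementTree as ET


def has_rng_device(domain_xml):
    """Return True if a libvirt domain XML string defines an <rng> device.

    Hypothetical helper: it only looks for <rng> directly under
    <devices>, which is where libvirt places it in the domain XML.
    """
    root = ET.fromstring(domain_xml)
    return root.find("./devices/rng") is not None


if __name__ == "__main__":
    # Illustrative XML only; a real dump contains many more elements.
    sample = (
        "<domain type='kvm'><name>webapp</name><devices>"
        "<rng model='virtio'>"
        "<backend model='random'>/dev/random</backend>"
        "</rng></devices></domain>"
    )
    print(has_rng_device(sample))
```

If every VM reports an RNG device but only some fail to migrate, that 
would support the suspicion that more than one issue is involved.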

Jason.

On 11/09/2015 11:13 AM, Shmuel Melamud wrote:
...
Hi!
I'm trying to reproduce your issue. Can you help me with the exact 
scenario?
1. You had 3.5 running. What version of VDSM was on the hosts?
2. You replaced the engine and restarted it. Now it is 3.6, right?
3. You put a host into maintenance. Did the failure occur when VMs were 
migrating from it? Or did you put the host into maintenance, replace VDSM 
on it, and the failure occurred when VMs were migrating to it from other hosts?
Shmuel
On Fri, Nov 6, 2015 at 6:21 PM, Jason Keltz <jas@cse.yorku.ca> wrote:
Hi.
Last night, I upgraded my engine from 3.5 to 3.6.  That went flawlessly.
Today, I'm trying to upgrade the vdsm on the hosts from 3.5 to 3.6
(along with applying other RHEL7.1 updates).  However, when I'm
trying to put each host into maintenance mode, and migrations
start to occur, they all seem to FAIL now!  Even worse, when they
fail, it leaves the hosts DOWN!  If there's a failure, I'd expect
the host to simply abort the migration....  Any help in debugging
this would be VERY much appreciated!
2015-11-06 10:09:16,065 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-4) [] Correlation ID: 658ba478, Job ID: 524e8c44-04e0-42d3-89f9-9f4e4d397583, Call Stack: null, Custom Event ID: -1, Message: Migration failed  (VM: eportfolio, Source: virt1).
2015-11-06 10:10:17,112 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-22) [2f0dee16] Correlation ID: 7da3ac1b, Job ID: 93c0b1f2-4c8e-48cf-9e63-c1ba91be425f, Call Stack: null, Custom Event ID: -1, Message: Migration failed  (VM: ftp1, Source: virt1).
2015-11-06 10:15:08,273 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-45) [] Correlation ID: 5394ef76, Job ID: 994065fc-a142-4821-934a-c2297d86ec12, Call Stack: null, Custom Event ID: -1, Message: Migration failed  while Host is in 'preparing for maintenance' state.
2015-11-06 10:19:13,712 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-36) [] Correlation ID: 6e422728, Job ID: 994065fc-a142-4821-934a-c2297d86ec12, Call Stack: null, Custom Event ID: -1, Message: Migration failed  while Host is in 'preparing for maintenance' state.
2015-11-06 10:42:37,852 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-12) [] Correlation ID: e7f6300, Job ID: 1ea16622-0fa0-4e92-89e5-9dc235c03ef8, Call Stack: null, Custom Event ID: -1, Message: Migration failed  (VM: ipa, Source: virt1).
2015-11-06 10:43:59,732 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-40) [] Correlation ID: 39cfdf9, Job ID: 72be29bc-a02b-4a90-b5ec-8b995c2fa692, Call Stack: null, Custom Event ID: -1, Message: Migration failed  (VM: labtesteval, Source: virt1).
2015-11-06 10:52:11,893 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-23) [] Correlation ID: 5c435149, Job ID: 1dcd1e14-baa6-44bc-a853-5d33107b759c, Call Stack: null, Custom Event ID: -1, Message: Migration failed  (VM: www-vhost, Source: virt1).
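
For anyone working through the same logs, here is a minimal sketch (not 
part of the original message) of tallying which VMs failed to migrate 
from engine.log text like the excerpt above.  It assumes the message 
format "Migration failed (VM: <name>, Source: <host>)" shown in these 
entries; the function name failed_migrations is hypothetical.

```python
import re
from collections import Counter

# Matches the per-VM failure message seen in the log excerpt, e.g.
# "Message: Migration failed  (VM: ftp1, Source: virt1)."
FAILED = re.compile(
    r"Migration failed\s+\(VM:\s*(?P<vm>[^,]+),\s*Source:\s*(?P<src>[^)]+)\)"
)


def failed_migrations(log_text):
    """Count failed migrations per (vm, source_host) pair."""
    # Entries may be wrapped across lines (by the mail client, for
    # instance), so collapse whitespace runs before matching.
    flat = re.sub(r"\s+", " ", log_text)
    return Counter(
        (m.group("vm").strip(), m.group("src").strip())
        for m in FAILED.finditer(flat)
    )


if __name__ == "__main__":
    import sys

    # Usage (path is up to you): python tally.py /path/to/engine.log
    with open(sys.argv[1]) as handle:
        for (vm, src), count in sorted(failed_migrations(handle.read()).items()):
            print(vm, src, count)
```

Entries that only say "Migration failed while Host is in 'preparing for 
maintenance' state" carry no VM name, so this sketch skips them.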
The complete engine log, virt1, virt2, and virt3 vdsm logs are here:
http://www.eecs.yorku.ca/~jas/ovirt-debug/11062015
Jason.