[ovirt-users] upgrade from 3.5 to 3.6 causing problems with migration

Martin Polednik mpolednik at redhat.com
Mon Nov 9 23:20:10 UTC 2015


On 09/11/15 14:00 -0500, Jason Keltz wrote:
>Hi Shmuel,
>
>Thanks very much for looking into my problem!
>
>I installed 3.6 on the engine.  I rebooted the engine.
>The 3 hosts were still running vdsm from 3.5.  I checked back in the 
>yum log, and it was 4.16.26-0.el7.
>On the first host upgrade (virt1), I made a mistake.  After bringing 
>in the 3.6 repo, I upgraded the packages with just "yum update". 
>However, I know that I should have put the host into maintenance mode 
>first.  After the updates installed, I put the host into maintenance 
>mode, and it migrated the VMs off, during which I saw more than one 
>failed VM migration.
>I'm willing to accept the failures there because I should have put the 
>host into maintenance mode first.  Live and learn!
>I had two other hosts left to do this right.  For virt2 and virt3, I put 
>the hosts into maintenance mode first.  However, the same problem 
>occurred with failed migrations.  I proceeded anyway, brought the 
>failed VMs back up elsewhere, applied the updates, and rebooted the 
>hosts.
>So now, 3.6 is installed on the engine and the 3 hosts, and they are 
>all rebooted.
>I tried another migration, and again, there were failures, so this 
>isn't specifically related to just 3.6.
>By the way, I'm using ovirtmgmt for migrations.  virt1, virt2, and 
>virt3 have a dedicated 10G link via Intel X540 to a 10G switch. engine 
>is on that network as well, but it's a 1G link.
>I was able to run iperf tests between the nodes, and saw nearly 10G 
>speed.  During the failed migrations, I also don't have any problem 
>with ovirtmgmt, so I don't think the network is an issue...
>
>I found this bug in bugzilla over the weekend:
>
>https://bugzilla.redhat.com/show_bug.cgi?id=1142776
>
>I was nearly positive that this had something to do with the failed 
>migrations.  As a final test, I decided to migrate the VMs from one 
>host to another, one at a time.  I was nearly done migrating all the 
>VMs from virt3 to virt1.   I had migrated 5 VMs all successfully, one 
>at a time, without any failures.  When I migrated the 6th, boom - it 
>didn't migrate, and the VM was down.  It was a pretty basic VM as 
>well, with very little traffic.
>
>I included on the bug report above an additional link with the engine, 
>virt1, virt2, and virt3 logs for Saturday where I was doing this 
>experimentation because there's a couple more failures recorded.  I'll 
>include that link here:
>
>http://www.eecs.yorku.ca/~jas/ovirt-debug/11072015
>
>The last VM that I attempted to transfer one at a time was "webapp".  
>It was transferred from virt3 to virt1.
>
>I'm really puzzled that more people haven't experienced this issue.   
>I've disabled the load balancing feature because I'm really concerned 
>that if it load balances my VMs, then they might not come back up!  I 
>don't *think* this was happening when I was all purely 3.5, but I 
>can't remember doing big migrations.  I most certainly was able to put 
>a host into maintenance mode without having VMs go down!
>
>In another email, Dan Kenigsberg says that "It seems that 3.6's 
>vdsm-4.17.10.1 cannot consume a Random Number Generator device that 
>was created on 3.5."  Thanks also to Dan for looking into that as 
>well!  I'm still waiting for more details, though, before opening 
>additional bug reports, because this puzzles me... if that were the 
>case, then since ALL of the VMs were created on 3.5, and ALL with a 
>random number generator device, all of them would fail migration, but 
>they don't.  I have a feeling that there are a few issues at play here.

Hello and sorry for dropping in so late.

The issue is that the 3.5 engine created the RNG device without
sending the device key (which should have been 'rng', but this wasn't
properly documented in the API; fixed in [1]). That caused the
getUnderlyingRngDevice method to fail to match the device (fixed in
[2]), so it was treated as an unknown device (for which the notion of
'source' isn't known). The 3.6 engine should handle it correctly [3].
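
Roughly, the mismatch looks like this (an illustrative sketch only,
with simplified device specs -- not the actual engine/vdsm code):

    # illustrative only: simplified device specs, not real vdsm code

    # what a fixed engine sends for the RNG device
    good_spec = {'type': 'rng', 'device': 'rng',
                 'specParams': {'source': 'random'}}

    # what the 3.5 engine sent: no 'device' key at all
    legacy_spec = {'type': 'rng',
                   'specParams': {'source': 'random'}}

    def looks_like_rng(dev):
        # matching on the missing key fails, so the device falls
        # through to the generic "unknown device" handling
        return dev.get('device') == 'rng'

    print(looks_like_rng(good_spec))    # True
    print(looks_like_rng(legacy_spec))  # False -> unknown device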

The implication is that when a VM is created in a 3.5 environment and
moved to a 3.6 environment, the matching will work, but there will be
two RNG devices instead of the single one. The same goes for migration.

I'm not sure about the proper fix yet. To rescue the 3.6 VM we would
have to either remove the duplicate device that has no specParams
(meaning the address would be lost), or remove the original device but
add its specParams to the new device. A temporary workaround would be
a hook that does this, along the lines of the sketch below.
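
Something like this (only a rough, untested sketch -- the hook point,
the file name, and the "keep the first <rng>, drop the rest" heuristic
are assumptions that would have to be checked against the actual
duplicated devices before relying on it):

    #!/usr/bin/python
    # Rough sketch: deduplicate <rng> devices in the libvirt domain
    # XML, e.g. dropped into
    # /usr/libexec/vdsm/hooks/before_vm_start/50_dedup_rng
    import hooking

    domxml = hooking.read_domxml()
    devices = domxml.getElementsByTagName('devices')[0]
    rngs = devices.getElementsByTagName('rng')

    if len(rngs) > 1:
        # keep the first <rng> element, drop any extras
        for extra in rngs[1:]:
            devices.removeChild(extra)
        hooking.write_domxml(domxml)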

[1] https://gerrit.ovirt.org/#/c/43166/
[2] https://gerrit.ovirt.org/#/c/40095/
[3] https://gerrit.ovirt.org/#/c/43165/

Regards,
mpolednik

>Jason.
>
>On 11/09/2015 11:13 AM, Shmuel Melamud wrote:
>>Hi!
>>
>>I'm trying to reproduce your issue. Can you help me with the exact 
>>scenario?
>>
>>1. You had 3.5 running. What version of VDSM was on the hosts?
>>2. You replaced the engine and restarted it. Now it is 3.6, right?
>>3. You put a host into maintenance. Did the failure occur when VMs 
>>were migrating from it? Or did you put the host into maintenance, 
>>replace VDSM on it, and the failure occurred when VMs were migrating 
>>to it from other hosts?
>>
>>Shmuel
>>
>>On Fri, Nov 6, 2015 at 6:21 PM, Jason Keltz <jas at cse.yorku.ca> wrote:
>>
>>    Hi.
>>
>>    Last night, I upgraded my engine from 3.5 to 3.6.  That went
>>    flawlessly.
>>    Today, I'm trying to upgrade vdsm on the hosts from 3.5 to 3.6
>>    (along with applying other RHEL7.1 updates).  However, when I try
>>    to put each host into maintenance mode and migrations start, they
>>    all seem to FAIL now!  Even worse, when they fail, it leaves the
>>    VMs DOWN!  If there's a failure, I'd expect the host to simply
>>    abort the migration....  Any help in debugging this would be VERY
>>    much appreciated!
>>
>>        2015-11-06 10:09:16,065 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-4) [] Correlation ID: 658ba478, Job ID: 524e8c44-04e0-42d3-89f9-9f4e4d397583, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: eportfolio, Source: virt1).
>>        2015-11-06 10:10:17,112 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-22) [2f0dee16] Correlation ID: 7da3ac1b, Job ID: 93c0b1f2-4c8e-48cf-9e63-c1ba91be425f, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: ftp1, Source: virt1).
>>        2015-11-06 10:15:08,273 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-45) [] Correlation ID: 5394ef76, Job ID: 994065fc-a142-4821-934a-c2297d86ec12, Call Stack: null, Custom Event ID: -1, Message: Migration failed while Host is in 'preparing for maintenance' state.
>>        2015-11-06 10:19:13,712 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-36) [] Correlation ID: 6e422728, Job ID: 994065fc-a142-4821-934a-c2297d86ec12, Call Stack: null, Custom Event ID: -1, Message: Migration failed while Host is in 'preparing for maintenance' state.
>>        2015-11-06 10:42:37,852 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-12) [] Correlation ID: e7f6300, Job ID: 1ea16622-0fa0-4e92-89e5-9dc235c03ef8, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: ipa, Source: virt1).
>>        2015-11-06 10:43:59,732 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-40) [] Correlation ID: 39cfdf9, Job ID: 72be29bc-a02b-4a90-b5ec-8b995c2fa692, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: labtesteval, Source: virt1).
>>        2015-11-06 10:52:11,893 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-23) [] Correlation ID: 5c435149, Job ID: 1dcd1e14-baa6-44bc-a853-5d33107b759c, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: www-vhost, Source: virt1).
>>
>>
>>
>>    The complete engine log, virt1, virt2, and virt3 vdsm logs are here:
>>
>>    http://www.eecs.yorku.ca/~jas/ovirt-debug/11062015
>>
>>    Jason.
>>
>>    _______________________________________________
>>    Users mailing list
>>    Users at ovirt.org
>>    http://lists.ovirt.org/mailman/listinfo/users
>>
>>
>


