Migration of Jenkins VM to new Cluster failed

Nadav Goldin ngoldin at redhat.com
Fri Jun 24 00:22:58 UTC 2016


Hi Evgheni,
Unfortunately, migrating the Jenkins VM failed. Luckily (for me) it is
back up and running in the old Production cluster. So that we can track
this, I am listing again the steps taken today:

1. Around 18:00 TLV time, I triggered a snapshot of the VM. This not
only failed but caused the Jenkins VM to be non-responsive for a few
minutes. More disturbing is that although the 'events' in the engine
announced a failure, under 'snapshots' the new snapshot was listed
with status 'ok'. This also caused a few CI failures (which were
re-triggered).

2. As a snapshot does not seem to be an option, I created a new VM in
the production cluster, jenkins-2.phx.ovirt.org, and downloaded the
latest backup from backup.phx.ovirt.org, so in case of a failure we
could change the DNS and use it (keep in mind this backup does not have
any builds, only logs/configs).

3. I shut down the VM from the engine. It hung for a few minutes in
'shutting down' and then announced 'shutdown failed', which caused it
to appear again in the 'up' state, but it was non-responsive.
virsh -r list also stated it was up.

4. I triggered another shutdown, which succeeded. As I didn't want to
risk it any further, I let it boot in the same cluster, which was also
successful.

I've attached some parts of engine.log. From a quick look at vdsm.log
I didn't see anything, but it could help if someone else has a look
(this is ovirt-srv02). The relevant log times for the shutdown failure
are from '2016-06-23 16:15'.
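
To line up the VM state changes with the times above, the transitions
can be pulled out of engine.log with a few lines of Python. This is
only a sketch: it assumes the 'moved from X --> Y' line format seen in
the attached excerpt, and the function name vm_transitions is made up
for illustration.

```python
import re
from datetime import datetime

# Matches VdsUpdateRunTimeInfo lines of the form seen in the excerpt:
#   <date> <time>,<ms> INFO [...] (...) VM <name> <uuid> moved from <old> --> <new>
TRANSITION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+\w+\s+.*"
    r"VM (?P<vm>\S+) \S+ moved from (?P<old>\w+) --> (?P<new>\w+)"
)


def vm_transitions(lines, vm_name):
    """Yield (timestamp, old_state, new_state) for every state change of vm_name."""
    for line in lines:
        m = TRANSITION.match(line)
        if m and m.group("vm") == vm_name:
            # Milliseconds are dropped; second resolution is enough here.
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            yield ts, m.group("old"), m.group("new")


# Two lines copied from the attached excerpt, used as sample input.
SAMPLE = [
    "2016-06-23 09:06:49,592 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] "
    "(DefaultQuartzScheduler_Worker-44) VM jenkins-phx-ovirt-org "
    "e7a7b735-0310-4f88-9ed9-4fed85835a01 moved from Up --> Paused",
    "2016-06-23 09:17:29,020 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] "
    "(DefaultQuartzScheduler_Worker-69) VM jenkins-phx-ovirt-org "
    "e7a7b735-0310-4f88-9ed9-4fed85835a01 moved from Paused --> Up",
]

for ts, old, new in vm_transitions(SAMPLE, "jenkins-phx-ovirt-org"):
    print(ts, old, "-->", new)
```

On the real file, the same function can be fed an open file object
instead of the sample list.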

Either way, until we find the problem, I'm not sure we should risk it
before we have a proper recovery plan. One brute-force option is using
rsync from jenkins.phx.ovirt.org:/var/lib/data/jenkins to jenkins-2,
with the jenkins daemon itself shut down on 'jenkins-2'. Then we could
schedule a downtime on jenkins.phx.ovirt.org, wait until everything is
synced, and stop jenkins (and puppet), then start the jenkins daemon on
jenkins-2 and change the DNS CNAME of jenkins.ovirt.org to point to
it. If everything goes smoothly it should run fine, and if not, we
still have jenkins.phx.ovirt.org running.
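
The rsync plan above could be scripted roughly as follows. This is a
sketch under assumptions, not a tested procedure: it assumes the script
runs on jenkins-2 and pulls the data over ssh, that the services are
called 'jenkins' and 'puppet' and are controlled with the 'service'
command, that jenkins and puppet are stopped on the old host, and that
a second rsync pass is run after the old Jenkins is stopped. The rsync
flags are also my choice. It only prints the commands; the DNS change
stays a manual step.

```python
import shlex

OLD = "jenkins.phx.ovirt.org"
DATA = "/var/lib/data/jenkins/"  # trailing slash: copy the directory contents


def cutover_plan():
    """Return the ordered commands, as they would be run on jenkins-2 (assumed)."""
    # -a keeps permissions/owners/times, -H keeps hard links,
    # --delete removes files on jenkins-2 that no longer exist on the source.
    sync = ["rsync", "-aH", "--delete", f"{OLD}:{DATA}", DATA]
    return [
        ["service", "jenkins", "stop"],              # keep the daemon down on jenkins-2
        sync,                                        # bulk copy while the old Jenkins is still up
        ["ssh", OLD, "service", "puppet", "stop"],   # downtime starts here
        ["ssh", OLD, "service", "jenkins", "stop"],
        sync,                                        # final pass (assumed) to pick up last changes
        ["service", "jenkins", "start"],             # start Jenkins on jenkins-2
    ]


def render(plan):
    """Turn each command list into a shell-safe string for review."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in plan]


for line in render(cutover_plan()):
    print(line)
```

Printing the plan first lets it be reviewed before anything is run by
hand; nothing here touches jenkins.phx.ovirt.org until the downtime
steps, so the old instance remains the fallback.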

Another option is to unmount /var/lib/data/ and mount it on
jenkins-2, though then we might be in trouble if something goes wrong
on the way.


Nadav.
-------------- next part --------------
engine.log
snapshot event
2016-06-23 09:06:49,592 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-44) VM jenkins-phx-ovirt-org e7a7b735-0310-4f88-9ed9-4fed85835a01 moved from Up --> Paused
, Custom Event ID: -1, Message: Failed to create live snapshot 'ngoldin_before_cluster_move' for VM 'jenkins-phx-ovirt-org'. VM restart is recommended. Note that using the created snapshot might cause data inconsistency.
, Custom Event ID: -1, Message: Failed to complete snapshot 'ngoldin_before_cluster_move' creation for VM 'jenkins-phx-ovirt-org'.
2016-06-23 09:17:29,020 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-69) VM jenkins-phx-ovirt-org e7a7b735-0310-4f88-9ed9-4fed85835a01 moved from Paused --> Up

failed shutdown
2016-06-23 15:59:20,348 INFO  [org.ovirt.engine.core.bll.ShutdownVmCommand] (org.ovirt.thread.pool-8-thread-25) [52b9dd27] Entered (VM jenkins-phx-ovirt-org).
2016-06-23 15:59:20,349 INFO  [org.ovirt.engine.core.bll.ShutdownVmCommand] (org.ovirt.thread.pool-8-thread-25) [52b9dd27] Sending shutdown command for VM jenkins-phx-ovirt-org.
2016-06-23 15:59:20,446 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-8-thread-25) [52b9dd27] Correlation ID: 52b9dd27, Job ID: f1f0d78e-ae68-465e-a3c1-e46d146fc2e7, Call Stack: null, Custom Event ID: -1, Message: VM shutdown initiated by admin on VM jenkins-phx-ovirt-org (Host: ovirt-srv02) (Reason: Not Specified).
2016-06-23 16:04:20,556 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-20) [2d2d1b3a] VM jenkins-phx-ovirt-org e7a7b735-0310-4f88-9ed9-4fed85835a01 moved from PoweringDown --> Up
2016-06-23 16:04:20,628 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-20) [2d2d1b3a] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Shutdown of VM jenkins-phx-ovirt-org failed.


