Migration of Jenkins VM to new Cluster failed

Hi Evgheni,

Unfortunately, migrating the Jenkins VM failed. Luckily, it is back up and running in the old Production cluster. So that we can track this, I am listing the steps taken today:

1. Around 18:00 TLV time, I triggered a snapshot of the VM. This not only failed but also left the Jenkins VM non-responsive for a few minutes. More disturbing is that although the engine's 'Events' reported a failure, the new snapshot was listed under 'Snapshots' with status 'ok'. This also caused a few CI failures (which were re-triggered).

2. As a snapshot does not seem to be an option, I created a new VM in the production cluster, jenkins-2.phx.ovirt.org, and downloaded the latest backup from backup.phx.ovirt.org, so that in case of a failure we could change the DNS and use it (keep in mind this backup does not have any builds, only logs/configs).

3. I shut down the VM from the engine. It hung for a few minutes in 'shutting down' and then reported 'shutdown failed', after which it appeared again in the 'up' state but was non-responsive. virsh -r list also showed it as up.

4. I triggered another shutdown, which succeeded. As I didn't want to risk it any further, I let it boot in the same cluster, which was also successful.

I've attached some parts of engine.log. From a quick look at vdsm.log I didn't see anything, but it could help if someone else had a look (this is ovirt-srv02). The relevant log entries for the shutdown failure start at '2016-06-23 16:15'.

Either way, until we find the problem, I don't think we should risk it again before we have a proper recovery plan.

One brute-force option is to rsync from jenkins.phx.ovirt.org:/var/lib/data/jenkins to jenkins-2, with the jenkins daemon itself shut down on jenkins-2. Then we could schedule a downtime on jenkins.phx.ovirt.org, wait until everything is synced, stop jenkins (and puppet) there, then start the jenkins daemon on jenkins-2 and change the DNS CNAME of jenkins.ovirt.org to point to it.
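
To make the ordering of that rsync-based cutover easier to review, here is a rough sketch that only builds the list of steps and prints it; it does not run anything. Several details are assumptions and not taken from the actual hosts: that both machines manage 'jenkins' and 'puppet' as systemd units, that rsync is run on jenkins-2 and pulls over SSH, that the data lives under the same path on jenkins-2, and that a second rsync pass after stopping jenkins is wanted to pick up files written in the meantime. The names Step and build_cutover_plan are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    host: str         # where the command would be run
    description: str  # what the step is for
    command: str      # command line to run by hand (nothing is executed here)


def build_cutover_plan(old_host: str, new_host: str, data_dir: str) -> list:
    """Return the ordered cutover steps as data; this function runs nothing."""
    data_dir = data_dir.rstrip("/")
    # Trailing slashes make rsync copy the directory contents, not the directory itself.
    # Assumption: same path on the new host, and rsync pulls from the old host over SSH.
    rsync = f"rsync -aH --delete {old_host}:{data_dir}/ {data_dir}/"
    return [
        # Assumption: 'jenkins' and 'puppet' are systemd unit names on both hosts.
        Step(new_host, "keep the jenkins daemon stopped on the target", "systemctl stop jenkins"),
        Step(new_host, "initial sync while the old instance is still serving", rsync),
        Step(old_host, "downtime starts: stop puppet so it cannot restart jenkins", "systemctl stop puppet"),
        Step(old_host, "stop jenkins so the data directory stops changing", "systemctl stop jenkins"),
        # Assumption: a final pass to catch anything written since the initial sync.
        Step(new_host, "final sync", rsync),
        Step(new_host, "start jenkins on the new host", "systemctl start jenkins"),
        Step("dns", f"point the jenkins.ovirt.org CNAME at {new_host}", "(manual change)"),
    ]


if __name__ == "__main__":
    plan = build_cutover_plan(
        "jenkins.phx.ovirt.org", "jenkins-2.phx.ovirt.org", "/var/lib/data/jenkins"
    )
    for number, step in enumerate(plan, start=1):
        print(f"{number}. [{step.host}] {step.description}: {step.command}")
```

Keeping the DNS change as the very last step means that, up to that point, rolling back is just starting jenkins and puppet again on jenkins.phx.ovirt.org.
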
If everything goes smoothly it should run fine, and if not, we still have jenkins.phx.ovirt.org running.

Another option is to unmount /var/lib/data/ and mount it on jenkins-2 instead, though then we might be in trouble if something goes wrong along the way.

Nadav
Nadav Goldin