Hi Evgheni,
Unfortunately, migrating the Jenkins VM failed; luckily, it's back up and
running in the old Production cluster. So that we can track this, I am
listing the steps taken today again:
1. Around 18:00 TLV time, I triggered a snapshot of the VM. This not
only failed but also caused the Jenkins VM to be unresponsive for a few
minutes. More disturbing is that although 'events' in the engine
reported a failure, the new snapshot was listed under 'snapshots'
with status 'ok'. This also caused a few CI failures (which were
re-triggered).
2. As taking a snapshot does not seem to be an option, I created a new
VM in the production cluster, jenkins-2.phx.ovirt.org, and downloaded
the latest backup from backup.phx.ovirt.org, so that in case of a
failure we could change the DNS and use it (keep in mind this backup
does not have any builds, only logs/configs).
3. I shut down the VM from the engine. It hung for a few minutes in
'shutting down' and then reported 'shutdown failed', after which it
appeared in the 'up' state again but was unresponsive.
virsh -r list also showed it as up.
4. I triggered another shutdown, which succeeded. As I didn't want to
risk it any further, I let it boot in the same cluster, which was also
successful.
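For step 3, one way to double-check what libvirt itself reports for the VM is to filter the read-only virsh listing on the host. This is only a minimal sketch: vm_state is an illustrative helper (not an existing tool), and the domain name 'jenkins' in the usage line is an assumption, since the real libvirt domain name may differ.

```shell
# Reads `virsh -r list --all` output on stdin and prints the state of the
# domain whose name is passed as $1. The state can be more than one word
# (for example "shut off"), so every column after the name is joined.
vm_state() {
    awk -v name="$1" '
        $2 == name {
            s = $3
            for (i = 4; i <= NF; i++) s = s " " $i
            print s
        }'
}

# Usage on the hypervisor (domain name "jenkins" is an assumption):
#   virsh -r list --all | vm_state jenkins
```

If this prints 'running' while the engine says the shutdown failed, libvirt and the engine at least agree that the guest is still up.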
I've attached some parts of engine.log. From a quick look at vdsm.log
I didn't see anything, but it would help if someone else had a look (this
is ovirt-srv02). The relevant log entries for the shutdown failure start
at '2016-06-23 16:15'.
Either way, until we find the problem, I'm not sure we should risk
another attempt before we have a proper recovery plan. One brute-force
option is to rsync jenkins.phx.ovirt.org:/var/lib/data/jenkins to
jenkins-2, with the jenkins daemon itself shut down on jenkins-2. Then
we could schedule a downtime on jenkins.phx.ovirt.org, wait until
everything is synced, stop jenkins (and puppet), start the jenkins
daemon on jenkins-2, and change the DNS CNAME of jenkins.ovirt.org to
point to it. If everything goes smoothly it should run fine, and if not,
we still have jenkins.phx.ovirt.org running.
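The rsync option could look roughly like the sketch below. It only prints the steps and executes nothing. The hostnames and the data path are the ones mentioned above; the rsync flags, the 'service' commands, and the second rsync pass after stopping jenkins are my assumptions and would need checking before a real run.

```shell
#!/bin/sh
# Hostnames and the data path are taken from this mail; the rsync flags and
# the 'service' commands are assumptions, not a tested procedure.
SRC_HOST="jenkins.phx.ovirt.org"
DATA_DIR="/var/lib/data/jenkins"

# Print the rsync command to run on jenkins-2 (it pulls from the old host).
# -a keeps permissions/owners/times, -H keeps hard links, --delete drops files
# that no longer exist on the source, --numeric-ids avoids uid/gid remapping.
rsync_cmd() {
    printf 'rsync -aH --delete --numeric-ids %s:%s/ %s/\n' \
        "$SRC_HOST" "$DATA_DIR" "$DATA_DIR"
}

# Print the cutover steps in order; nothing is executed here.
cutover_plan() {
    echo "1. [jenkins-2]   service jenkins stop"
    echo "2. [jenkins-2]   $(rsync_cmd)   (first pass, old host still live)"
    echo "3. [jenkins.phx] service jenkins stop; service puppet stop   (downtime starts)"
    echo "4. [jenkins-2]   $(rsync_cmd)   (final pass, assumed extra step)"
    echo "5. [jenkins-2]   service jenkins start"
    echo "6. [DNS]         point the jenkins.ovirt.org CNAME at jenkins-2.phx.ovirt.org"
}

cutover_plan
```

The trailing slashes on both paths make rsync copy the directory contents rather than nesting a second 'jenkins' directory inside the target.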
Another option is to unmount /var/lib/data/ and mount it on
jenkins-2, though then we might be in trouble if something goes wrong
along the way.
Nadav.