[JIRA] (OVIRT-609) Jenkins snapshot creation failed
by Evgheni Dereveanchin (oVirt JIRA)
Evgheni Dereveanchin created OVIRT-609:
------------------------------------------
Summary: Jenkins snapshot creation failed
Key: OVIRT-609
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-609
Project: oVirt - virtualization made easy
Issue Type: Bug
Reporter: Evgheni Dereveanchin
Assignee: infra
[~ngoldin(a)redhat.com] issued a live snapshot creation on the Jenkins VM to prepare it for the cluster move. This failed, and it's not really clear why. The relevant event logs are below; they suggest that the hypervisor started dumping VM memory to the snapshot, which caused a storage slowdown.
2016-Jun-23, 18:06 Snapshot 'ngoldin_before_cluster_move' creation for VM 'jenkins-phx-ovirt-org' was initiated by admin.
2016-Jun-23, 18:09 Failed to create live snapshot 'ngoldin_before_cluster_move' for VM 'jenkins-phx-ovirt-org'. VM restart is recommended. Note that using the created snapshot might cause data inconsistency.
2016-Jun-23, 18:13 Host ovirt-srv02 has network interface which exceeded the defined threshold [95%] (em1: transmit rate[100%], receive rate [0%])
2016-Jun-23, 18:13 Storage domain Production experienced a high latency of 18.7802 seconds from host ovirt-srv11. This may cause performance and functional issues. Please consult your Storage Administrator.
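To make the timeline in these events easier to check, here is a minimal Python sketch that parses the timestamps and pulls out the reported storage latency. The function names and regular expressions are illustrative assumptions, not part of any oVirt tooling; the event text is copied from the log above (trailing advice sentences trimmed), and the month abbreviation parsing assumes an English locale.

```python
import re
from datetime import datetime

# Event lines copied from the report above (trailing advice sentences trimmed).
EVENTS = [
    "2016-Jun-23, 18:06 Snapshot 'ngoldin_before_cluster_move' creation for VM 'jenkins-phx-ovirt-org' was initiated by admin.",
    "2016-Jun-23, 18:09 Failed to create live snapshot 'ngoldin_before_cluster_move' for VM 'jenkins-phx-ovirt-org'.",
    "2016-Jun-23, 18:13 Host ovirt-srv02 has network interface which exceeded the defined threshold [95%] (em1: transmit rate[100%], receive rate [0%])",
    "2016-Jun-23, 18:13 Storage domain Production experienced a high latency of 18.7802 seconds from host ovirt-srv11.",
]

# Hypothetical patterns written for this event format only.
EVENT_RE = re.compile(r"^(\d{4}-[A-Za-z]{3}-\d{2}, \d{2}:\d{2}) (.*)$")
LATENCY_RE = re.compile(r"high latency of ([\d.]+) seconds from host ([\w-]+)")


def parse_event(line):
    """Split one event line into (timestamp, message)."""
    match = EVENT_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised event line: {line!r}")
    # %b parses 'Jun' under an English/C locale (assumption).
    when = datetime.strptime(match.group(1), "%Y-%b-%d, %H:%M")
    return when, match.group(2)


def minutes_between(first, second):
    """Minutes elapsed from the first event line to the second."""
    delta = parse_event(second)[0] - parse_event(first)[0]
    return delta.total_seconds() / 60


def storage_latency(line):
    """Return (host, seconds) for a storage latency event, else None."""
    match = LATENCY_RE.search(line)
    if match is None:
        return None
    return match.group(2), float(match.group(1))


if __name__ == "__main__":
    print(minutes_between(EVENTS[0], EVENTS[1]))  # snapshot start to failure
    print(minutes_between(EVENTS[0], EVENTS[3]))  # snapshot start to latency warning
    print(storage_latency(EVENTS[3]))
```

With the events above, the failure is reported 3 minutes after the snapshot was initiated, and the network and latency warnings follow 4 minutes later.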
--
This message was sent by Atlassian JIRA
(v1000.98.4#100004)
Migration of Jenkins VM to new Cluster failed
by Nadav Goldin
Hi Evgheni,
Unfortunately, migrating the Jenkins VM failed; luckily, it is back
running in the old Production cluster. So that we can track this, I am
listing again the steps taken today:
1. Around 18:00 TLV time, I triggered a snapshot of the VM. This not
only failed but also caused the Jenkins VM to be non-responsive for a few
minutes. More disturbing is that although the 'events' in the
engine announced a failure, under 'snapshots' the new snapshot was
listed with status 'ok'. This also caused a few CI failures (which were
re-triggered).
2. As a snapshot does not seem to be an option, I created a new VM in the
production cluster, jenkins-2.phx.ovirt.org, and downloaded the latest
backup from backup.phx.ovirt.org, so in case of a failure we could
change the DNS and use it (keep in mind this backup does not have any
builds, only logs/configs).
3. I shut down the VM from the engine - it was hanging for a few
minutes in 'shutting down' and then announced 'shutdown failed', which
caused it to appear again in the 'up' state, but it was non-responsive.
virsh -r list also stated it was up.
4. I triggered another shutdown, which succeeded. As I didn't want to
risk it any more, I let it boot in the same cluster, which was also
successful.
I've attached some parts of engine.log. From a quick look at vdsm.log
I didn't see anything, but it could help if someone else has a look (this
is ovirt-srv02). The relevant log times for the shutdown failure are
from '2016-06-23 16:15'.
Either way, until we find the problem, I'm not sure we should risk it
before we have a proper recovery plan. One brute-force option is using
rsync from jenkins.phx.ovirt.org:/var/lib/data/jenkins to jenkins-2,
with the jenkins daemon itself shut down on 'jenkins-2'. Then we could
schedule a downtime on jenkins.phx.ovirt.org, wait until everything is
synced, and stop jenkins (and puppet), then start the jenkins daemon on
jenkins-2 and change the DNS CNAME of jenkins.ovirt.org to point to
it. If everything goes smoothly it should run fine, and if not, we still
have jenkins.phx.ovirt.org running.
Another option is to unmount /var/lib/data/ and mount it back on
jenkins-2, though then we might be in trouble if something goes wrong
on the way.
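For discussion, the rsync-based option could be written down as an ordered plan. The Python sketch below only builds the list of commands and does not run anything. The `systemctl` invocations, the service names `jenkins` and `puppet`, the rsync flags, and the function name `cutover_plan` are assumptions for illustration; the hosts and the data path are the ones named above. The DNS CNAME change is left as a manual step.

```python
import shlex


def cutover_plan(src_host="jenkins.phx.ovirt.org",
                 dst_host="jenkins-2.phx.ovirt.org",
                 data_dir="/var/lib/data/jenkins",
                 dry_run=True):
    """Return ordered (host, argv) steps for the rsync cutover.

    Nothing is executed here; the caller decides how to run each step.
    Service names and the use of systemctl are assumptions.
    """
    rsync = ["rsync", "-a", "--delete"]
    if dry_run:
        # --dry-run makes rsync report what it would copy without copying.
        rsync.append("--dry-run")
    # Trailing slashes: copy the directory contents, not the directory itself.
    rsync += [f"{data_dir}/", f"{dst_host}:{data_dir}/"]

    return [
        (dst_host, ["systemctl", "stop", "jenkins"]),   # daemon stays down on the target
        (src_host, rsync),                              # bulk sync while the source is still live
        (src_host, ["systemctl", "stop", "jenkins"]),   # scheduled downtime starts here
        (src_host, ["systemctl", "stop", "puppet"]),
        (src_host, rsync),                              # final sync of what changed since the bulk sync
        (dst_host, ["systemctl", "start", "jenkins"]),
        # Manual step afterwards: point the jenkins.ovirt.org CNAME at dst_host.
    ]


if __name__ == "__main__":
    for host, argv in cutover_plan():
        print(f"[{host}] {shlex.join(argv)}")
```

Running the bulk sync before the downtime keeps the final sync short, and as long as the source VM is left untouched it remains the fallback if the new host misbehaves.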
Nadav.
[Attention] Jenkins maintenance today(24/06/2016 01:00 AM TLV)
by Nadav Goldin
Hi,
As part of an infrastructure upgrade, in approximately one hour, at
01:00 AM TLV, http://jenkins.ovirt.org will be shut down for
maintenance. The expected downtime is 15 minutes.
Patches sent during the downtime will be checked afterwards; patches
sent around 40 minutes prior to the downtime might not get checked.
If patches you sent did not trigger CI, you can log in after the
downtime and re-trigger them manually.
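As a rough illustration of the windows described above, the following Python sketch classifies a patch by the time it was sent. The 40-minute and 15-minute figures are the approximate values from this announcement, times are naive TLV local times, and the function name and labels are made up for the example.

```python
from datetime import datetime, timedelta

# Values taken from the announcement; both durations are approximate.
DOWNTIME_START = datetime(2016, 6, 24, 1, 0)   # 24/06/2016 01:00 AM TLV
DOWNTIME_LENGTH = timedelta(minutes=15)
AT_RISK_LEAD = timedelta(minutes=40)


def classify_patch(sent_at):
    """Return what to expect for a patch sent at the given TLV time."""
    if sent_at < DOWNTIME_START - AT_RISK_LEAD:
        return "checked normally"
    if sent_at < DOWNTIME_START:
        # CI may still be running when Jenkins goes down.
        return "might not get checked"
    if sent_at < DOWNTIME_START + DOWNTIME_LENGTH:
        return "checked after the downtime"
    return "checked normally"


if __name__ == "__main__":
    for hour, minute in [(0, 10), (0, 30), (1, 5), (1, 20)]:
        sent = datetime(2016, 6, 24, hour, minute)
        print(sent.strftime("%H:%M"), classify_patch(sent))
```

Anything in the "might not get checked" window is a candidate for a manual re-trigger after the downtime.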
Thanks,
Nadav.
[JIRA] (OVIRT-608) [URGENT] Half of the Jenkins slaves are offline
by Nadav Goldin (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-608?page=com.atlassian.jira... ]
Nadav Goldin commented on OVIRT-608:
------------------------------------
Quick check: all disconnected VMs were disconnected on purpose to reduce load.
22 VMs were disconnected, most of them by [~dcaroest] last week; I'm not sure how the number was calculated.
2 BM slaves are offline, most likely because they lost their IP addresses due to a DHCP problem.
> [URGENT] Half of the Jenkins slaves are offline
> -----------------------------------------------
>
> Key: OVIRT-608
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-608
> Project: oVirt - virtualization made easy
> Issue Type: By-EMAIL
> Reporter: sbonazzo
> Assignee: infra
>
> Please check what happened, 24 jenkins slaves are down right now.
> --
> Sandro Bonazzola
> Better technology. Faster innovation. Powered by community collaboration.
> See how it works at redhat.com
[JIRA] (OVIRT-608) [URGENT] Half of the Jenkins slaves are offline
by Nadav Goldin (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-608?page=com.atlassian.jira... ]
Nadav Goldin commented on OVIRT-608:
------------------------------------
During the evening I tried creating a snapshot of the Jenkins VM, which surprisingly caused the entire storage domain to slow down. The snapshot failed and halted the Jenkins VM for a few minutes. I'll check whether this disconnected more slaves than we intended.
[JIRA] (OVIRT-608) [URGENT] Half of the Jenkins slaves are offline
by eyal edri [Administrator] (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-608?page=com.atlassian.jira... ]
eyal edri [Administrator] commented on OVIRT-608:
-------------------------------------------------
Some of the slaves are offline for a reason. We have a storage overload that
can risk the stability of the entire DC, and until we move to
local disk storage we can't keep all the slaves running all the time.
Having said that, if there is something critical, we can start a few more
VMs to unblock a critical fix.
On Jun 23, 2016 9:49 PM, "sbonazzo (oVirt JIRA)" <
jira(a)ovirt-jira.atlassian.net> wrote:
sbonazzo created OVIRT-608:
------------------------------
Summary: [URGENT] Half of the Jenkins slaves are offline
Key: OVIRT-608
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-608
Project: oVirt - virtualization made easy
Issue Type: By-EMAIL
Reporter: sbonazzo
Assignee: infra
Please check what happened, 24 jenkins slaves are down right now.
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com
Re: [JIRA] (OVIRT-608) [URGENT] Half of the Jenkins slaves are offline
by Eyal Edri
Some of the slaves are offline for a reason. We have a storage overload that
can risk the stability of the entire DC, and until we move to
local disk storage we can't keep all the slaves running all the time.
Having said that, if there is something critical, we can start a few more
VMs to unblock a critical fix.
On Jun 23, 2016 9:49 PM, "sbonazzo (oVirt JIRA)" <
jira(a)ovirt-jira.atlassian.net> wrote:
sbonazzo created OVIRT-608:
------------------------------
Summary: [URGENT] Half of the Jenkins slaves are offline
Key: OVIRT-608
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-608
Project: oVirt - virtualization made easy
Issue Type: By-EMAIL
Reporter: sbonazzo
Assignee: infra
Please check what happened, 24 jenkins slaves are down right now.
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com
_______________________________________________
Infra mailing list
Infra(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra
[JIRA] (OVIRT-608) [URGENT] Half of the Jenkins slaves are offline
by sbonazzo (oVirt JIRA)
sbonazzo created OVIRT-608:
------------------------------
Summary: [URGENT] Half of the Jenkins slaves are offline
Key: OVIRT-608
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-608
Project: oVirt - virtualization made easy
Issue Type: By-EMAIL
Reporter: sbonazzo
Assignee: infra
Please check what happened, 24 jenkins slaves are down right now.
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com