Download.gluster.org 27 April 2016 postmortem
by Michael Scherer
Hi,
as promised, here is the post-mortem of the incident. If you would like
more information or have any remarks, please do not hesitate to ask, since
this is our first attempt at writing one.
I modelled it on the example in
http://shop.oreilly.com/product/0636920041528.do (Appendix D), as that is
the book I am reading at the moment. We will formalize the format later.
Download.gluster.org was not serving files
Date: 2016-04-27
Participating people:
- misc
Summary:
The Download.gluster.org HTTP server was returning error 403 for all URLs,
which impacted oVirt Jenkins jobs and users of the repository,
among others. The server is used to distribute Gluster RPMs.
Impact:
- oVirt CI jobs got blocked
- users couldn't install Gluster
Root cause:
The underlying block device on Rackspace went down for an undiagnosed
reason, triggering XFS errors on the server and thus 403 errors at the
HTTP level.
The root cause of the block device error is still unknown; no error
has been seen on the Rackspace status page for this DC. A ticket was
opened with Rackspace to find out what was going on (160427-iad-0000814).
A follow-up to this post-mortem will be sent if the ticket says something
more than "shit happens".
Resolution:
The whole server was rebooted, and upon reboot, the block device came
back.
Lessons learned:
- what went well:
  - people notified the admin quickly on IRC and on gluster-infra
- where we were lucky:
  - the server and block device came back immediately
  - it failed during EMEA business hours, with misc on IRC (just
    arrived at the office)
- what went badly:
  - we do not have proper HA for the service
  - we do not have automated monitoring for it
  - the setup uses 2 block devices of 120G in LVM, thus making it
    twice as likely to fail
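
The "twice as likely to fail" point can be checked with a small calculation. This is only a sketch under two assumptions that the post-mortem does not state: the LVM volume spans both devices without redundancy (so losing either one breaks the filesystem), and the devices fail independently with the same probability. The function name and the 1% figure are illustrative, not measured values.

```python
def spanned_failure_probability(p_single: float, n_devices: int) -> float:
    """Probability that a non-redundant volume spanning n devices fails.

    Assumes each device fails independently with probability p_single,
    and that the volume fails as soon as any one device fails.
    """
    all_devices_survive = (1.0 - p_single) ** n_devices
    return 1.0 - all_devices_survive


# Illustrative number only: 1% chance that a given device fails in some period.
p = 0.01
one_device = spanned_failure_probability(p, 1)
two_devices = spanned_failure_probability(p, 2)
print(f"1 device:  {one_device:.4f}")
print(f"2 devices: {two_devices:.4f}")
print(f"ratio:     {two_devices / one_device:.2f}")
```

For small failure probabilities the ratio stays just under 2, which is where the rule of thumb comes from.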
Timeline (in UTC)
- 05:39 first XFS error message in the log
- 08:41 misc is pinged on IRC
- 08:56 misc acks and diagnoses the issue
- 09:00 the server and service are back to normal
- 09:00 first mail about the problem hits gluster-infra
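
The intervals implied by this timeline can be worked out as below. This is a sketch: the key names are made up for illustration, and it assumes the outage began with the first XFS log message, which the timeline suggests but does not confirm.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline; the key names are illustrative.
TIMELINE = {
    "first_xfs_error": "05:39",
    "admin_pinged": "08:41",
    "admin_ack": "08:56",
    "service_restored": "09:00",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_notify = minutes_between(TIMELINE["first_xfs_error"], TIMELINE["admin_pinged"])
time_to_ack = minutes_between(TIMELINE["admin_pinged"], TIMELINE["admin_ack"])
time_to_fix = minutes_between(TIMELINE["admin_ack"], TIMELINE["service_restored"])
total_outage = minutes_between(TIMELINE["first_xfs_error"], TIMELINE["service_restored"])

print(f"first error -> admin pinged: {time_to_notify} min")  # 182 min
print(f"pinged -> acknowledged:      {time_to_ack} min")     # 15 min
print(f"acknowledged -> restored:    {time_to_fix} min")     # 4 min
print(f"total:                       {total_outage} min")    # 201 min
```

Most of the 3 hours 21 minutes passed before anyone was notified, while the fix itself took only a few minutes.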
Potential improvements to make:
- add monitoring on the Gluster side
- use the CentOS SIG repo on the oVirt side
- add more sysadmins for Gluster
- add a redundant service for that
  - a 2nd download server with a shared Gluster backend
- migrate the storage to a proper setup with 1 single block device,
  rather than 2.
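
As a rough idea of what the monitoring item could look like, here is a minimal HTTP probe using only the Python standard library. It is a sketch, not what is or will be deployed: the function name `probe`, the timeout and the `opener` parameter (there only so the check can be exercised without network access) are all assumptions.

```python
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (ok, detail) for a single HTTP GET of url.

    ok is True only for a 200 response. A 403 like the one seen in this
    incident, or a connection failure, gives ok == False.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return False, f"HTTP {err.code}"
    except OSError as err:
        # URLError, timeouts and socket errors are all OSError subclasses.
        return False, f"unreachable: {err}"
    return status == 200, f"HTTP {status}"
```

Something like `probe("http://download.gluster.org/")` could then run from cron or from an existing monitoring system, alerting whenever `ok` is false.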
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
[JIRA] (OVIRT-451) jenkins slaves ssh connection drops too often
by eyal edri [Administrator] (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-451?page=com.atlassian.jira... ]
eyal edri [Administrator] reassigned OVIRT-451:
-----------------------------------------------
Assignee: Nadav Goldin (was: infra)
Can you look into it today if you have time?
> jenkins slaves ssh connection drops too often
> ---------------------------------------------
>
> Key: OVIRT-451
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-451
> Project: oVirt - virtualization made easy
> Issue Type: Task
> Reporter: sbonazzo
> Assignee: Nadav Goldin
>
> I'm connecting to ovirt servers (jenkins slaves, resources,...) using foreman.ovirt.org as bastion.
> if I don't actively use the keyboard on the ssh connection, the connection is closed in a few seconds with:
> packet_write_wait: Connection to UNKNOWN: Broken pipe
> please configure the servers for not closing the ssh connection automatically or for doing that only after 1 hour.
Fwd: [Gluster-infra] [ovirt-users] [Attention needed] GlusterFS repository down - affects CI / Installations
by Nadav Goldin
adding infra
---------- Forwarded message ----------
From: Niels de Vos <ndevos(a)redhat.com>
Date: Wed, Apr 27, 2016 at 12:09 PM
Subject: Re: [Gluster-infra] [ovirt-users] [Attention needed] GlusterFS
repository down - affects CI / Installations
To: Ravishankar N <ravishankar(a)redhat.com>
Cc: devel <devel(a)ovirt.org>, gluster-infra <gluster-infra(a)gluster.org>,
Nadav Goldin <ngoldin(a)redhat.com>, "gluster-users(a)gluster.org List" <
Gluster-users(a)gluster.org>, users(a)ovirt.org
On Wed, Apr 27, 2016 at 02:30:57PM +0530, Ravishankar N wrote:
> @gluster infra - FYI.
>
> On 04/27/2016 02:20 PM, Nadav Goldin wrote:
> >Hi,
> >The GlusterFS repository became unavailable this morning, as a result all
> >Jenkins jobs that use the repository will fail, the common error would be:
> >
> >
> >http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/epel-7...:
> > [Errno 14] HTTP Error 403 - Forbidden
> >
> >
> >Also, installations of oVirt will fail.
I thought oVirt moved to using the packages from the CentOS Storage SIG?
In any case, automated tests should probably use those instead of the
packages on download.gluster.org. We're trying to minimize the work
packagers need to do, and get the glusterfs and other components in the
repositories that are provided by different distributions.
For more details, see the quickstart for the Storage SIG here:
https://wiki.centos.org/SpecialInterestGroup/Storage/gluster-Quickstart
HTH,
Niels