As promised, here is the post-mortem of the incident. If you would like
more information or have any remarks, please do not hesitate to ask, since
this is our first attempt at writing one.
I modelled it on the example of
, as that is the book I am
reading at the moment (Appendix D). We will formalize that later.
was not serving files.
The HTTP server was returning error 403 for all URLs,
which impacted ovirt jenkins jobs and users of the repository,
among others. The server is used to distribute gluster RPMs.
- ovirt CI jobs were blocked
- users could not install gluster
The underlying block device on rackspace went down for an undiagnosed
reason, triggering XFS errors on the server and thus 403 errors on the HTTP
server.
The root cause of the block device error is still unknown; no errors
have been reported on the rackspace status page for this DC. A ticket was
opened with rackspace to find out what happened (160427-iad-0000814). A
follow-up to this post-mortem will be sent if the ticket says something
more than "shit happens".
The whole server was rebooted, and upon reboot the block device came
back.
- what went well:
  - people notified the admin quickly on irc and on gluster-infra
- where we were lucky:
  - the server and block device came back immediately
  - it failed during EMEA business hours, with misc being on irc (having
    just arrived at the office)
- what went badly:
  - we do not have proper HA for the service
  - we do not have automated monitoring for it
  - the setup uses 2 block devices of 120G in LVM, making a failure
    twice as likely
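
To put a rough number on the "twice as likely" point: assuming the two
block devices fail independently and the LVM volume has no redundancy
(both are assumptions, nothing here was measured), the volume breaks as
soon as either device fails. A minimal sketch, with a made-up per-device
failure probability:

```python
def volume_failure_probability(p_device: float, n_devices: int) -> float:
    """Probability that at least one of n independent devices fails.

    Assumes the volume spans all devices without redundancy, so a single
    device failure takes the whole volume down.
    """
    return 1.0 - (1.0 - p_device) ** n_devices


# Hypothetical value for illustration only, not a rackspace figure.
p = 0.01
one_device = volume_failure_probability(p, 1)
two_devices = volume_failure_probability(p, 2)
# For small p, two_devices is very close to 2 * p.
```

For small failure probabilities the result is almost exactly double,
which is why a single block device is the safer layout here.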
Timeline (in UTC)
- 05:39 first XFS error message appears in the log
- 08:41 misc is pinged on irc
- 08:56 misc acks and diagnoses the issue
- 09:00 the server and service are back to normal
- 09:00 first mail about the problem hits gluster-infra
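
The intervals implied by the timeline can be worked out as below. This is
only a sketch: the helper name is made up for illustration, and it assumes
all timestamps fall on the same UTC day, as they do here.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the timeline (UTC).
time_to_notice = minutes_between("05:39", "08:41")   # first error -> ping on irc
time_to_ack = minutes_between("08:41", "08:56")      # ping -> ack and diagnosis
time_to_fix = minutes_between("08:56", "09:00")      # diagnosis -> service back
total_outage = minutes_between("05:39", "09:00")     # first error -> service back
```

That gives 182 minutes before anyone noticed, 15 minutes to ack, 4 minutes
to fix, and 201 minutes of outage in total. Most of the outage was spent
undetected, which is what automated monitoring would address.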
Potential improvements to make:
- add monitoring on the gluster side
- use the centos sig repo on the ovirt side
- add more sysadmins for gluster
- add a redundant service for that
  - a 2nd download server with a shared gluster backend
- migrate the storage to a proper setup with a single block device,
  rather than 2
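
As a starting point for the monitoring item, a check along these lines
would have flagged the 403 within one polling interval. It is a minimal
sketch using only the Python standard library, not a proposal for the
final tooling; the function names and the example URL are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify_status(status: Optional[int]) -> str:
    """Map an HTTP status code (None = no response) to a monitoring state."""
    if status is None:
        return "critical"  # server unreachable or timed out
    if 200 <= status < 400:
        return "ok"
    return "critical"  # 4xx/5xx, including the 403 seen in this incident


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if there was no response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx, but the status code is still available.
        return error.code
    except (urllib.error.URLError, OSError):
        return None
```

Run from cron or a monitoring system, this would look like
classify_status(fetch_status("https://download.example.org/")), with the
placeholder URL replaced by the real download server, and an alert sent
whenever the result is not "ok".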
Sysadmin, Community Infrastructure and Platform, OSAS