Download.gluster.org 27 April 2016 postmortem

Hi,

as promised, here is the post-mortem of the incident. If you would like to see more information, or have any remarks, please do not hesitate, since this is our first attempt at one.

I modelled it on the example in http://shop.oreilly.com/product/0636920041528.do (Appendix D), as that is the book I am reading at the moment. We will formalize that later.

Download.gluster.org was not serving files

Date: 2016-04-27

Participating people:
- misc

Summary:

The download.gluster.org HTTP server was returning error 403 for all URLs, which impacted oVirt Jenkins jobs and users of the repository, among others. The server is used to distribute Gluster RPMs.

Impact:
- oVirt CI jobs got blocked
- users couldn't install Gluster

Root cause:

The underlying block device on Rackspace was down for an undiagnosed reason, triggering XFS errors on the server and thus 403 at the HTTP level.

The root cause of the block device error is still unknown; no error has been seen on the Rackspace status page for this DC. A ticket was opened with Rackspace to see what was going on (160427-iad-0000814). A follow-up to this post-mortem will be done if the ticket says something more than "shit happens".

Resolution:

The whole server was rebooted, and upon reboot, the block device came back.
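The symptom described above (every URL answering 403) is something an external probe can catch. A minimal sketch in Python, assuming a plain HTTP HEAD request against the server root is representative of the service; the function names `probe` and `classify_status` and the 10-second timeout are illustrative choices, not part of any existing Gluster tooling:

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status code (or None) to a monitoring verdict."""
    if status is None:
        return "CRITICAL: no HTTP response"
    if 200 <= status < 400:
        return "OK"
    return f"CRITICAL: HTTP {status}"


def probe(url, timeout=10):
    """Send a HEAD request; return the status code, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers (such as 403) are raised as HTTPError.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None
```

Run from cron or a monitoring system, `classify_status(probe("http://download.gluster.org/"))` would presumably have reported `CRITICAL: HTTP 403` while the block device was down.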
Lessons learned:
- what went well:
  - people notified the admin quickly on IRC and on gluster-infra
- where we were lucky:
  - the server and block device came back immediately
  - it failed during EMEA business hours, with misc being on IRC (just arrived at the office)
- what went badly:
  - we do not have proper HA for the service
  - we do not have automated monitoring for it
  - the setup uses 2 block devices of 120G in LVM, thus making it twice as likely to fail

Timeline (in UTC):
- 05:39 first error message in the log about an XFS error
- 08:41 misc is pinged on IRC
- 08:56 misc acks and diagnoses the issue
- 09:00 the server and service are back to normal
- 09:00 first mail about the problem hits gluster-infra

Potential improvements to make:
- add monitoring on the Gluster side
- use the CentOS SIG repo on the oVirt side
- add more sysadmins for Gluster
- add a redundant service for that
  - a 2nd download server with a shared Gluster backend
- migrate the storage to a proper setup with a single block device, rather than 2
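The "twice as risky" point about the two block devices can be checked with a little arithmetic. A rough sketch, assuming the two devices fail independently with the same probability and that the LVM volume has no redundancy, so losing either device breaks the filesystem; the 1% figure is purely hypothetical:

```python
def volume_failure_probability(p_device, n_devices):
    """Chance that at least one of n_devices independent devices fails.

    Assumes a volume with no redundancy: one failed device takes it down.
    """
    return 1.0 - (1.0 - p_device) ** n_devices


# Hypothetical per-device failure probability over some fixed period.
p = 0.01
one_device = volume_failure_probability(p, 1)
two_devices = volume_failure_probability(p, 2)
ratio = two_devices / one_device  # close to 2 when p is small
```

For two devices the result is 2p - p^2, so for small p the volume is almost exactly twice as likely to fail as a single device would be.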
--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

Excellent post-mortem!

Do you think it's worth adding mirrors to the gluster repos, like oVirt is doing? [1]

[1] http://ovirt-infra-docs.readthedocs.org/en/latest/General/Mirror.html

--
Eyal Edri
Associate Manager
RHEV DevOps
EMEA ENG Virtualization R&D
Red Hat Israel

phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)

_______________________________________________
Infra mailing list
Infra@ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra

On Wednesday 27 April 2016 at 14:39 +0300, Eyal Edri wrote:
> Excellent post-mortem!
>
> Do you think it's worth adding mirrors to gluster repos like oVirt is doing? [1]
>
> [1] http://ovirt-infra-docs.readthedocs.org/en/latest/General/Mirror.html

That could be a solution.

But we have the resources to host a mirror ourselves in the DC; it just needs an IP address and a migration of servers (which is taking an awful lot of time to happen :/ ).

One issue we would have with a mirror is the download stats. This, and the need to have a mirrorlist; I am not sure how that's done on the dnf/yum side these days.

--
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
participants (2)
- Eyal Edri
- Michael Scherer