Re: [DRAFT] Outage :: No disk space :: 2012-08-30

1 Sep 2012


      On 09/01/2012 04:43 PM, Mike Burns wrote:
...
----- Original Message -----
...
Hi:
I didn't really participate in this outage, so I thought others could
help us draft up notes about it. I put a bare-bones outline below.
One outcome we need to look at is, what do people do when they perceive
services are out?
Of course, if the service is the wiki, they can't check that for what to
do ...
How do we communicate when major communication services of ovirt.org are
down? IRC is great but not enough ... If we can arrange for a reliable
third-party mail relay to alias a page to the Infra team, great, but how
do we keep it from getting spam?
Another angle to resolve is service monitoring so we know when things go
out rather than waiting for service users to tell us. I got some direct
emails from people (since the infra@ list wasn't working), but I was
unavailable and unaware of the problem until Robert called me when he
was working on fixing it. I don't mind getting pager alerts, as long as
we can tune things so they are not crazy often. :)
I think we should have multiple places that we notify.
1.  IRC
2.  wiki page
3.  infra list
4.  someplace on WordPress (preferably the main page).
...
This should be sufficient long term (i.e. once we have a better hosting solution than just the kitchen sink box).
I agree with getting service monitoring set up as well.  We can even accomplish this to a certain extent with Jenkins (and a separate non-Jenkins cron job to monitor Jenkins).
Mike
Monitoring requires hardware unless someone is able to add it to their
existing system.  I use a very basic monitor inside my Cerberus Email
Response / Help desk system.  We can also add something to do basic jobs
inside Jenkins.  I have a 5-seat license for Cerberus that would allow
us to have a ticketing / monitoring system.  It is written in PHP and is
pretty easy to customize.  I am heavily involved in the project and I
am already running Cerberus on my own VPS.

I agree that is a good list.  We need to make sure that we have each
other's cell phone info in the event we need to get a hold of someone,
although that might be hard for the people in different countries al
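
For the cron or Jenkins style of monitoring discussed above, a minimal
sketch might look like the following. It is only an illustration, not
something we run today: the URLs are assumptions built from the host
names in this thread, and the function names (check_http,
failing_services, main) are made up for the example.

```python
import http.client
import urllib.error
import urllib.request

# Assumed endpoints, derived from host names mentioned in this thread.
# Adjust to whatever the real services answer on.
SERVICES = {
    "wiki": "http://wiki.ovirt.org/",
    "lists": "http://lists.ovirt.org/mailman/listinfo/infra",
}


def check_http(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Connection errors, timeouts, HTTP 4xx/5xx, and malformed URLs
        # are all treated as "service is down".
        return False


def failing_services(services, probe=check_http):
    """Return the sorted names of services whose probe fails."""
    return sorted(name for name, url in services.items() if not probe(url))


def main():
    """Print one line per failing service; return a shell exit code."""
    down = failing_services(SERVICES)
    for name in down:
        print("DOWN: %s (%s)" % (name, SERVICES[name]))
    return 1 if down else 0
```

A cron job or a Jenkins job could call main() and treat a non-zero
return as a failed check. To be useful during an outage like this one,
it would have to run from a host other than the box being monitored.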
...
...
== What occurred ==
Even the doubled disk space on linode01.ovirt.org (to 25 GB) wasn't
enough to last long.
I made a mistake, and the reaper script wasn't purging old
ovirt-node.iso files, which used up a lot of space.
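
A disk filling up like this is the sort of thing a small periodic check
could flag before services fall over. The sketch below is hypothetical:
the 85% threshold and the function names are assumptions for
illustration, not an existing script.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of used space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; avoid dividing by zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def is_over_threshold(percent_used, threshold=85.0):
    """Return True when usage is at or above the (assumed) threshold."""
    return percent_used >= threshold
```

Run from cron, a wrapper could mail or page the Infra team whenever
is_over_threshold(disk_usage_percent("/")) is true.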
== When ==
XXXX?
date -d "2012-08-30 XXXX UTC"
== Affected services ==
lists.ovirt.org
wiki.ovirt.org
ovirt.org/.*
ovirtbot
Gerrit backup
Jenkins backup
[[What else?]]
== Responses to take ==
* Get new hosting solution in place.
* Double current disk space before new hosting move, to give us room to
breathe.
* Work up a response plan that is posted in the IRC topic or somewhere
good so people know how to contact all of the Infra team when something
is happening.
* New service need: monitoring server
Fix the reaper script to properly purge the nightly files.
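
As a rough idea of what the purge step could look like (this is not the
actual reaper script; the "*.iso" pattern, the keep count, and the
function name are assumptions), one option is to keep only the newest
few nightly files and remove the rest:

```python
import glob
import os


def purge_old_files(directory, pattern="*.iso", keep=3, dry_run=True):
    """Remove all but the `keep` newest files matching pattern.

    Returns the list of stale paths. With dry_run=True nothing is
    deleted, so the result can be reviewed first.
    """
    if keep < 0:
        raise ValueError("keep must be >= 0")
    paths = glob.glob(os.path.join(directory, pattern))
    # Newest first, judged by modification time.
    paths.sort(key=os.path.getmtime, reverse=True)
    stale = paths[keep:]
    for path in stale:
        if not dry_run:
            os.remove(path)
    return stale
```

Defaulting to dry_run=True is deliberate: a purge script that deletes
the wrong thing is worse than one that deletes nothing, so the real
delete has to be asked for explicitly.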
Thanks
Robert
...
...
--
Karsten 'quaid' Wade, Sr. Analyst - Community Growth
http://TheOpenSourceWay.org  .^\  http://community.redhat.com
@quaid (identi.ca/twitter/IRC)  \v'  gpg: AD0E0C41
_______________________________________________
Infra mailing list
Infra@ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra
-- 
Thanks
Robert Middleswarth
@rmiddle (twitter/Freenode IRC)
@RobertM (OFTC IRC)