On 09/01/2012 04:43 PM, Mike Burns wrote:
----- Original Message -----
> I didn't really participate in this outage, so I thought others could
> help us draft up notes about it. I put some barebones below.
> One outcome we need to look at is, what do people do when the
> services are out?
> Of course, if the service is the wiki, they can't check that for what
> to do ...
> How do we communicate when major communication services of ovirt.org
> are down? IRC is great but not enough ... If we can arrange for a
> third-party mail relay to alias a page to the Infra team, great, but
> how do we keep it from getting spam?
> Another angle to resolve is service monitoring so we know when things
> are out rather than waiting for service users to tell us. I got some
> emails from people (since the infra@ list wasn't working), but I was
> unavailable and unaware of the problem until Robert called me when he
> was working on fixing it. I don't mind getting pager alerts, as long
> as we can tune things so they are not crazy often. :)
I think we should have multiple places that we notify.
2. wiki page
3. infra list
4. someplace on wordpress (preferably the main page).
I agree that is a good list.
We need to make sure that we have each
other's cell phone info in the event we need to get hold of someone,
although that might be hard for the people in different countries.
This should be sufficient long term (i.e. once we have a better
hosting solution than just the kitchen sink box).
I agree with getting service monitoring set up as well. We can even accomplish this to a
certain extent with Jenkins (and a separate non-Jenkins cron job to monitor Jenkins).
Monitoring requires hardware unless someone is able to add it to their
existing system. I use a very basic monitor inside my Cerberus Email
Response / Help desk system. We can also add something to do basic jobs
inside Jenkins. I have a 5-seat license for Cerberus that would allow
us to have a ticketing / monitoring system. It is written in PHP and
pretty easy to customize. I am heavily involved in the project and I
am already running Cerberus on my own VPS.
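
As a rough illustration of the cron-driven check discussed above, here is
a minimal sketch in Python. It is only an assumption about how such a
check could look, not something that is deployed: the function names
check_service and failed_services are hypothetical, and the list of URLs
to probe would have to be supplied. It would need to run on a host other
than the one being monitored.

```python
# Hypothetical availability check intended to be called from a cron job
# on a separate host. Function names are illustrative only.
import urllib.error
import urllib.request


def check_service(url, timeout=10, opener=urllib.request.urlopen):
    """Return (ok, detail) for a single HTTP(S) endpoint."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except (urllib.error.URLError, OSError) as exc:
        # Covers DNS failures, refused connections, timeouts and the
        # HTTPError that urlopen raises for 4xx/5xx answers.
        return False, f"unreachable: {exc}"
    if 200 <= status < 400:
        return True, f"HTTP {status}"
    return False, f"HTTP {status}"


def failed_services(urls, **kwargs):
    """Return {url: detail} for every endpoint that did not answer cleanly."""
    failures = {}
    for url in urls:
        ok, detail = check_service(url, **kwargs)
        if not ok:
            failures[url] = detail
    return failures
```

A wrapper script could print one line per entry in failed_services(...)
and exit non-zero. Cron typically mails any output to the job owner, so
that alone would give a crude alert, assuming mail on the monitoring host
does not depend on the box that is down.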
> == What occurred ==
> Even the doubled disk space on linode01.ovirt.org (to 25 GB) wasn't
> enough to last long.
I made a mistake and the reaper script wasn't purging
the nightly files, which were using up a lot of space.
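
Since the root cause was the disk filling up, a free-space check would
probably have caught this earlier. The following is a minimal sketch
only; the function name low_on_space and the 10% threshold are
assumptions picked for illustration, not agreed values.

```python
# Hypothetical free-space check; name and threshold are placeholders.
import shutil


def low_on_space(path="/", min_free_fraction=0.10, disk_usage=shutil.disk_usage):
    """Return True when the filesystem holding `path` has less free
    space than `min_free_fraction` of its total size."""
    usage = disk_usage(path)
    return usage.free / usage.total < min_free_fraction
```

Run from cron, something like this could warn us while there is still
time to clean up, instead of us finding out after services stop.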
> == When ==
> date -d "2012-08-30 XXXX UTC"
> == Affected services ==
> Gerrit backup
> Jenkins backup
> [[What else?]]
> == Responses to take ==
> * Get new hosting solution in place.
> * Double current disk space before new hosting move, to give us room
> * Work up a response plan that is posted in the IRC topic or somewhere
> good so people know how to contact all of the Infra team when something
> is happening.
> * New service need: monitoring server
Fix the reaper script to properly purge
the nightly files.
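
For reference, the purge step could look roughly like the sketch below.
This is not the actual reaper script. The function name purge_old_files,
the glob pattern and the retention period are all assumptions for
illustration. The dry_run flag is there so the selection can be checked
before anything is deleted.

```python
# Hypothetical purge routine; not the real reaper script.
import time
from pathlib import Path


def purge_old_files(directory, max_age_days, pattern="*", now=None, dry_run=False):
    """Delete regular files in `directory` matching `pattern` whose
    modification time is older than `max_age_days`.

    Returns the paths that were removed (or would be, with dry_run).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Calling it first with dry_run=True and reading the returned list is a
cheap way to confirm that the pattern really matches the nightly files
and nothing else.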
> Karsten 'quaid' Wade, Sr. Analyst - Community Growth
> @quaid (identi.ca/twitter/IRC) \v' gpg: AD0E0C41
> Infra mailing list
@rmiddle (twitter/Freenode IRC)
@RobertM (OFTC IRC)