On 09/01/2012 04:43 PM, Mike Burns wrote:
----- Original Message -----
> Hi:
>
> I didn't really participate in this outage, so I thought others could
> help us draft up notes about it. I put some barebones below.
>
> One outcome we need to look at is, what do people do when they
> perceive services are out?
>
> Of course, if the service is the wiki, they can't check that for what
> to do ...
>
> How do we communicate when major communication services of ovirt.org
> are down? IRC is great but not enough ... If we can arrange for a
> reliable third-party mail relay to alias a page to the Infra team,
> great, but how do we keep it from getting spam?
>
> Another angle to resolve is service monitoring so we know when things
> go out rather than waiting for service users to tell us. I got some
> direct emails from people (since the infra@ list wasn't working), but
> I was unavailable and unaware of the problem until Robert called me
> when he was working on fixing it. I don't mind getting pager alerts,
> as long as we can tune things so they are not crazy often. :)
I think we should have multiple places that we notify.
1. IRC
2. wiki page
3. infra list
4. someplace on WordPress (preferably the main page).
I agree that is a good list.
We need to make sure that we have each other's cell phone info in the
event we need to get hold of someone, although that might be hard for
the people in different countries.
This should be sufficient long term (i.e. once we have a better
hosting solution than just the kitchen sink box).
I agree with getting service monitoring set up as well. We can even
accomplish this to a certain extent with Jenkins (and a separate
non-Jenkins cron job to monitor Jenkins).
Mike
Monitoring requires hardware unless someone is able to add it to their
existing system. I use a very basic monitor inside my Cerberus Email
Response / Help desk system. We can also add something to do basic jobs
inside Jenkins. I have a 5-seat license for Cerberus that would allow
us to have a ticketing / monitoring system. It is written in PHP and
pretty easy to customize. I am heavily involved in the project and I
am already running Cerberus on my own VPS.
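As a rough illustration of the basic monitoring discussed above (a cron
job that probes each service, Jenkins included), here is a minimal
Python sketch. The service names and URLs are assumptions taken from
the hostnames mentioned in this thread, and find_down() and probe() are
hypothetical helper names, not part of any existing tool.

```python
import http.client
import urllib.request

# Hypothetical endpoints: hostnames come from this thread, but the exact
# URLs to probe are an assumption and would need to be confirmed.
SERVICES = {
    "wiki": "http://wiki.ovirt.org/",
    "lists": "http://lists.ovirt.org/mailman/listinfo/infra",
    "www": "http://ovirt.org/",
}


def probe(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def find_down(services, probe_fn=probe):
    """Return the sorted names of services whose probe failed."""
    return sorted(name for name, url in services.items() if not probe_fn(url))


def main():
    """Print failing services and return a non-zero exit code if any are down."""
    down = find_down(SERVICES)
    if down:
        print("DOWN: " + ", ".join(down))
        return 1
    return 0
```

Run from cron on a box other than the one being monitored, calling
main() and mailing any output would give a first alert without waiting
for users to report the outage. The probe function is injectable so the
logic can be tested without network access.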
> == What occurred ==
>
> Even the doubled disk space on linode01.ovirt.org (to 25 GB) wasn't
> enough to last long.
I made a mistake and the reaper script wasn't purging old
ovirt-node.iso files, which used up a lot of space.
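Since a full disk was the root cause here, a simple disk usage check
might have flagged the problem before services failed. A minimal Python
sketch follows; the 90% threshold and the function names are
assumptions for illustration only.

```python
import shutil


def percent_used(total, used):
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return 100.0 * used / total if total else 0.0


def check_disk(path="/", threshold=90.0):
    """Return (percent_used, is_over_threshold) for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    return pct, pct >= threshold
```

For example, percent_used(25, 20) gives 80.0, so a 25 GB disk with
20 GB used would still be under a 90% threshold.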
>
> == When ==
>
> XXXX?
>
> date -d "2012-08-30 XXXX UTC"
>
> == Affected services ==
>
> lists.ovirt.org
> wiki.ovirt.org
> ovirt.org/.*
> ovirtbot
> Gerrit backup
> Jenkins backup
> [[What else?]]
>
> == Responses to take ==
>
> * Get new hosting solution in place.
> * Double current disk space before new hosting move, to give us room
> to breathe.
> * Work up a response place that is posted in the IRC topic or
> somewhere good so people know how to contact all of the Infra team
> when something is happening.
> * New service need: monitoring server
* Fix the reaper script to properly purge the nightly files.
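The actual reaper script is not shown in this thread, so the following
is only a sketch of one way such a purge could work: keep the newest
few nightly ISO files in a directory and remove the rest. The "*.iso"
pattern, the keep count of 3, and the function names are all
assumptions.

```python
from pathlib import Path


def stale_files(directory, pattern="*.iso", keep=3):
    """Return matching files, oldest first, excluding the newest `keep`."""
    files = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    return files[:-keep] if keep > 0 else files


def purge(directory, pattern="*.iso", keep=3, dry_run=True):
    """Delete stale files (only when dry_run is False) and return them."""
    removed = stale_files(directory, pattern, keep)
    for path in removed:
        if not dry_run:
            path.unlink()
    return removed
```

Defaulting to dry_run=True means a first run only reports what would be
deleted, which makes it easier to verify the selection before anything
is removed.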
Thanks
Robert
>
> --
> Karsten 'quaid' Wade, Sr. Analyst - Community Growth
> http://TheOpenSourceWay.org .^\ http://community.redhat.com
> @quaid (identi.ca/twitter/IRC) \v' gpg: AD0E0C41
>
>
> _______________________________________________
> Infra mailing list
> Infra(a)ovirt.org
> http://lists.ovirt.org/mailman/listinfo/infra
>
--
Thanks
Robert Middleswarth
@rmiddle (twitter/Freenode IRC)
@RobertM (OFTC IRC)