[DRAFT] Outage :: No disk space :: 2012-08-30
Robert Middleswarth
robert at middleswarth.net
Sun Sep 2 02:26:14 UTC 2012
On 09/01/2012 04:43 PM, Mike Burns wrote:
>
> ----- Original Message -----
>> Hi:
>>
>> I didn't really participate in this outage, so I thought others could
>> help us draft up notes about it. I put some barebones below.
>>
>> One outcome we need to look at is, what do people do when they
>> perceive services are out?
>>
>> Of course, if the service is the wiki, they can't check that for
>> what to do ...
>>
>> How do we communicate when major communication services of ovirt.org
>> are down? IRC is great but not enough ... If we can arrange for a
>> reliable third-party mail relay to alias a page to the Infra team,
>> great, but how do we keep it from getting spam?
>>
>> Another angle to resolve is service monitoring so we know when things
>> go out rather than waiting for service users to tell us. I got some
>> direct emails from people (since the infra@ list wasn't working), but
>> I was unavailable and unaware of the problem until Robert called me
>> when he was working on fixing it. I don't mind getting pager alerts,
>> as long as we can tune things so they are not crazy often. :)
>
> I think we should have multiple places that we notify.
>
> 1. IRC
> 2. wiki page
> 3. infra list
> 4. someplace on wordpress (preferably the main page).
I agree that is a good list. We need to make sure that we have each
other's cell phone info in the event we need to get hold of someone,
although that might be hard for the people in different countries al
> This should be sufficient long term (i.e. once we have a better hosting solution than just the kitchen sink box).
>
> I agree with getting service monitoring set up as well. We can even accomplish this to a certain extent with Jenkins (and a separate non-Jenkins cron job to monitor Jenkins).
>
> Mike
Monitoring requires hardware unless someone is able to add it to their
existing system. I use a very basic monitor inside my Cerberus Email
Response / Help desk system. We can also add something to do basic jobs
inside Jenkins. I have a 5-seat license for Cerberus that would allow
us to have a ticketing / monitoring system. It is written in PHP and
pretty easy to customize. I am heavily involved in the project and I
am already running Cerberus on my own VPS.
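
Since this outage came down to a full disk, one basic job that could run
from cron or Jenkins is a disk usage check. Below is a minimal Python
sketch of that idea. The function names, the default path and the 90%
threshold are assumptions for illustration only; nothing like this is
deployed today.

```python
import shutil


def usage_percent(total, used):
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def check_disk(path="/", threshold=90.0):
    """Return (ok, percent) for the filesystem that holds `path`.

    `threshold` is a hypothetical alert level, not an agreed value.
    `ok` is True while usage stays below the threshold.
    """
    usage = shutil.disk_usage(path)
    percent = usage_percent(usage.total, usage.used)
    return percent < threshold, percent


def report(path="/", threshold=90.0):
    """Build a one-line status message and a shell-style exit code."""
    ok, percent = check_disk(path, threshold)
    status = "OK" if ok else "WARNING"
    message = f"{status}: {path} is {percent:.1f}% full"
    # Exit code 1 lets cron mail the output or Jenkins fail the job.
    return message, 0 if ok else 1
```

A wrapper would print the message from report() and exit with the
returned code, so a failing run shows up without anyone having to
notice the outage first.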
>> == What occurred ==
>>
>> Even the doubled disk space on linode01.ovirt.org (to 25 GB) wasn't
>> enough to last long.
I made a mistake: the reaper script wasn't purging old ovirt-node.iso
files, which used up a lot of space.
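
For reference, the missing purge step could look roughly like the Python
sketch below. It keeps the newest few images by modification time and
removes the rest. The function name, the "ovirt-node*.iso" pattern and
the keep count are assumptions for illustration; this is not the actual
script.

```python
from pathlib import Path


def purge_old_isos(directory, keep=3, pattern="ovirt-node*.iso", dry_run=False):
    """Delete all but the `keep` newest files matching `pattern`.

    Returns the names of the stale files, newest first. With
    dry_run=True nothing is deleted, so the list can be reviewed first.
    The pattern and keep count are hypothetical defaults.
    """
    if keep < 0:
        raise ValueError("keep must not be negative")
    # Newest first, judged by modification time.
    images = sorted(
        Path(directory).glob(pattern),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    stale = images[keep:]
    for image in stale:
        if not dry_run:
            image.unlink()
    return [image.name for image in stale]
```

Running it once with dry_run=True against the nightly directory would
show what it intends to remove before anything is actually deleted.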
>>
>> == When ==
>>
>> XXXX?
>>
>> date -d "2012-08-30 XXXX UTC"
>>
>> == Affected services ==
>>
>> lists.ovirt.org
>> wiki.ovirt.org
>> ovirt.org/.*
>> ovirtbot
>> Gerrit backup
>> Jenkins backup
>> [[What else?]]
>>
>> == Responses to take ==
>>
>> * Get new hosting solution in place.
>> * Double current disk space before new hosting move, to give us room
>> to breathe.
>> * Work up a response plan that is posted in the IRC topic or
>> somewhere good so people know how to contact all of the Infra team
>> when something is happening.
>> * New service need: monitoring server
Fix the reaper script to properly purge the nightly files.
Thanks
Robert
>>
>> --
>> Karsten 'quaid' Wade, Sr. Analyst - Community Growth
>> http://TheOpenSourceWay.org .^\ http://community.redhat.com
>> @quaid (identi.ca/twitter/IRC) \v' gpg: AD0E0C41
>>
>>
>> _______________________________________________
>> Infra mailing list
>> Infra at ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/infra
>>
--
Thanks
Robert Middleswarth
@rmiddle (twitter/Freenode IRC)
@RobertM (OFTC IRC)