Unexpected outage post mortem: ovirt.org website

Dave Neary dneary at redhat.com
Thu Mar 21 16:08:52 UTC 2013


Hi everyone,

The oVirt website was down for roughly 12 hours. I am not sure of the 
exact duration, but from the log files it looks like the outage started 
around 20:00 EST on March 20th and ended around 08:00 EST on March 21st.

The cause was that we exceeded our disk quota on OpenShift. We have a 
5 GB disk allowance, and over 3 GB of it was taken up by Apache error 
log files.

The short-term fix was to free some disk space by deleting old log 
files. This is a stop-gap that will last approximately a month, because 
we are generating ~100 MB of error log files per day, due to PHP errors 
related to the theme used on the website.
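
As a rough check on the "approximately a month" estimate, here is a 
small sketch of the arithmetic. The 3000 MB figure is an assumption (it 
takes the "over 3GB" of error logs as the amount freed; the exact amount 
deleted is not recorded here), and the function name is made up for 
illustration.

```python
def days_until_full(free_mb: float, growth_mb_per_day: float) -> float:
    """Days until the free space is used up at a constant growth rate."""
    if growth_mb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return free_mb / growth_mb_per_day


# Assumption: about 3 GB (taken here as 3000 MB) was freed,
# and logs keep growing at the observed ~100 MB per day.
print(days_until_full(3000, 100))  # 30.0
```

If less space was actually freed, or the error rate goes up, the margin 
shrinks proportionally.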

A longer-term band-aid would be to implement a logrotate scheme which 
only keeps (say) 15 days' worth of logs. If we do not fix the root issue, 
and OpenShift does not provide us with a way to set a logrotate policy, 
we will need to schedule log cleanup as a regular maintenance task.
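
If it comes to a manual maintenance task, something along these lines 
could be run periodically. This is only a sketch, not what is deployed: 
the function name, the "*.log*" file pattern and the directory passed in 
are all hypothetical, and it assumes the logs are already split into 
separate files over time (a single ever-growing error log would never 
look "old" and would not be touched).

```python
import time
from pathlib import Path


def prune_old_logs(log_dir, max_age_days=15, pattern="*.log*", now=None):
    """Delete files in log_dir matching pattern that were last modified
    more than max_age_days ago. Returns the list of removed paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # Only regular files whose last modification predates the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Calling prune_old_logs("/path/to/logs") would keep the last 15 days of 
files; the path is a placeholder, since the real log location on the 
OpenShift gear is not given here.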

The correct fix would be to upgrade the MediaWiki theme to pick up a 
number of fixes which have already been included upstream. This should 
happen as soon as possible, but the person best able to do it (Garrett, 
the theme's author) will not have any time to spend on this task for at 
least a month.

Regards,
Dave.

-- 
Dave Neary - Community Action and Impact
Open Source and Standards, Red Hat - http://community.redhat.com
Ph: +33 9 50 71 55 62 / Cell: +33 6 77 01 92 13
