Unexpected outage post mortem: ovirt.org website
Dave Neary
dneary at redhat.com
Thu Mar 21 16:08:52 UTC 2013
Hi everyone,
The oVirt website was down for part of today. I am not sure exactly how
long the outage lasted, but from the log files it looks like it started
around 20h EST on March 20th and ended around 8h EST on March 21st.
The issue was that we exceeded our disk quota on OpenShift. We have a
5GB disk allowance, and we had over 3GB of error log files from Apache.
The short-term fix was to free some disk space by deleting some old log
files. This is a stop-gap that will last approximately a month, because
we are generating ~100MB of error log files per day, due to some PHP
errors related to the theme used on the website.
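As a rough check on the "about a month" estimate, the arithmetic can be
sketched as below. The 3000 MB of freed space is an assumption for
illustration (the exact amount deleted is not stated here), and the
function name is hypothetical.

```python
import math


def days_until_quota(free_mb: float, growth_mb_per_day: float) -> int:
    """Whole days of log growth that fit in the remaining free space."""
    if growth_mb_per_day <= 0:
        raise ValueError("growth_mb_per_day must be positive")
    return math.floor(free_mb / growth_mb_per_day)


# Assumed figures: roughly 3000 MB freed, logs growing ~100 MB per day.
print(days_until_quota(3000, 100))  # prints 30
```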
A longer-term band-aid would be to implement a logrotate scheme which
only keeps (say) 15 days' worth of logs. If we do not fix the root issue,
and OpenShift does not provide us with a way to set a logrotate policy,
we will need to plan this as a regular maintenance task.
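If the cleanup does end up as a manual maintenance task, it could look
something like the sketch below. This is not a logrotate policy and not
anything OpenShift provides; the function name, the "*.log*" file pattern
and the example path are all assumptions. It defaults to a dry run so
nothing is deleted by accident.

```python
import time
from pathlib import Path


def prune_old_logs(log_dir, max_age_days=15, pattern="*.log*",
                   dry_run=True, now=None):
    """Return log files older than max_age_days; delete them unless dry_run."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    stale = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # Compare the last-modified time against the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
            if not dry_run:
                path.unlink()
    return stale


# Hypothetical usage; the directory below is a placeholder, not the real path:
# for p in prune_old_logs("/path/to/apache/logs", dry_run=True):
#     print("would delete", p)
```

Running it first with dry_run=True shows what would be removed before
anything is actually deleted.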
The correct fix would be to upgrade the MediaWiki theme to include a
number of fixes which have already been included upstream. This should
happen as soon as possible, but the person best able to do it (Garrett,
the theme's author) will not have any time to spend on this task for at
least a month.
Regards,
Dave.
--
Dave Neary - Community Action and Impact
Open Source and Standards, Red Hat - http://community.redhat.com
Ph: +33 9 50 71 55 62 / Cell: +33 6 77 01 92 13