Wiki and Mailing Lists Outage -- 2012-11-14

Wed Nov 14 15:17:53 UTC 2012

We are running a disk space job on the Jenkins slaves: http://jenkins.ovirt.org/view/system-monitoring/job/check_disk_space_on_jenkins_slaves

It runs a script [1]; I guess we can clone this to check the other infra servers as well.

[1]
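#!/bin/sh
# Fail (exit 1) if any real filesystem is at or above 90% usage.
# -P keeps each filesystem on a single line, so the awk columns stay valid.
df -P | grep -vE '^Filesystem|tmpfs|cdrom|file.tlv|loop' | awk '{ print $5 " " $1 }' | while read -r usep partition
do
  echo "$usep $partition"
  usep=${usep%\%}   # strip the trailing '%'
  if [ "$usep" -ge 90 ]; then
    echo "Running out of space \"$partition ($usep%)\" on $(hostname) as of $(date)"
    # The loop is the last stage of the pipeline, so this becomes the script's exit status.
    exit 1
  fi
done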
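
To reuse this on the other infra servers, one option is to move the check into a function that reads `df -P` output on stdin, so the same code works locally or over ssh. This is only a rough sketch, not what the Jenkins job runs today: the function name check_usage, the configurable threshold, and the host name in the comment are all made up for illustration, and the site-specific file.tlv filter is left out.

```shell
#!/bin/sh
# Hypothetical helper: reads `df -P` output on stdin and reports every
# filesystem at or above the given usage threshold (default 90).
# Returns 1 if at least one filesystem is over the threshold, 0 otherwise.
check_usage() {
  awk -v limit="${1:-90}" '
    NR == 1 { next }                    # skip the df header line
    $1 ~ /tmpfs|cdrom|loop/ { next }    # skip pseudo filesystems
    {
      use = $5
      sub(/%/, "", use)                 # "93%" -> "93"
      if (use + 0 >= limit + 0) {
        printf "Running out of space \"%s (%s%%)\"\n", $1, use
        full = 1
      }
    }
    END { exit full + 0 }
  '
}

# Usage (the host name is a placeholder and key-based ssh is assumed):
#   df -P | check_usage 90
#   ssh infra-host.example.org df -P | check_usage 85
```

Unlike the script above, this reports every full filesystem before returning, so a single run shows everything that needs cleaning up.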

----- Original Message -----
> From: "Mike Burns" <mburns at redhat.com>
> To: "Doron Fediuck" <dfediuck at redhat.com>
> Cc: "infra" <infra at ovirt.org>, "users" <users at ovirt.org>, "board" <board at ovirt.org>
> Sent: Wednesday, November 14, 2012 3:58:17 PM
> Subject: Re: Wiki and Mailing Lists Outage -- 2012-11-14
> 
> On Wed, 2012-11-14 at 08:45 -0500, Doron Fediuck wrote:
> > Thanks Mike!
> > I suggest to have a cron alerting for no-space issues.
> 
> We run logwatch which is supposed to highlight these issues, but I
> suspect that no one is actually reading the logwatch report.  A separate
> cron job or monitoring service is also a possibility.
> 
> Mike
> > 
> > ----- Original Message -----
> > > From: "Mike Burns" <mburns at redhat.com>
> > > To: "board" <board at ovirt.org>, "infra" <infra at ovirt.org>, "users" <users at ovirt.org>
> > > Sent: Wednesday, November 14, 2012 3:31:11 PM
> > > Subject: Wiki and Mailing Lists Outage -- 2012-11-14
> > > 
> > > We experienced an outage today in both the wiki and the mailing lists.
> > >
> > > * Wiki content was available throughout the outage, but attempts to
> > >   login or edit received an error message about requiring cookies to
> > >   be enabled.
> > > * All mails to the mailing list failed to show up on the lists, but
> > >   also did not return rejection messages.
> > > 
> > > Cause:
> > > 
> > > This was caused by an "Out of Space" error on the host running both of
> > > these services.  A temporary workaround was put in place to get both
> > > services up and running again.
> > > 
> > > 
> > > Action Taken:
> > > 
> > > * Remove the oldest gerrit backup (600MB)
> > > * Remove some older non-functional ovirt-node-iso images and rpms from
> > >   the releases (source remains there)
> > > 
> > > Long term solution:
> > > 
> > > Migrating these services away from a single host onto hosted solutions
> > > (OpenShift, AlterWay).
> > > 
> > > Current Situation:
> > > 
> > > Wiki is back up and running, login works as expected
> > > Lists are processing the backlog of emails since the outage began.
> > > At this time, it does not appear that any mail was lost due to the
> > > outage.
> > > 
> > > 
> > > Thanks for the patience and understanding
> > > 
> > > Mike
> > > 
> > > _______________________________________________
> > > Infra mailing list
> > > Infra at ovirt.org
> > > http://lists.ovirt.org/mailman/listinfo/infra
> > > 
> > _______________________________________________
> > Board mailing list
> > Board at ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/board