[Users] Wiki and Mailing Lists Outage -- 2012-11-14

We experienced an outage today in both the wiki and the mailing lists.

* Wiki content was available throughout the outage, but attempts to log in or edit received an error message about requiring cookies to be enabled.
* All mails to the mailing list failed to show up on the lists, but also did not return rejection messages.

Cause:
This was caused by an "Out of Space" error on the host running both of these services. A temporary workaround was put in place to get both services up and running again.

Action Taken:
* Removed the oldest gerrit backup (600MB)
* Removed some older non-functional ovirt-node-iso images and rpms from the releases (source remains there)

Long term solution:
Migrating these services away from a single host onto hosted solutions (OpenShift, AlterWay).

Current Situation:
* Wiki is back up and running, login works as expected
* Lists are processing the backlog of emails since the outage began. At this time, it does not appear that any mail was lost due to the outage.

Thanks for the patience and understanding
Mike

Thanks, Mike! I suggest setting up a cron job that alerts on low disk space.

----- Original Message -----
From: "Mike Burns" <mburns@redhat.com> To: "board" <board@ovirt.org>, "infra" <infra@ovirt.org>, "users" <users@ovirt.org> Sent: Wednesday, November 14, 2012 3:31:11 PM Subject: Wiki and Mailing Lists Outage -- 2012-11-14
_______________________________________________ Infra mailing list Infra@ovirt.org http://lists.ovirt.org/mailman/listinfo/infra

On Wed, 2012-11-14 at 08:45 -0500, Doron Fediuck wrote:
Thanks Mike! I suggest to have a cron alerting for no-space issues.
We run logwatch which is supposed to highlight these issues, but I suspect that no one is actually reading the logwatch report. A separate cron job or monitoring service is also a possibility.

Mike
_______________________________________________ Board mailing list Board@ovirt.org http://lists.ovirt.org/mailman/listinfo/board

We are running a disk space job on a Jenkins slave:
http://jenkins.ovirt.org/view/system-monitoring/job/check_disk_space_on_jenk...

It runs a script [1]. I guess we can clone this to check other infra servers as well.

[1]
#!/bin/sh
# List usage per filesystem, skipping the header and pseudo/removable filesystems.
df -H | grep -vE '^Filesystem|tmpfs|cdrom|file.tlv|loop' | awk '{ print $5 " " $1 }' | while read output; do
    echo "$output"
    usep=$(echo "$output" | awk '{ print $1 }' | cut -d'%' -f1)
    partition=$(echo "$output" | awk '{ print $2 }')
    # Fail the job once any filesystem reaches 90% usage.
    if [ "$usep" -ge 90 ]; then
        echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)"
        exit 1
    fi
done

----- Original Message -----
From: "Mike Burns" <mburns@redhat.com> To: "Doron Fediuck" <dfediuck@redhat.com> Cc: "infra" <infra@ovirt.org>, "users" <users@ovirt.org>, "board" <board@ovirt.org> Sent: Wednesday, November 14, 2012 3:58:17 PM Subject: Re: Wiki and Mailing Lists Outage -- 2012-11-14
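A possible variation on the disk space script posted in this thread, for reuse on other infra servers. The sketch below wraps the check in a function that reads `df -P` output on stdin, so it can be tried against sample input before running on a real host. The function name `check_usage`, the skipped filesystem types, and the output format are assumptions for illustration; the default threshold of 90 is taken from the script above. This is not the job that actually runs on Jenkins.

```sh
#!/bin/sh
# Hypothetical helper (not from the thread): reads `df -P` output on stdin and
# prints "<mount point> <usage>%" for every filesystem at or above the threshold.
# Returns 1 if at least one filesystem is over the threshold, 0 otherwise.
# Assumption: mount points contain no spaces (they would be mis-parsed).
check_usage() {
    threshold=${1:-90}
    awk -v limit="$threshold" '
        NR == 1 { next }                     # skip the df header line
        $1 ~ /^(tmpfs|devtmpfs)$/ { next }   # skip in-memory filesystems
        {
            use = $5                         # Capacity column, e.g. "95%"
            sub(/%$/, "", use)
            if (use + 0 >= limit + 0) {
                printf "%s %s%%\n", $6, use
                found = 1
            }
        }
        END { exit (found ? 1 : 0) }
    '
}
```

On a host it could be called as `df -P | check_usage 90`. A non-zero exit status would fail a Jenkins job, and under cron the printed lines would typically be mailed to the address configured in MAILTO, which covers the alerting suggested earlier in the thread.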
participants (3)
- Doron Fediuck
- Eyal Edri
- Mike Burns