Outage Update - www.ovirt.org and gerrit.ovirt.org - Infra


Outage Update - www.ovirt.org and gerrit.ovirt.org

Ofer Schreiber

20 Mar 2012

3:18 p.m.

www.ovirt.org and gerrit.ovirt.org are now up and running.

We experienced two issues:
1. DB corruption on www.ovirt.org, caused by a full file system.
2. Faulty gerrit service, probably caused by #1.

Both issues were handled by the oVirt infra team (mburns, quaid and myself).

Thank you for your patience.

Ofer Schreiber
oVirt infra team


Eyal Edri

20 Mar

4:22 p.m.

If jenkins.ovirt.org has access to the other servers, we might be able to add system jobs that delete old files and such. I do this downstream to delete old files from multiple dirs on Jenkins slaves, running a command like:

  'sudo find . -type f -mtime +${days_to_keep} |grep -v ^\.$| sudo xargs rm -rf'

where days_to_keep is a parameter we can change. Of course, we'll have to make sure Jenkins has write/delete permission on the monitored dirs.

----- Original Message -----

...

From: "Ofer Schreiber" <oschreib@redhat.com> To: "users" <users@ovirt.org>, arch@ovirt.org, infra@ovirt.org Sent: Tuesday, March 20, 2012 4:18:57 PM Subject: Outage Update - www.ovirt.org and gerrit.ovirt.org

www.ovirt.org and gerrit.ovirt.org are now up and running.

We experienced two issues: 1. DB corruption on www.ovirt.org, caused by a full file system. 2. Faulty gerrit service, probably caused by #1.

Both issues were handled by oVirt infra team (mburns, quaid and myself)

Thank you for your patience.

Ofer Schreiber oVirt infra team _______________________________________________ Arch mailing list Arch@ovirt.org http://lists.ovirt.org/mailman/listinfo/arch

Mike Burns

5:48 p.m.

On Tue, 2012-03-20 at 11:22 -0400, Eyal Edri wrote:

...

If jenkins.ovirt.org will have access to the other servers, we might be able to add system jobs that deleted old files and such,

i do it downstream to delete old files from multiple dirs on jenkins slaves.

running a cmd like: 'sudo find . -type f -mtime +${days_to_keep} |grep -v ^\.$| sudo xargs rm -rf'

I think a command like this works well. I have something similar set up on a similar backup area. I don't think we need Jenkins to run this, though. A simple cron job with output mails going to root (which get sent to the infra list admins already) should be sufficient.

Mike

...

where days_to_keep is a param we can change. of course, we'll have to make sure jenkins has write/delete permission to the monitored dirs.

----- Original Message -----

...
From: "Ofer Schreiber" <oschreib@redhat.com> To: "users" <users@ovirt.org>, arch@ovirt.org, infra@ovirt.org Sent: Tuesday, March 20, 2012 4:18:57 PM Subject: Outage Update - www.ovirt.org and gerrit.ovirt.org

www.ovirt.org and gerrit.ovirt.org are now up and running.

We experienced two issues: 1. DB corruption on www.ovirt.org, caused by a full file system. 2. Faulty gerrit service, probably caused by #1.

Both issues were handled by oVirt infra team (mburns, quaid and myself)

Thank you for your patience.

Ofer Schreiber oVirt infra team _______________________________________________ Arch mailing list Arch@ovirt.org http://lists.ovirt.org/mailman/listinfo/arch

_______________________________________________ Arch mailing list Arch@ovirt.org http://lists.ovirt.org/mailman/listinfo/arch
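For reference, the cleanup discussed in the two messages above can be sketched as a small shell function. This is a hedged variant of Eyal's command, not the script anyone on the list actually deployed: the function name prune_old_files and its argument order are made up for illustration. It takes the directory as an explicit argument instead of relying on the current directory, drops the grep (redundant once -type f is used) and lets find run rm directly, which avoids problems with file names containing spaces. Each removed file is printed, so when the function is called from cron the mail sent to root lists what was deleted.

```shell
#!/bin/sh
# prune_old_files DIR DAYS
# Delete regular files under DIR whose mtime is more than DAYS days old.
# The function name and arguments are hypothetical, for illustration only.
prune_old_files() {
    dir=$1
    days=$2

    # Refuse to run unless both arguments are given and DIR exists,
    # so a typo never turns into a cleanup of the wrong place.
    if [ -z "$dir" ] || [ -z "$days" ] || [ ! -d "$dir" ]; then
        echo "usage: prune_old_files DIR DAYS (DIR must exist)" >&2
        return 1
    fi

    # -print lists each file before it is removed; cron mails that
    # output to MAILTO, giving a record of what was deleted.
    find "$dir" -type f -mtime +"$days" -print -exec rm -f -- {} +
}

# Example call (the path is a placeholder, not taken from the thread):
# prune_old_files /path/to/backups 14
```

Running the find line with the -exec part removed first gives a dry run that only lists the candidates.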

Karsten 'quaid' Wade

27 Mar

1:21 a.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/20/2012 09:48 AM, Mike Burns wrote:

...

A simple cron job with output mails going to root (which get sent to infra list admins already) should be sufficient.

Actually, logwatch is trying to send through to this list, and I've enabled it to be received by default, but the inbound email has "explicit address" or something that Mailman doesn't like. This is why we have an endless supply in the moderation queue, and occasionally I flush the queue to this list.

Ideally, we'd all be watching the logwatch here, etc.

- - Karsten
- --
name: Karsten 'quaid' Wade, Sr. Community Architect
team: Red Hat Community Architecture & Leadership
uri: http://communityleadershipteam.org http://TheOpenSourceWay.org
gpg: AD0E0C41
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iD8DBQFPcPny2ZIOBq0ODEERAiH5AKCNuwnH8xGsTuIrBI518BcpkJJenwCgwEEZ
6ptLboOOudZUdvQwJc5E0Do=
=w3JG
-----END PGP SIGNATURE-----

Karsten 'quaid' Wade

1:18 a.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/20/2012 08:22 AM, Eyal Edri wrote:

...

If jenkins.ovirt.org will have access to the other servers, we might be able to add system jobs that deleted old files and such,

That's an interesting idea. Is that a good way to handle this sort of thing? Akin to the way Puppet or Chef handle configurations?

...

i do it downstream to delete old files from multiple dirs on jenkins slaves.

running a cmd like: 'sudo find . -type f -mtime +${days_to_keep} |grep -v ^\.$| sudo xargs rm -rf'

OK, I just put that in a small shell script (below) that I put in root's crontab to run daily. I know things continue to be a bit hacky. Jason Brooks and I have been having discussions about how we can make it easier and more scalable to spin up project infrastructure, as this piecemeal approach is feeling organically cobbled-together instead of following a good plan. Maybe organic is fine, but it would help if we could just grab what we needed, as we needed it (planet? check. jenkins? check. etc.) without having to worry about all the infrastructure around it. To that end, Jason has been spinning up services using OpenShift quickstarts. - - Karsten - -- name: Karsten 'quaid' Wade, Sr. Community Architect team: Red Hat Community Architecture & Leadership uri: http://communityleadershipteam.org http://TheOpenSourceWay.org gpg: AD0E0C41 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iD8DBQFPcPkw2ZIOBq0ODEERAmMsAJ4jOdCRG+ey0f8sZyzmxT5uLiJLCwCfQInn q1aWIaGWCrUHQTt3YAgtHo0= =Vve9 -----END PGP SIGNATURE-----

Itamar Heim

8:11 a.m.

On 03/27/2012 01:18 AM, Karsten 'quaid' Wade wrote:

...

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On 03/20/2012 08:22 AM, Eyal Edri wrote:

...
If jenkins.ovirt.org will have access to the other servers, we might be able to add system jobs that deleted old files and such,

That's an interesting idea. Is that a good way to handle this sort of thing? Akin to the way Puppet or Chef handle configurations?

...
i do it downstream to delete old files from multiple dirs on jenkins slaves.

running a cmd like: 'sudo find . -type f -mtime +${days_to_keep} |grep -v ^\.$| sudo xargs rm -rf'

OK, I just put that in a small shell script (below) that I put in root's crontab to run daily.

I know things continue to be a bit hacky. Jason Brooks and I have been having discussions about how we can make it easier and more scalable to spin up project infrastructure, as this piecemeal approach is feeling organically cobbled-together instead of following a good plan. Maybe organic is fine, but it would help if we could just grab what we needed, as we needed it (planet? check. jenkins? check. etc.) without having to worry about all the infrastructure around it. To that end, Jason has been spinning up services using OpenShift quickstarts.

we started with openshift for jenkins: 1. it still needs a few features to allow it to work for the scale/space we need. 2. we need slaves which are bare metal for some of the tests.

Eyal Edri

11:13 a.m.

----- Original Message -----

...

From: "Karsten 'quaid' Wade" <kwade@redhat.com> To: infra@ovirt.org Sent: Tuesday, March 27, 2012 1:18:09 AM Subject: Re: Outage Update - www.ovirt.org and gerrit.ovirt.org

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On 03/20/2012 08:22 AM, Eyal Edri wrote:

...
If jenkins.ovirt.org will have access to the other servers, we might be able to add system jobs that deleted old files and such,

That's an interesting idea. Is that a good way to handle this sort of thing? Akin to the way Puppet or Chef handle configurations?

Well, it was suggested on the Jenkins mailing lists and IRC channel as a good quick solution for monitoring system jobs. It has numerous plugins that can suit all sorts of administration tasks.

Jenkins has an option of monitoring an external job (like cron jobs) [1], so that can be used also. Also, I found this interesting blog post about using Jenkins for system tasks [2].

I can't say if using Jenkins is better than having Nagios/Cacti or any other monitoring service; I guess it should be considered as one more option to solve the monitoring problem.

IMO Puppet is not a replacement for Nagios/monitoring; it's more a tool to make sure all your servers are aligned to your needs (rpms/repos/services running, etc.).

[1] https://wiki.jenkins-ci.org/display/JENKINS/Monitoring+external+jobs
[2] http://morgajel.net/2011/12/12/1108

Eyal.

...

...
i do it downstream to delete old files from multiple dirs on jenkins slaves.

running a cmd like: 'sudo find . -type f -mtime +${days_to_keep} |grep -v ^\.$| sudo xargs rm -rf'

OK, I just put that in a small shell script (below) that I put in root's crontab to run daily.

I know things continue to be a bit hacky. Jason Brooks and I have been having discussions about how we can make it easier and more scalable to spin up project infrastructure, as this piecemeal approach is feeling organically cobbled-together instead of following a good plan. Maybe organic is fine, but it would help if we could just grab what we needed, as we needed it (planet? check. jenkins? check. etc.) without having to worry about all the infrastructure around it. To that end, Jason has been spinning up services using OpenShift quickstarts.

- - Karsten - -- name: Karsten 'quaid' Wade, Sr. Community Architect team: Red Hat Community Architecture & Leadership uri: http://communityleadershipteam.org http://TheOpenSourceWay.org gpg: AD0E0C41 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iD8DBQFPcPkw2ZIOBq0ODEERAmMsAJ4jOdCRG+ey0f8sZyzmxT5uLiJLCwCfQInn q1aWIaGWCrUHQTt3YAgtHo0= =Vve9 -----END PGP SIGNATURE----- _______________________________________________ Infra mailing list Infra@ovirt.org http://lists.ovirt.org/mailman/listinfo/infra

Karsten 'quaid' Wade

8:48 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/27/2012 02:13 AM, Eyal Edri wrote:

...

...
Well, it was suggested on jenkins mailing lists and irc channel as a good quick solution for monitoring system jobs. It has numerous plugins that can suit all sorts of administration tasks.

Worth thinking about, especially if the project continues using Jenkins; then we keep our expertise in the same area.

...

...
Jenkins has an option of monitoring an external job (like cron jobs) [1], so that can be used also. also, found this interesting blog about using jenkins for system tasks [2].

...
Can't say if using jenkins is better than having a nagios/cacti or any other monitoring service, i guess it should be considered as one more option to solve the monitoring problem.

...
IMO i don't think puppet is a replacement for nagios/monitoring, it's more a tool to make sure all your servers are aligned to your needs (rpms/repos/services running,etc...).

Right, I was making a comparison in terms of what Puppet does for configuration, Jenkins could do for monitoring system tasks. That of course is all separate from what Nagios does for monitoring uptime of systems and services. Three separate roles. - - Karsten

...

...
[1] https://wiki.jenkins-ci.org/display/JENKINS/Monitoring+external+jobs

[2] http://morgajel.net/2011/12/12/1108

...

...
Eyal.

...
...
...
i do it downstream to delete old files from multiple dirs on jenkins slaves.

running a cmd like: 'sudo find . -type f -mtime +${days_to_keep} |grep -v ^\.$| sudo xargs rm -rf'

OK, I just put that in a small shell script (below) that I put in root's crontab to run daily.

I know things continue to be a bit hacky. Jason Brooks and I have been having discussions about how we can make it easier and more scalable to spin up project infrastructure, as this piecemeal approach is feeling organically cobbled-together instead of following a good plan. Maybe organic is fine, but it would help if we could just grab what we needed, as we needed it (planet? check. jenkins? check. etc.) without having to worry about all the infrastructure around it. To that end, Jason has been spinning up services using OpenShift quickstarts.

- Karsten

...
_______________________________________________ Infra mailing list Infra@ovirt.org http://lists.ovirt.org/mailman/listinfo/infra

- -- name: Karsten 'quaid' Wade, Sr. Community Architect team: Red Hat Community Architecture & Leadership uri: http://communityleadershipteam.org http://TheOpenSourceWay.org gpg: AD0E0C41 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iD8DBQFPcguG2ZIOBq0ODEERAqPUAKCM1v5s3CJlReelpFxupx5Nu49RqgCfUIXk Cz3ZYhWham8j3Ot1AD6tJNQ= =7CSd -----END PGP SIGNATURE-----

Karsten 'quaid' Wade

20 Mar

4:28 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/20/2012 07:18 AM, Ofer Schreiber wrote:

...

www.ovirt.org and gerrit.ovirt.org are now up and running.

We experienced two issues: 1. DB corruption on www.ovirt.org, caused by a full file system. 2. Faulty gerrit service, probably caused by #1.

Both issues were handled by oVirt infra team (mburns, quaid and myself)

I'm in a meeting all day today so I won't pause for a single root-cause analysis, but instead just dump pieces as I go.

As a first step, I'm fixing the easy mistakes, which includes that we didn't have a straightforward backup of the MediaWiki and WordPress databases. (We do have a daily Linode backup, but that's the painful way.)

As a stop-gap, I set up a few bash scripts to run every day to grab the database.

  crontab -e
  # Give root word about the backup
  MAILTO=root
  #
  # Run five minutes after Midnight Eastern at quietest time, every day
  5 0 * * * /root/bin/wordpress-backup.sh
  # Run ten minutes after Midnight Eastern at quietest time, every day
  10 0 * * * /root/bin/mediawiki-backup.sh

The root cause today was a fillup of /home/gerrit-backup/gerrit.ovirt.org-gerrit2-home-backup/ which has a daily snapshot of everything-that-is-gerrit.

The problem is, I didn't build a clean-up for those backups, so they went back to January when I did the last manual clean-up.

Gerrit probably fell over when trying to do the rsync of its backup to linode01.ovirt.org. That's the only way these two servers interact that I recall.

So we need a cleanup script to run in cron.weekly or cron.daily to erase the old backups.

We also need a script to rsync out the daily backup of the databases (and maybe other useful bits such as /var/www/html/w and /usr/share/wordpress.) We could copy this back over to gerrit.ovirt.org.

Umm, hacky, but would work. And be better than the current situation.

There is so little disk space on linode01.ovirt.org because I never intended to use that host this long. I've been working to find a better solution, preferably one running on KVM. :) and ideally provided by e.g. one of the sponsors.

- - Karsten
- --
name: Karsten 'quaid' Wade, Sr. Community Architect
team: Red Hat Community Architecture & Leadership
uri: http://communityleadershipteam.org http://TheOpenSourceWay.org
gpg: AD0E0C41
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iD8DBQFPaKI72ZIOBq0ODEERAhnAAKCvNMDHxxG3IR2rDBBarqsn7V/UAACg4HIo
T9rM2fCZTCdpDGrQsz/Xq2o=
=vzYt
-----END PGP SIGNATURE-----
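The contents of the two backup scripts named in the crontab are not shown in the thread. As a rough sketch only, a daily database dump of that kind could look like the function below. Everything specific here is an assumption: the function name backup_db, the database name and destination directory in the example call, and the idea that mysqldump reads its credentials from root's ~/.my.cnf. The --single-transaction option gives a consistent snapshot only for transactional (InnoDB) tables.

```shell
#!/bin/sh
# backup_db DB DEST
# Dump database DB into DEST/DB-YYYY-MM-DD.sql.gz and print that path.
# Hypothetical helper: credentials are assumed to come from ~/.my.cnf.
backup_db() {
    db=$1
    dest=$2
    out="$dest/$db-$(date +%F).sql"

    mkdir -p "$dest" || return 1

    # Dump to a plain file first so that a failed dump is detected
    # and no truncated file is left behind looking like a backup.
    if ! mysqldump --single-transaction "$db" > "$out"; then
        rm -f "$out"
        echo "dump of $db failed" >&2
        return 1
    fi

    # Compress in place and report where the backup ended up.
    gzip -f "$out" && echo "$out.gz"
}

# Example call (database name and directory are placeholders):
# backup_db wordpress /root/db-backups
```

Because the file name carries the date, these dumps accumulate one per day and need the same kind of pruning as the gerrit backups, otherwise they will eventually fill the disk in the same way.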

Mike Burns

5:50 p.m.

On Tue, 2012-03-20 at 08:28 -0700, Karsten 'quaid' Wade wrote:

...

On 03/20/2012 07:18 AM, Ofer Schreiber wrote:

...
www.ovirt.org and gerrit.ovirt.org are now up and running.

We experienced two issues: 1. DB corruption on www.ovirt.org, caused by a full file system. 2. Faulty gerrit service, probably caused by #1.

Both issues were handled by oVirt infra team (mburns, quaid and myself)

I'm in a meeting all day today so I won't pause for a single root-cause analysis, but instead just dump pieces as I go.

As a first step, I'm fixing the easy mistakes, which includes that we didn't have a straightforward backup of the MediaWiki and WordPress databases. (We do have a daily Linode backup, but that's the painful way.)

As a stop-gap, I set up a few bash scripts to run every day to grab the database.

  crontab -e
  # Give root word about the backup
  MAILTO=root
  #
  # Run five minutes after Midnight Eastern at quietest time, every day
  5 0 * * * /root/bin/wordpress-backup.sh
  # Run ten minutes after Midnight Eastern at quietest time, every day
  10 0 * * * /root/bin/mediawiki-backup.sh

The root cause today was a fillup of /home/gerrit-backup/gerrit.ovirt.org-gerrit2-home-backup/ which has a daily snapshot of everything-that-is-gerrit.

The problem is, I didn't build a clean-up for those backups, so they went back to January when I did the last manual clean-up.

Gerrit probably fell over when trying to do the rsync of its backup to linode01.ovirt.org. That's the only way these two servers interact that I recall.

So we need a cleanup script to run in cron.weekly or cron.daily to erase the old backups.

See the other reply on this thread from Eyal. There is a handy find script that will work well for this.

Note: we need it on the gerrit server as well if we use that as the backup server for the www site backup.

Mike

...

We also need a script to rsync out the daily backup of the databases (and maybe other useful bits such as /var/www/html/w and /usr/share/wordpress.) We could copy this back over to gerrit.ovirt.org.

Umm, hacky, but would work. And be better than the current situation.

There is so little disk space on linode01.ovirt.org because I never intended to use that host this long. I've been working to find a better solution, preferably one running on KVM. :) and ideally provided by e.g. one of the sponsors.

- Karsten _______________________________________________ Infra mailing list Infra@ovirt.org http://lists.ovirt.org/mailman/listinfo/infra

xrx

25 Mar

7:37 p.m.

On 03/20/12 18:18, Ofer Schreiber wrote:

...

www.ovirt.org and gerrit.ovirt.org are now up and running.

We experienced two issues: 1. DB corruption on www.ovirt.org, caused by a full file system. 2. Faulty gerrit service, probably caused by #1.

I highly recommend having Nagios (or its fork Icinga) for server monitoring; it would have warned you of the FS being full well beforehand.

-xrx

...

Both issues were handled by oVirt infra team (mburns, quaid and myself)

Thank you for your patience.

Ofer Schreiber oVirt infra team _______________________________________________ Arch mailing list Arch@ovirt.org http://lists.ovirt.org/mailman/listinfo/arch
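Until a monitoring service such as Nagios or Icinga is in place, the kind of warning xrx describes can be approximated with a small check run from cron. The sketch below is a stop-gap under stated assumptions, not a replacement for real monitoring: the function names and the default thresholds (80 and 90 percent) are invented for illustration. The return codes follow the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), so the same logic could later be wrapped as a plugin.

```shell
#!/bin/sh
# fs_usage_pct PATH
# Print the used-space percentage (an integer) of the file system
# holding PATH, taken from the second line of POSIX-format df output.
# Assumes the device name contains no spaces.
fs_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_fs PATH [WARN] [CRIT]
# Compare usage against the thresholds (hypothetical defaults 80/90)
# and return 0, 1, 2 or 3 in the Nagios plugin convention.
check_fs() {
    path=$1
    warn=${2:-80}
    crit=${3:-90}

    used=$(fs_usage_pct "$path")
    if [ -z "$used" ]; then
        echo "UNKNOWN: could not read usage for $path"
        return 3
    fi

    if [ "$used" -ge "$crit" ]; then
        echo "CRITICAL: $path is ${used}% full"
        return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "WARNING: $path is ${used}% full"
        return 1
    fi
    echo "OK: $path is ${used}% full"
    return 0
}

# Example: stay silent when healthy so cron only mails on trouble.
# msg=$(check_fs /home 80 90) || echo "$msg"
```

With MAILTO=root set in the crontab, the commented example line sends mail only when a threshold is crossed.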

Karsten 'quaid' Wade

27 Mar

1 a.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/25/2012 10:37 AM, xrx wrote:

...

On 03/20/12 18:18, Ofer Schreiber wrote:

...
www.ovirt.org and gerrit.ovirt.org are now up and running.

We experienced two issues: 1. DB corruption on www.ovirt.org, caused by a full file system. 2. Faulty gerrit service, probably caused by #1.

I highly recommend having Nagios (or its fork Icinga) for server monitoring; it would have warned you of the FS being full well beforehand.

Agreed. Anyone interested in setting this up for us? - - Karsten

...

-xrx

...
Both issues were handled by oVirt infra team (mburns, quaid and myself)

Thank you for your patience.

Ofer Schreiber oVirt infra team _______________________________________________ Arch mailing list Arch@ovirt.org http://lists.ovirt.org/mailman/listinfo/arch

_______________________________________________ Infra mailing list Infra@ovirt.org http://lists.ovirt.org/mailman/listinfo/infra

- -- name: Karsten 'quaid' Wade, Sr. Community Architect team: Red Hat Community Architecture & Leadership uri: http://communityleadershipteam.org http://TheOpenSourceWay.org gpg: AD0E0C41 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iD8DBQFPcPUh2ZIOBq0ODEERAlUFAKDf2LMHuUxVcLA7zFtDtBhIgCdJXQCcDZh8 xE1Urop9kHZ9pdjQGHz5AkA= =uZYC -----END PGP SIGNATURE-----
