Hi, I've been monitoring it this evening, and I've found some interesting things:
1) We are getting a lot of requests from bots (Bing, Ezooms, Majestic), and some of them
are requesting malformed URLs that make gerrit crash, for example:
94.249.193.61 - - [12/Jun/2013:08:41:38 -0400] "GET
/gitweb?p=ovirt-engine.git;a=blob;f=backend/manager/modules/c511backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/businessentities/VDS.java;h=bdd4414a843498ddacd548ee3dce158dc1b9a28f;hb=f122a88ef=
HTTP/1.0" 200 261640 - "Mozilla/5.0 (compatible; MJ12bot/v1.4.3;
http://www.majestic12.co.uk/bot.php?+)"
which triggers errors like:
[2013-06-12 08:41:36,409] ERROR com.google.gerrit.httpd.gitweb.GitWebServlet : CGI: fatal:
bad revision 'f122a88ef='
[2013-06-12 08:41:36,410] ERROR com.google.gerrit.httpd.gitweb.GitWebServlet : CGI: [Wed
Jun 12 08:41:35 2013] gitweb.cgi: Argument
"package org.ovirt.engine.core.common.businessentiti..." isn't
numeric in printf at /var/www/git/gitweb.cgi line 5412, <$fd> line 1.
... and a lot more similar errors
Then there are other requests that ask for bad revisions, where you can see the error
trace too:
[2013-06-12 08:46:27,356] ERROR com.google.gerrit.httpd.gitweb.GitWebServlet : CGI: fatal:
bad revision 'f122a881lass517'
But I'm not sure whether those are actually breaking anything; in any case, we could put
a robots.txt in the web root.
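As a first sketch (assuming we want to keep crawlers away from gitweb entirely and block
MJ12bot outright; the paths are my assumption), the robots.txt could be as simple as:

User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /gitweb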
2) The problems come when gerrit is unable to process a request within 120s (2 min);
then this happens:
[2013-06-12 08:50:27,311] WARN com.google.gerrit.server.git.MultiProgressMonitor :
MultiProgressMonitor worker killed after 120587ms
java.util.concurrent.ExecutionException: java.util.concurrent.CancellationException
... full tracedump
But I'm not sure what causes the slowness. A stricter robots.txt would at least keep
the crawlers away.
The load average is around 18-20. We can try to lower the number of concurrent
connections allowed (from xinetd); that will make gerrit respond better to the clients
that manage to connect and refuse the others, instead of hanging for everyone.
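For reference, a sketch of where that limit would live in the xinetd config (assuming
git-daemon is served from something like /etc/xinetd.d/git; the numbers are only
examples to tune):

service git
{
    # cap the number of concurrent git-daemon processes (example value)
    instances   = 40
    # cap the connections coming from a single source IP (example value)
    per_source  = 10
    # ... rest of the existing service definition ...
}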
On the other side, I see that it's not swapping, it has between 300 and 700MB of RAM
free, open files peak at around 2000 (the ulimit is set to 655350 and memory is ok, so
no problem there), the number of gerrit threads is around 100, the number of connections
(in any state) is 50 tops, and the number of git-daemon processes is around 50 (xinetd
has a limit of 200...).
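For reference, the kind of raw numbers I'm looking at can be pulled with one-liners like
these (a sketch; the process name and ports are assumptions about our setup):

# threads of the gerrit java process
ps -o nlwp= -C java
# connections in any state towards the http/ssh ports (assuming 8080 and 29418)
netstat -tan | grep -cE ':(8080|29418)'
# running git-daemon processes
pgrep -c git-daemon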
I'll try to get some statistics about git and http connections and see if I can find a
common denominator (it's very likely that most of the traffic comes from the Red Hat
office IP, but let's take a look anyhow).
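Something like this should do for a first pass over the http side (a sketch; the access
log path is an assumption):

awk '{print $1}' /var/log/httpd/access_log | sort | uniq -c | sort -rn | head -20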
Some things that can be done meanwhile:
- Modify robots.txt to keep crawlers away (as sketched above) and get rid of some of the
error entries in the log
- Create a mirror for jenkins, so it does not have to clone the repo from gerrit each
time (maybe just changing the job configs to not clean up the repo is enough, taking
into account that the workspaces can end up being dirty); see the sketch after this list
- Set up graphing for the gerrit machine, so we can have more accurate data (which also
means setting up the graphs server); right now I'm using a small script
(monitor_gerrit.sh) that shows raw numbers...
- Lower the number of simultaneous threads to something below 50 (as in the xinetd
snippet above), with the inconvenience that it will refuse some clients.
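For the mirror idea, a minimal sketch (the mirror path and the git:// URL are
assumptions):

# one-time setup of a bare mirror for jenkins to clone from
git clone --mirror git://gerrit.ovirt.org/ovirt-engine.git /srv/mirror/ovirt-engine.git
# periodic refresh, e.g. from a cron job
cd /srv/mirror/ovirt-engine.git && git remote update --prune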
I'll keep trying to figure out what is happening.
----- Original Message -----
From: "Itamar Heim" <iheim(a)redhat.com>
To: "Omer Frenkel" <ofrenkel(a)redhat.com>, "David Caro Estevez"
<dcaroest(a)redhat.com>
Cc: "infra" <infra(a)ovirt.org>
Sent: Wednesday, June 12, 2013 4:22:32 PM
Subject: Re: gerrit not working
On 06/12/2013 04:28 PM, Omer Frenkel wrote:
> I can't fetch; I asked around and it seems it's not only me.
> Can anyone take a look?
>
> $ git fetch
> Write failed: Broken pipe
> fatal: Could not read from remote repository.
>
> Please make sure you have the correct access rights
> and the repository exists.
>
I restarted the service (which more people can try via the jenkins job
as a first mitigation).
David - care to look at the logs and try to understand which of the new
errors in the logs post-upgrade are interesting?
thanks,
Itamar