gerrit not working

Wed Jun 12 15:40:43 UTC 2013

Hi, I've been monitoring it this evening and found some interesting things:

1) We are getting a lot of requests from bots (bing, ezooms, majestic), and some of them request malformed URLs that make gerrit crash, for example:

94.249.193.61 - - [12/Jun/2013:08:41:38 -0400] "GET /gitweb?p=ovirt-engine.git;a=blob;f=backend/manager/modules/c511backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/businessentities/VDS.java;h=bdd4414a843498ddacd548ee3dce158dc1b9a28f;hb=f122a88ef= HTTP/1.0" 200 261640 - "Mozilla/5.0 (compatible; MJ12bot/v1.4.3; http://www.majestic12.co.uk/bot.php?+)"

That request triggers these errors:

[2013-06-12 08:41:36,409] ERROR com.google.gerrit.httpd.gitweb.GitWebServlet : CGI: fatal: bad revision 'f122a88ef='         
[2013-06-12 08:41:36,410] ERROR com.google.gerrit.httpd.gitweb.GitWebServlet : CGI: [Wed Jun 12 08:41:35 2013] gitweb.cgi: Argument "package&nbsp;org.ovirt.engine.core.common.businessentiti..." isn't numeric in printf at /var/www/git/gitweb.cgi line 5412, <$fd> line 1.
.... A lot more similar errors

Other requests ask for bad revisions, and those leave an error in the log too:
[2013-06-12 08:46:27,356] ERROR com.google.gerrit.httpd.gitweb.GitWebServlet : CGI: fatal: bad revision 'f122a881lass517'

I'm not sure whether those are really breaking anything, but we could put a robots.txt in the root anyway.

2) The problems start when gerrit is unable to process a request in under 120s (2 min). Then this happens:

[2013-06-12 08:50:27,311] WARN  com.google.gerrit.server.git.MultiProgressMonitor : MultiProgressMonitor worker killed after 120587ms
java.util.concurrent.ExecutionException: java.util.concurrent.CancellationException
... full tracedump

But I'm not sure what causes the slowness. We can put a stricter robots.txt file in place to keep the crawlers away.
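
To see when these timeouts cluster, the warnings can be pulled out of the gerrit log and compared against the load at those times. This is a minimal sketch, assuming the log lines look exactly like the WARN entry quoted above; the function name `killed_workers` is made up for illustration.

```python
import re
from datetime import datetime

# Assumption: every timeout warning has the same shape as the one quoted
# in this mail: "[YYYY-MM-DD HH:MM:SS,mmm] WARN ... worker killed after Nms".
KILLED_RE = re.compile(
    r'^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]\s+WARN\s+'
    r'\S*MultiProgressMonitor\s*:\s*'
    r'MultiProgressMonitor worker killed after (?P<ms>\d+)ms'
)


def killed_workers(lines):
    """Return (timestamp, duration_ms) for each 'worker killed' warning."""
    events = []
    for line in lines:
        match = KILLED_RE.match(line)
        if match is None:
            continue  # not a timeout warning, skip it
        when = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        events.append((when, int(match.group("ms"))))
    return events
```

Passing an open log file to `killed_workers` works, since it only iterates over lines. Stack trace lines following each warning are skipped because they do not match the pattern.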

The load is around 18-20. We can try to lower the number of concurrent threads (from xinetd); that will make gerrit respond better to the clients that can connect and refuse the others, instead of hanging for everyone.

On the other hand, I see that it's not swapping: it has between 300 and 700MB of RAM free, open files peak at 2000 (ulimit is set to 655350, so no problem there), the number of gerrit threads is around 100, the number of connections (in any state) is 50 tops, and the number of git-daemon processes is around 50 (xinetd has a limit of 200...).

I'll try to get some statistics about git connections and http connections and see if I can find any common denominator (it's very likely that most of the traffic comes from the Red Hat office IP, but let's take a look anyhow).
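
For the http side, a rough way to look for a common denominator is to count requests per client IP and per user agent. This is only a sketch, assuming every access log line has the same shape as the MJ12bot entry quoted above (client IP first, user agent as the last double-quoted field); the name `tally` is hypothetical.

```python
import re
from collections import Counter

# Assumption: the client IP is the first field and the user agent is the
# last double-quoted field, as in the access log entry quoted in this mail.
LINE_RE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"\s*$')


def tally(lines):
    """Count requests per client IP and per user agent."""
    by_ip = Counter()
    by_agent = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # line does not look like an access log entry
        by_ip[match.group("ip")] += 1
        by_agent[match.group("agent")] += 1
    return by_ip, by_agent
```

Calling `most_common(10)` on either counter then gives the heaviest clients. A line with no quoted user agent at the end would have its last quoted field (the request itself) counted as the agent, so the output needs a quick look before trusting it.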

Some things I think can be done in the meantime:
 - Modify robots.txt to keep crawlers away and get rid of some of the error entries in the log
 - Create a mirror for jenkins, so it does not have to clone the repo from gerrit each time (maybe just changing the jobs config to not clean up the repo is enough, taking into account that the workspaces can end up dirty)
 - Set up graphing for the gerrit machine, so we can have more accurate data (that also means setting up the graphs server). Right now I'm using a small script (monitor_gerrit.sh) that shows raw numbers...
 - Lower the number of simultaneous threads to something below 50, with the drawback that it will refuse some clients.
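
For the robots.txt item, a candidate rule set can be checked against the URLs seen in the log before it is deployed. The rules below are an assumption for illustration only (the right set still has to be decided), and `is_allowed` is a hypothetical helper; the check itself uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set, NOT an agreed configuration: keep all crawlers
# out of the gitweb URLs that produce the errors shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /gitweb
"""


def is_allowed(user_agent, url_path, robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler may fetch url_path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url_path)


# A gitweb blob URL like the ones the bots request is rejected.
print(is_allowed("MJ12bot", "/gitweb?p=ovirt-engine.git;a=blob"))
```

Keep in mind that robots.txt is only advisory: it helps with crawlers that honour it and does nothing against those that ignore it.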

I'll keep trying to figure out what is happening.

----- Original Message -----
> From: "Itamar Heim" <iheim at redhat.com>
> To: "Omer Frenkel" <ofrenkel at redhat.com>, "David Caro Estevez" <dcaroest at redhat.com>
> Cc: "infra" <infra at ovirt.org>
> Sent: Wednesday, June 12, 2013 4:22:32 PM
> Subject: Re: gerrit not working
> 
> On 06/12/2013 04:28 PM, Omer Frenkel wrote:
> > i cant fetch, asked around and seems its not only me,
> > can anyone take a look?
> >
> > $ git fetch
> > Write failed: Broken pipe
> > fatal: Could not read from remote repository.
> >
> > Please make sure you have the correct access rights
> > and the repository exists.
> >
> 
> I restarted the service (which more people can try via the jenkins job
> as first mitigation).
> 
> david - care to look at the logs and try to understand which of new
> errors in the logs post the upgrade are interesting?
> 
> thanks,
>     Itamar
>