Yesterday, a long-standing issue with my Redmine installation at CodeTRAX was hopefully addressed. For a very long time I had been wondering why all indexing bots, both the well-known ones and the less popular ones, ignored the Disallow rules defined in the robots.txt file and kept indexing the sections of the web site that were meant to be excluded. My explanation so far had been that, since the web site had lacked a proper robots.txt file for many years, the indexing services simply needed a lot of time before taking the new rules into account and adjusting their bots’ behavior. By trying to index sections such as the issue tracker, the calendar or the repository, which offer many different ways of displaying and sorting the available data through query arguments, the bots ended up causing a tremendous increase in server load.
Yesterday I found some time to look into this issue again and, after thoroughly checking the response headers returned when the robots.txt file was requested, I noticed a small detail I had overlooked in the past: the Content-Type of the resource was set to text/html in the HTTP response headers instead of the correct text/plain. This was a little strange, so I decided to look into it more closely.
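For reference, such a check can be done with a simple HEAD request against the live site; the hostname below is only a placeholder:

# Fetch only the response headers for robots.txt (hostname is a placeholder).
curl -I https://www.example.org/robots.txt

# The relevant header in the response was:
# Content-Type: text/html; charset=utf-8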
Redmine’s robots.txt file is generated dynamically at run time from the template file at app/views/welcome/robots.html.erb. I had created a custom plugin in which, among other things, I override this template in order to extend the exclusion rules with more sections of the web site, such as each project’s issue and time tracker, the repository, the calendar, the Gantt section, etc. These parts of a Redmine site tend to generate a huge number of web pages due to the various query arguments that are available for sorting and for alternate data views. My robots.html.erb looks like this:
User-agent: *
<% @projects.each do |p| -%>
Disallow: /projects/<%= p.to_param %>/time_entries.csv
Disallow: /projects/<%= p.to_param %>/activity
Disallow: /projects/<%= p.to_param %>/activity.atom
Disallow: /projects/<%= p.to_param %>/roadmap
Disallow: /projects/<%= p.to_param %>/issues
Disallow: /projects/<%= p.to_param %>/issues.atom
Disallow: /projects/<%= p.to_param %>/issues.pdf
Disallow: /projects/<%= p.to_param %>/issues.csv
Disallow: /projects/<%= p.to_param %>/issues/calendar
Disallow: /projects/<%= p.to_param %>/issues/gantt
Disallow: /projects/<%= p.to_param %>/issues/report
Disallow: /projects/<%= p.to_param %>/time_entries
Disallow: /projects/<%= p.to_param %>/time_entries.atom
Disallow: /projects/<%= p.to_param %>/time_entries.csv
Disallow: /projects/<%= p.to_param %>/wiki/Wiki/history
Disallow: /projects/<%= p.to_param %>/wiki/date_index
Disallow: /projects/<%= p.to_param %>/repository
Disallow: /projects/<%= p.to_param %>/repository/annotate
Disallow: /projects/<%= p.to_param %>/repository/diff
Disallow: /projects/<%= p.to_param %>/repository/statistics
<% end -%>
Disallow: /issues
Disallow: /issues.atom
Disallow: /issues.pdf
Disallow: /issues.csv
Disallow: /issues/gantt
Disallow: /issues/calendar
Disallow: /activity
Disallow: /activity.atom
Disallow: /time_entries
Disallow: /time_entries.atom
Disallow: /time_entries.csv
Disallow: /login
After experimenting with it for a while, I came to the conclusion that, for some reason I could not explain, my Redmine application always returned the robots.txt file with the text/html content type. I tried clearing the cache in the application’s tmp/cache/ directory, restarted the application server, and cleared Varnish’s and my browser’s cache, but the content type of the file was still returned as:
...
Content-Type: text/html; charset=utf-8
...
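For completeness, the cache-clearing steps mentioned above were roughly the following. The exact commands depend on the deployment (the Passenger-style restart is just one example), so treat this as a sketch rather than the exact procedure:

# Clear the Rails file cache under tmp/cache/
bundle exec rake tmp:cache:clear RAILS_ENV=production

# Restart the application server (Passenger-style restart shown as an example)
touch tmp/restart.txt

# Invalidate all cached objects in Varnish
varnishadm "ban req.url ~ ."

As noted above, none of these steps changed the returned content type.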
At this point, I cannot safely tell what the real cause of this issue is. It could be a bug in Redmine, which I really doubt, or a problem with the rather complicated and experimental server configuration I currently use on the box where CodeTRAX is hosted. Time permitting, I’m going to do some more research on this in the following months.
As I needed a quick resolution, I implemented the following workarounds, which explicitly enforce the correct content type whenever the robots.txt file is requested.
I added the following to the Varnish configuration (I’ll post more information about running Redmine behind Varnish in a future post. I’m still experimenting with it!):
sub vcl_backend_response {
    if (bereq.url == "/robots.txt") {
        # Make sure robots.txt has the correct content type.
        set beresp.http.Content-Type = "text/plain; charset=utf-8";
        # Force a caching timeout (TTL) of 1 hour.
        set beresp.ttl = 1h;
    }
}
Also, to be on the safe side, I added the following to the Apache configuration:
<Files "robots.txt"> ForceType "text/plain" </Files>
So, now robots.txt is always returned to the HTTP client with the correct text/plain content type. My guess is that, until now, the indexing bots could not properly evaluate the contents of this file because of the wrong content type they were given, and therefore ended up ignoring all the indexing rules inside it. I’ll have to wait for several weeks before I can be certain that the wrong content type was the actual reason the bots completely ignored those rules, but the more I think about it, the more convinced I am that this small detail has been the cause of the problem.
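A quick way to confirm the fix is to request the headers again and check that the Content-Type now matches what the Varnish rule sets; as before, the hostname is only a placeholder:

# Re-check the response headers after applying the workarounds (hostname is a placeholder).
curl -I https://www.example.org/robots.txt

# Expected header, as set by the Varnish rule above:
# Content-Type: text/plain; charset=utf-8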
Update (Sep 23, 2016): The robots.html.erb template has been revised.
Fixing the robots.txt content type of Redmine at CodeTRAX by George Notaras is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright © 2016 - Some Rights Reserved