Problems with Feedster’s robot

Since yesterday, I have been seeing many hits in our web server’s logs (four hits every 30 minutes) from a robot at IP address 64.95.116.1. According to whois(1), this address belongs to “Feedster”. This is how I discovered the existence of the Feedster blog search engine…

Special note to the person who registered my web page with Feedster yesterday: that was kind of you, but it would have been better to register my real RSS feed URL (at http://www.csg.is.titech.ac.jp/~lenglet/rss.xml) instead of my web page (at http://www.csg.is.titech.ac.jp/~lenglet/), because the result is a lot of hits on non-existent URLs from the dumb Feedster robot, as this extract of our web server’s logs shows:

...
64.95.116.1 - - [08/Dec/2005:12:49:33 +0900] "GET /~lenglet HTTP/1.1" 301 336 -
64.95.116.1 - - [08/Dec/2005:12:49:33 +0900] "GET /~lenglet/ HTTP/1.1" 200 27650 -
64.95.116.1 - - [08/Dec/2005:12:49:34 +0900] "GET /atom.xml HTTP/1.1" 404 294 -
64.95.116.1 - - [08/Dec/2005:12:49:34 +0900] "GET /index.xml HTTP/1.1" 404 295 -
64.95.116.1 - - [08/Dec/2005:12:49:34 +0900] "GET /rss.xml HTTP/1.1" 404 293 -
64.95.116.1 - - [08/Dec/2005:13:17:48 +0900] "GET /~lenglet HTTP/1.1" 301 336 -
64.95.116.1 - - [08/Dec/2005:13:17:48 +0900] "GET /~lenglet/ HTTP/1.1" 200 27650 -
64.95.116.1 - - [08/Dec/2005:13:17:49 +0900] "GET /atom.xml HTTP/1.1" 403 298 -
64.95.116.1 - - [08/Dec/2005:13:17:49 +0900] "GET /index.xml HTTP/1.1" 403 299 -
64.95.116.1 - - [08/Dec/2005:13:17:49 +0900] "GET /rss.xml HTTP/1.1" 403 297 -
...

So if you could correct the URL of my feed in your Feedster account, or do anything else to stop those erroneous accesses, I would be very grateful. Thanks.

Here is why I say above that Feedster’s robot is dumb:

  1. It does not respect the Robots Exclusion Standard, under which web robots such as Feedster’s fetch a file named robots.txt from every web server they access, to check whether their accesses are welcome. Not only does Feedster’s robot ignore this standard, which is disrespectful, but it also accesses feeds every 30 minutes, which is excessive.
  2. It seems to misinterpret <link rel="alternate".../> elements in HTML page headers. For instance, my XHTML web page, which Feedster’s robot has accessed every 30 minutes, contains such elements in its header, and the robot seems to resolve them incorrectly. This leads to accesses to non-existent URLs (the requests answered with 404 and 403 HTTP error codes in the web logs above): it should have accessed /~lenglet/atom.xml instead of /atom.xml, etc.
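To illustrate the first point, here is a minimal sketch of the check a well-behaved robot could make before fetching a page, using Python's standard urllib.robotparser module. The robots.txt content and the user-agent name ExampleBot are hypothetical: this post shows neither our server's actual robots.txt nor the user-agent string that Feedster's robot sends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# "ExampleBot" is a made-up robot name, not Feedster's real user-agent.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /~lenglet/

User-agent: *
Disallow:
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the lines of an already downloaded robots.txt file,
    # so this sketch needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.csg.is.titech.ac.jp/~lenglet/"
    print("ExampleBot allowed:", may_fetch("ExampleBot", page))
    print("OtherBot allowed:", may_fetch("OtherBot", page))
```

A real robot would download /robots.txt from each server (RobotFileParser offers set_url() and read() for that) and skip every URL for which can_fetch() returns False.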

For information, here are the <link rel="alternate".../> elements in my web page’s header:

<link rel="alternate" type="application/atom+xml" title="Atom 0.3" href="./atom.xml">
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="./rss.xml">
<link rel="alternate" type="application/rss+xml" title="RSS 1.0" href="./index.xml">
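The sketch below shows how these relative href values are expected to resolve, using Python's standard urllib.parse.urljoin. It also shows one possible explanation for the bad requests, which is only a guess and not something the logs can confirm: if a robot resolves the hrefs against the URL without its trailing slash (the /~lenglet form that appears first in the logs), the last path segment is dropped and the feeds end up at the server root.

```python
from urllib.parse import urljoin

# URL of the page, as given in this post.
PAGE_URL = "http://www.csg.is.titech.ac.jp/~lenglet/"

# href values copied from the <link rel="alternate"> elements above.
HREFS = ["./atom.xml", "./rss.xml", "./index.xml"]


def resolve_all(base: str, hrefs: list[str]) -> list[str]:
    """Resolve each relative href against the given base URL."""
    return [urljoin(base, href) for href in hrefs]


# Correct resolution: the base is the page URL, with its trailing slash.
correct = resolve_all(PAGE_URL, HREFS)

# Assumed faulty behaviour: the base is the URL without the trailing
# slash, so "~lenglet" is treated as a file name and dropped.
wrong = resolve_all(PAGE_URL.rstrip("/"), HREFS)

if __name__ == "__main__":
    for good, bad in zip(correct, wrong):
        print(good, "versus", bad)
```

With the trailing slash the first href resolves to http://www.csg.is.titech.ac.jp/~lenglet/atom.xml, and without it to http://www.csg.is.titech.ac.jp/atom.xml, which matches the paths requested in the logs.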

I am certain that the accesses by Feedster’s robot to /atom.xml, /index.xml and /rss.xml are due to its interpretation of those <link rel="alternate".../> elements, because since I denied the robot any access to my web page, it no longer tries to access /atom.xml, /index.xml and /rss.xml. Here are the lines that I added to my root .htaccess file to deny Feedster’s robot access to my web page specifically:

<Limit GET>
order allow,deny
deny from 64.95.116.1
allow from all
</Limit>

I still get accesses from Feedster’s robot every 30 minutes, but they are now denied, and these are the lines I now get in our web server’s logs:

...
64.95.116.1 - - [08/Dec/2005:14:01:02 +0900] "GET /~lenglet HTTP/1.1" 403 298 -
64.95.116.1 - - [08/Dec/2005:14:20:14 +0900] "GET /~lenglet HTTP/1.1" 403 298 -
...

Once they have corrected my feed’s URL, I will probably re-enable access for that robot, but they should still fix their robot’s implementation…