Here is a feed crawler again…

Yesterday and today, our server’s log again shows hits from what looks like a robot at IP address 209.237.230.104:

209.237.230.104 - - [03/Jan/2006:18:44:19 +0900] "GET /~lenglet HTTP/1.0" 301 324 -
209.237.230.104 - - [03/Jan/2006:18:44:20 +0900] "GET /~lenglet/ HTTP/1.0" 200 31782 -
209.237.230.104 - - [03/Jan/2006:18:44:20 +0900] "GET /atom.xml HTTP/1.0" 404 282 -
209.237.230.104 - - [03/Jan/2006:18:44:21 +0900] "GET /rss.xml HTTP/1.0" 404 281 -
209.237.230.104 - - [03/Jan/2006:18:44:21 +0900] "GET /index.xml HTTP/1.0" 404 283 -
209.237.230.104 - - [04/Jan/2006:16:47:22 +0900] "GET /~lenglet HTTP/1.0" 301 324 -
209.237.230.104 - - [04/Jan/2006:16:47:22 +0900] "GET /~lenglet/ HTTP/1.0" 200 31782 -
209.237.230.104 - - [04/Jan/2006:16:47:23 +0900] "GET /atom.xml HTTP/1.0" 404 282 -
209.237.230.104 - - [04/Jan/2006:16:47:23 +0900] "GET /rss.xml HTTP/1.0" 404 281 -
209.237.230.104 - - [04/Jan/2006:16:47:23 +0900] "GET /index.xml HTTP/1.0" 404 283 -
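Spotting this pattern by eye gets tedious, so here is a minimal sketch of how one might scan an access log for clients that guess at feed file names and get 404s. It assumes the log format shown in the excerpt above; the names `LOG_LINE`, `FEED_GUESSES` and `feed_probers` are made up for illustration, and the list of guessed paths is simply the three seen in the excerpt.

```python
import re
from collections import defaultdict

# Matches lines in the format of the excerpt above:
# client IP, two ignored fields, [timestamp], "request line", status, size.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Feed file names the robot guessed at (taken from the log excerpt).
FEED_GUESSES = {"/atom.xml", "/rss.xml", "/index.xml"}


def feed_probers(lines):
    """Return {ip: set of guessed feed paths that were answered with 404}."""
    probes = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        if match["status"] == "404" and match["path"] in FEED_GUESSES:
            probes[match["ip"]].add(match["path"])
    return dict(probes)


sample = [
    '209.237.230.104 - - [03/Jan/2006:18:44:20 +0900] "GET /~lenglet/ HTTP/1.0" 200 31782 -',
    '209.237.230.104 - - [03/Jan/2006:18:44:20 +0900] "GET /atom.xml HTTP/1.0" 404 282 -',
    '209.237.230.104 - - [03/Jan/2006:18:44:21 +0900] "GET /rss.xml HTTP/1.0" 404 281 -',
    '209.237.230.104 - - [03/Jan/2006:18:44:21 +0900] "GET /index.xml HTTP/1.0" 404 283 -',
]

for ip, paths in feed_probers(sample).items():
    print(ip, sorted(paths))
```

Feeding it the lines of a real log file (for example `feed_probers(open(path))`, with `path` pointing at wherever your server writes its access log) should list each address that probed for feeds, together with the paths it tried.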

This looks exactly like the hits I recently got from Feedster’s crappy robot, which was looking for RSS feeds on my web page. I had to send an email to Feedster; they responded quickly and soon stopped hitting our server.

Are they back with a vengeance? Or have they sold the code of their buggy robot to someone else? In any case, 209.237.230.104 is not one of their addresses: it belongs to United Layer, an ISP which is probably hosting the robot that generated the hits I observed.

I have added yet another entry to my Apache .htaccess configuration file to deny all access from 209.237.230.104… When will these people learn to respect standards, including the Robots Exclusion Standard?!
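For what it is worth, honouring the Robots Exclusion Standard is not hard. The log excerpt shows no request for /robots.txt before the feed guesses; a polite robot would fetch that file first and check each URL against it. Below is a minimal sketch of such a check using Python’s standard `urllib.robotparser` module. The robots.txt content, the `ExampleFeedBot` user agent and the `example.org` URL are all hypothetical; this post does not show my actual robots.txt, and I have no idea how the robot in question is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleFeedBot
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real robot would download
    # /robots.txt from the target host first (set_url() and read()).
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The robot named in the file is shut out; an unlisted robot is not.
print(may_fetch(ROBOTS_TXT, "ExampleFeedBot", "http://example.org/atom.xml"))
print(may_fetch(ROBOTS_TXT, "OtherBot", "http://example.org/atom.xml"))
```

A robot that ran a check like this before every request would simply skip the disallowed URLs, and nobody would need to block it by IP address.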