So, there are parts of my site I do not want indexed, yet they continue to appear in the search engines. Yes, previously I tried robots.txt files, but apparently they do not work. I can see in the logs that some spiders even fetch the robots.txt file and then go on to walk my site anyway. WTF?
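
For reference, the robots.txt I had in place looked roughly like this (the /no_indexed/ path matches the directory used in the Apache config below):

User-agent: *
Disallow: /no_indexed/

Well-behaved crawlers are supposed to honor that; as the logs show, not all of them do.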

Later I tried to identify the subnets the spiders came from and deny them the content. Apache has a Deny from x.x.x.x directive, but that approach was too reactive, time-consuming, and ineffective. It returns a 403 HTTP code, which apparently does not remove the old versions from their archives.
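
For the record, that earlier attempt looked something like this (the subnet here is a placeholder, not an actual spider network):

<Directory /var/www/htdocs/no_indexed>
    Order allow,deny
    Allow from all
    # placeholder; in practice, ranges harvested from the access logs
    Deny from 10.0.0.0/8
</Directory>

Every newly spotted spider meant another log-trawling session and another Deny line, which is why I gave up on it.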

In the current stage of this war, I configured my server not to serve the requested documents when it detects a spider. It still returns a document, but one containing some links I am promoting instead.

#  This goes in your httpd.conf file, then run...
#     apachectl graceful
#
#  This little bit of config tries to keep parts
#  of a site from being indexed by spiders.
#  The spiders are screened by their User-Agent
#  strings; a substitute page is sent instead.
#  Requires mod_setenvif and mod_rewrite.
#

<Directory /var/www/htdocs/no_indexed>
    # Flag anything whose User-Agent smells like a robot.
    BrowserMatchNoCase http spider=1
    BrowserMatchNoCase bot spider=1
    BrowserMatchNoCase crawler spider=1
    BrowserMatchNoCase spider spider=1
    # Specific offenders seen in my logs.
    BrowserMatch Jeeves\)$ spider=1
    BrowserMatch ^Bilbo spider=1
    BrowserMatch ^websec spider=1
    BrowserMatch ^Scooter spider=1
    BrowserMatch ^Mercator spider=1
    BrowserMatch ^Java1 spider=1
    # An empty User-Agent is suspicious too.
    BrowserMatch ^$ spider=1
    # If the flag is set, hand back the substitute page.
    RewriteEngine On
    RewriteCond %{ENV:spider} ^1$
    RewriteRule .* /spider.html
    #Deny from env=spider   # would give 403 Forbidden instead
</Directory>
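
The spider.html itself is not shown here; a minimal version along these lines would do, with the links swapped for whatever you want to promote (the URLs below are placeholders):

<html>
<head><title>Nothing to index here</title></head>
<body>
<p>This content is not available to robots.
Try these pages instead:</p>
<a href="http://www.example.com/promoted1.html">Promoted page one</a><br>
<a href="http://www.example.com/promoted2.html">Promoted page two</a>
</body>
</html>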
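To check that the screen actually works, fetch a protected page while masquerading as a bot, for example with wget's -U flag (the hostname, path, and User-Agent string are placeholders):

wget -U "TestBot/1.0" -O - http://www.example.com/no_indexed/secret.html

That request should come back with the contents of spider.html, while the same URL fetched from a normal browser should return the real document.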