So, there are parts of my site I do not want indexed, yet they keep appearing in the search engines. Yes, I tried robots.txt files first, but apparently they do not work. I can see in the logs that some spiders fetch the robots.txt file and then go on to walk my site anyway. WTF?
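
For reference, what I had in place was the usual sort of thing (the directory name matches the one in the config further down):

User-agent: *
Disallow: /no_indexed/

Compliance with robots.txt is entirely voluntary, which turns out to be the whole problem.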

Next I tried to identify the subnets the spiders came from and deny them the content. Apache has a Deny from x.x.x.x directive, but that approach was too reactive, time-consuming, and ineffective. It returns an HTTP 403, which apparently does not cause the engines to drop the old copies from their archives.
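
For the record, that stage looked roughly like this (the subnet is a made-up example):

<Directory /var/www/htdocs/no_indexed>
    Order allow,deny
    Allow from all
    Deny from 192.0.2.0/24    # a subnet I had seen a spider crawl from
</Directory>

Every new spider meant another log-reading session and another Deny line, and the pages already archived stayed archived.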

In the current stage of this war, I have configured my server not to serve the requested documents when it detects a spider. It still returns a document, just one filled with some links I am promoting instead.

#  This goes in your httpd.conf file, then run...
#     apachectl graceful
#
#  This little bit of config tries to keep parts
#  of a site from being indexed by spiders.
#  Spiders are screened by their User-Agent
#  strings; anything that looks like a spider is
#  sent /spider.html instead.
#

<Directory /var/www/htdocs/no_indexed>
    # Flag anything that looks like a robot.  Many
    # robots give away a URL in their User-Agent,
    # hence the "http" match.
    BrowserMatchNoCase http spider=1
    BrowserMatchNoCase bot spider=1
    BrowserMatchNoCase crawler spider=1
    BrowserMatchNoCase spider spider=1
    # Specific offenders seen in my logs.
    BrowserMatch Jeeves\)$ spider=1
    BrowserMatch ^Bilbo spider=1
    BrowserMatch ^websec spider=1
    BrowserMatch ^Scooter spider=1
    BrowserMatch ^Mercator spider=1
    BrowserMatch ^Java1 spider=1
    # An empty User-Agent is suspicious too.
    BrowserMatch ^$ spider=1
    # Anything flagged above gets /spider.html in
    # place of whatever it asked for.
    RewriteEngine On
    RewriteCond %{ENV:spider} ^1$
    RewriteRule .* /spider.html [L]
    #Deny from env=spider   # would give 403 Denied
</Directory>
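
The spider.html document itself can be anything you like; mine is just a short page of links. A minimal sketch (the link target is a placeholder):

<html>
<head><title>Nothing to index here</title></head>
<body>
<p>This part of the site is not for indexing.
Here is something worth your crawl budget instead:</p>
<p><a href="http://www.example.com/">a link I am promoting</a></p>
</body>
</html>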