There are parts of my site I do not want indexed, yet they continue to appear in the search engines. Yes, I tried robots.txt files, but apparently they do not work: I can see in the logs that some spiders fetch robots.txt and then go on to walk my site anyway. WTF?
Later I tried to identify the subnets the spiders came from and deny them the content. Apache has a Deny from x.x.x.x directive, but that approach was too reactive, time-consuming, and ineffective. It returns a 403 HTTP status code, which apparently does not remove the old versions from the search engines' archives.
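For reference, a subnet block of that kind looks roughly like the sketch below. This is not my actual configuration: the directory path and the subnet (a documentation-only address range) are placeholders. It uses the Order/Allow/Deny syntax of Apache 1.3 through 2.2, which Apache 2.4 only accepts through mod_access_compat.

```apache
# Hypothetical example: refuse one crawler subnet access to a private directory.
# Both the path and the 192.0.2.0/24 range are placeholders, not real values.
<Directory "/var/www/html/private">
    # Evaluate Allow rules first, then Deny rules; a Deny match wins.
    Order allow,deny
    Allow from all
    # Requests from this subnet receive 403 Forbidden.
    Deny from 192.0.2.0/24
</Directory>
```

Every new spider subnet means another Deny line, which is why this approach never keeps up.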
In the current stage of this war, I have configured my server not to provide the requested document when it detects a spider. It still returns a document, but one containing some links I am promoting instead.
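One way to get this behaviour is with mod_rewrite, matching on the User-Agent header. The fragment below is a minimal sketch under stated assumptions, not a copy of the configuration file listed here: the User-Agent substrings, the /private/ prefix, and the location of spider.html at the document root are all illustrative guesses.

```apache
# Hypothetical sketch for the main server or a virtual host (not .htaccess).
RewriteEngine On

# Assumed list of crawler User-Agent substrings, matched case-insensitively.
RewriteCond %{HTTP_USER_AGENT} (googlebot|slurp|crawler|spider) [NC]

# For matching clients, answer any request under /private/ with the decoy
# page instead. [PT] hands the rewritten URL back to normal URL mapping.
RewriteRule ^/private/ /spider.html [PT]
```

Because the rewrite is internal, the spider receives a 200 response with the decoy content under the URL it asked for, rather than an error. User-Agent strings are trivial to fake, so this only catches spiders that identify themselves.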
| Name | Last modified | Size | Description |
|---|---|---|---|
| Parent Directory | | - | |
| spider_httpd.conf.txt | 2002-08-30 10:36 | 737 | |
| spider.html | 2002-08-30 10:33 | 464 | |