Manage Web spiders, Web crawlers, and robots on HTTP Server

This topic provides information about how to manage Web spiders, Web crawlers, and robots.

Important: Information for this topic supports the latest PTF levels for HTTP Server for i5/OS. It is recommended that you install the latest PTFs to upgrade to the latest level of the HTTP Server for i5/OS. Some of the topics documented here are not available prior to this update. See http://www.ibm.com/servers/eserver/iseries/software/http/services/service.htm for more information.

Web spiders, Web crawlers, and robots are programs that traverse the Internet, retrieving documents and following links in those documents. You may have noticed entries in your log files that document requests for /robots.txt files or requests for many of your Web documents. These requests may be from a robot. Most robots adhere to the Robots Exclusion Protocol. If you want to control which portions of your Web site robots attempt to visit, you can use either a robots.txt file or the robots meta tag.

The robots.txt file

The robots.txt file must be placed in the document root directory of the server. The following example of a robots.txt file tells all robots (User-agent: *) not to retrieve any documents under the /cgi-bin/ directory:

User-agent: *
Disallow: /cgi-bin/
Note: Make sure that you do not alert hackers to important directories or files by listing them in the robots.txt file.
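To illustrate how a well-behaved robot applies these rules, the following sketch parses the example robots.txt above with Python's standard urllib.robotparser module. The file name /cgi-bin/search.pgm is a hypothetical path chosen only for illustration; it is not part of this topic.

```python
# Sketch: how a robot that honors the Robots Exclusion Protocol
# interprets the example robots.txt shown above.
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: *",        # applies to every robot
    "Disallow: /cgi-bin/",  # nothing under /cgi-bin/ may be retrieved
])

# Anything under /cgi-bin/ is off limits to every robot
# ("search.pgm" is a hypothetical file name)...
print(rules.can_fetch("*", "/cgi-bin/search.pgm"))  # False
# ...while the rest of the site may still be crawled.
print(rules.can_fetch("*", "/index.html"))          # True
```

A robot typically requests /robots.txt once before crawling, then checks each URL against the parsed rules in this way before retrieving it.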

Robots meta tag

The robots meta tag can be placed in HTML documents to tell the robot: