<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-us" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow" />
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="DC.Type" content="topic" />
<meta name="DC.Title" content="Manage Web spiders, Web crawlers, and robots on HTTP Server" />
<meta name="abstract" content="This topic provides information about how to manage Web spiders, Web crawlers, and robots." />
<meta name="description" content="This topic provides information about how to manage Web spiders, Web crawlers, and robots." />
<meta name="DC.Relation" scheme="URI" content="rzaieparsearch.htm" />
<meta name="copyright" content="(C) Copyright IBM Corporation 2002,2006" />
<meta name="DC.Rights.Owner" content="(C) Copyright IBM Corporation 2002,2006" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="rzaiespiders" />
<meta name="DC.Language" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="./ibmdita.css" />
<link rel="stylesheet" type="text/css" href="./ic.css" />
<title>Manage Web spiders, Web crawlers, and robots on HTTP Server</title>
</head>
<body id="rzaiespiders"><a name="rzaiespiders"><!-- --></a>
<!-- Java sync-link --><script language="Javascript" src="../rzahg/synch.js" type="text/javascript"></script>
<h1 class="topictitle1">Manage Web spiders, Web crawlers, and robots on HTTP Server</h1>
<div><p>This topic provides information about how to manage Web spiders,
Web crawlers, and robots.</p>
<div class="important"><span class="importanttitle">Important:</span> Information
in this topic applies to the latest PTF levels for HTTP Server for i5/OS.
Install the latest PTFs to bring HTTP Server for i5/OS up to the latest
level. Some of the topics documented here are
not available prior to this update. See <a href="http://www-03.ibm.com/servers/eserver/iseries/software/http/services/service.html" target="_blank">http://www.ibm.com/servers/eserver/iseries/software/http/services/service.htm</a> <img src="www.gif" alt="Link outside Information Center" /> for more information. </div>
<p>Web spiders, Web crawlers, and robots are programs that traverse the Internet
retrieving documents and following links in those documents. You may have
noticed entries in your log files that document requests for /robots.txt files
or requests for many of your Web documents. These requests may be from a robot.
Most robots adhere to the robot exclusion protocol, but compliance is voluntary. To control
which portions of your Web site robots attempt to visit, you can use either
a robots.txt file or the robots meta tag. </p>
<p><strong>The robots.txt file </strong> </p>
<p>The robots.txt file must be placed in the document root directory of the
server. The following is an example of a robots.txt file: </p>
<pre class="block">User-agent: *
Disallow: /cgi-bin/</pre>
<div class="note"><span class="notetitle">Note:</span> Anyone can read the robots.txt file, so do not alert hackers to important
directories or files by listing them in it. </div>
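<p>As a minimal sketch of how several rule groups can be combined, the following robots.txt file
shuts one robot out of the entire site and keeps all other robots out of two directories. The robot
name ExampleBot and the /tmp/ directory are hypothetical and are used here for illustration only: </p>
<pre class="block"># ExampleBot is a hypothetical robot name; substitute the name that a real robot announces.
User-agent: ExampleBot
Disallow: /

# All other robots: stay out of these directories (/tmp/ is a hypothetical path).
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/</pre>
<p>A robot that follows the protocol applies the group that names it and uses the * group only
when no other group matches. </p>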
<p><strong>Robots meta tag </strong></p>
<p>The robots meta tag can be placed in HTML documents to tell the robot: </p>
<ul><li>Do not index a document <pre>&lt;META NAME="ROBOTS" CONTENT="NOINDEX"&gt;</pre>
</li>
<li>Do not follow links in a document <pre>&lt;META NAME="ROBOTS" CONTENT="NOFOLLOW"&gt;</pre>
</li>
</ul>
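<p>The two values can also be combined in a single tag. As a sketch, assuming a page that should
be neither indexed nor used as a source of links, the tag is placed inside the head element of the
document (the title text is a placeholder): </p>
<pre class="block">&lt;head&gt;
&lt;title&gt;Example page&lt;/title&gt;
&lt;META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"&gt;
&lt;/head&gt;</pre>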
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="rzaieparsearch.htm" title="This topic provides step-by-step tasks for the Webserver search engine.">Search tasks</a></div>
</div>
</div>
</body>
</html>