<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-us" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow" />
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="DC.Type" content="topic" />
<meta name="DC.Title" content="Web crawling on HTTP Server" />
<meta name="abstract" content="This topic provides information about Web crawling and Web crawlers." />
<meta name="description" content="This topic provides information about Web crawling and Web crawlers." />
<meta name="DC.Relation" scheme="URI" content="rzaieconcepts.htm" />
<meta name="copyright" content="(C) Copyright IBM Corporation 2002,2006" />
<meta name="DC.Rights.Owner" content="(C) Copyright IBM Corporation 2002,2006" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="rzaiewebcrwl" />
<meta name="DC.Language" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="./ibmdita.css" />
<link rel="stylesheet" type="text/css" href="./ic.css" />
<title>Web crawling on HTTP Server </title>
</head>
<body id="rzaiewebcrwl"><a name="rzaiewebcrwl"><!-- --></a>
<!-- Java sync-link --><script language="Javascript" src="../rzahg/synch.js" type="text/javascript"></script>
<h1 class="topictitle1">Web crawling on HTTP Server </h1>
<div><p>This topic provides information about Web crawling and Web crawlers.</p>
<div class="important"><span class="importanttitle">Important:</span> Information
for this topic supports the latest PTF levels for HTTP Server for i5/OS.
It is recommended that you install the latest PTFs to upgrade to the latest
level of the HTTP Server for i5/OS. Some of the topics documented here are
not available prior to this update. See <a href="http://www-03.ibm.com/servers/eserver/iseries/software/http/services/service.html" target="_blank">http://www.ibm.com/servers/eserver/iseries/software/http/services/service.htm</a> <img src="www.gif" alt="Link outside Information Center" /> for more information. </div>
<p>A Web crawler is a program that retrieves documents from other Web servers, starting from a URL that you specify. During a "crawl",
the Web crawler follows the links within Web pages and downloads
the HTML and text pages that it finds. The Web crawler downloads the files to a local
directory and creates a document list. The document list and the files can
then be used to create a search index. Search results link to the
actual URL that was found during the crawl. </p>
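<p>The crawl described above can be pictured with a minimal sketch in Python. It is an illustration only and not the HTTP Server implementation. The names <code>LinkCollector</code>, <code>crawl</code>, <code>fetch</code>, and <code>max_pages</code> are assumptions made for this example, and the sketch keeps the pages in memory rather than writing them to a local directory.</p>

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl starting at start_url.

    fetch(url) is a caller-supplied function (an assumption of this
    sketch) that returns the page text, or None if the page is missing.
    Returns the document list as (url, text) pairs in crawl order.
    """
    queue = deque([start_url])
    seen = {start_url}
    documents = []
    while queue and len(documents) != max_pages:
        url = queue.popleft()
        text = fetch(url)
        if text is None:
            continue  # nothing was downloaded for this URL
        documents.append((url, text))
        parser = LinkCollector()
        parser.feed(text)
        for href in parser.links:
            # Resolve relative links and drop any fragment part.
            target = urldefrag(urljoin(url, href)).url
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return documents
```

<p>Each entry in the returned list pairs a downloaded page with the URL it came from, which is what allows search results to link back to the actual URL.</p>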
<div class="attention"><span class="attentiontitle">Attention:</span> The Web crawler downloads text and
HTML files to your iSeries™. The iSeries checks whether sufficient memory
is available for a successful Web crawl, but it does not check for available
storage.</div>
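<p>Because available storage is not checked for you, it can help to check it yourself before you start a crawl. The following Python sketch shows one way to do that on a system where Python is available. It is not part of HTTP Server, and the function name <code>has_room</code> and the <code>required_bytes</code> estimate are assumptions made for this example.</p>

```python
import shutil


def has_room(directory, required_bytes):
    """Return True if the file system that holds directory has at
    least required_bytes of free space (hypothetical helper)."""
    free = shutil.disk_usage(directory).free
    return free >= required_bytes
```

<p>For example, <code>has_room(".", 500 * 1024 * 1024)</code> reports whether roughly 500 MB is free on the file system that holds the current directory.</p>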
<p>To crawl a Web site, you must specify attributes such as the document storage
directory and the URL to crawl. Alternatively, you can start a crawl
using a URL object and an options object that you have already created using other forms.
A URL object contains a list of URLs. An options object contains crawling
attributes, such as the proxy server to use for each crawling session.  </p>
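<p>The relationship between a URL object, an options object, and a crawl can be modeled with a short Python sketch. The class and field names below are hypothetical stand-ins chosen for illustration; they are not the actual object formats used by HTTP Server, and the default session time is an assumed value.</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UrlObject:
    """Hypothetical stand-in for a URL object: a named list of URLs."""
    name: str
    urls: List[str] = field(default_factory=list)


@dataclass
class OptionsObject:
    """Hypothetical stand-in for an options object: crawling attributes."""
    name: str
    proxy_server: Optional[str] = None  # proxy used for each crawling session
    max_session_minutes: int = 30       # assumed default, not an IBM value


@dataclass
class CrawlRequest:
    """Everything needed to start a crawl in this sketch."""
    document_directory: str
    url_object: UrlObject
    options: OptionsObject

    def describe(self):
        proxy = self.options.proxy_server or "no proxy"
        return "{} URL(s) into {} via {}".format(
            len(self.url_object.urls), self.document_directory, proxy)
```

<p>Keeping the URL list and the crawling attributes in separate objects means that the same options can be reused with different URL lists, and the reverse.</p>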
<p>Some sites cannot be entered without authentication, such
as a user ID and password, or certificate authentication. The Web crawler can
handle either case as long as you complete the required setup.
  </p>
<p>For a site that requires a user ID and password, you must create a validation
list object that contains the URL, user ID, and password. See <a href="rzaievallstsrch.htm">Set up validation lists for the Webserver search engine on HTTP Server</a> for more information. Then specify the validation
list object when you start crawling. For a site that requires certificate authentication, see the <a href="../rzahu/rzahurzahu4abunderstanddc.htm">digital server certificate</a> information on
how to obtain certificate authentication. You can use the digital certificate manager
to obtain a new certificate, or to register an existing one, for any
secure server instance of the IBM<sup>®</sup> HTTP Server.  </p>
<p>Building a document list by crawling Web sites always runs as a background
task. It takes at least several minutes to run; the actual duration depends on the
maximum time you selected for the session to run and on the other attributes
you have specified. </p>
<p>See <a href="rzaiedoclstsrch.htm#crawl">Build the document list by crawling a URL</a> for information
on how to use the Web crawler with the <span>IBM Web Administration for i5/OS™ interface</span>.</p>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="rzaieconcepts.htm" title="This topic provides concepts of functions on HTTP Server and IBM Web Administration for i5/OS interface.">Concepts of functions of HTTP Server</a></div>
</div>
</div>
</body>
</html>