<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-us" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow" />
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="DC.Type" content="topic" />
<meta name="DC.Title" content="Web crawling on HTTP Server" />
<meta name="abstract" content="This topic provides information about Web crawling and Web crawlers." />
<meta name="description" content="This topic provides information about Web crawling and Web crawlers." />
<meta name="DC.Relation" scheme="URI" content="rzaieconcepts.htm" />
<meta name="copyright" content="(C) Copyright IBM Corporation 2002,2006" />
<meta name="DC.Rights.Owner" content="(C) Copyright IBM Corporation 2002,2006" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="rzaiewebcrwl" />
<meta name="DC.Language" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="./ibmdita.css" />
<link rel="stylesheet" type="text/css" href="./ic.css" />
<title>Web crawling on HTTP Server</title>
</head>
<body id="rzaiewebcrwl"><a name="rzaiewebcrwl"><!-- --></a>
<!-- Java sync-link --><script language="Javascript" src="../rzahg/synch.js" type="text/javascript"></script>
<h1 class="topictitle1">Web crawling on HTTP Server</h1>
<div><p>This topic provides information about Web crawling and Web crawlers.</p>
<div class="important"><span class="importanttitle">Important:</span> The information
in this topic applies to the latest PTF levels for HTTP Server for i5/OS.
It is recommended that you install the latest PTFs to upgrade to the latest
level of the HTTP Server for i5/OS. Some of the topics documented here are
not available before this update. See <a href="http://www-03.ibm.com/servers/eserver/iseries/software/http/services/service.html" target="_blank">http://www.ibm.com/servers/eserver/iseries/software/http/services/service.htm</a> <img src="www.gif" alt="Link outside Information Center" /> for more information.</div>
<p>A Web crawler is a program that starts from a URL on another Web server. A "crawl"
is the process in which the Web crawler follows links within Web pages and downloads
the HTML and text pages it finds. The Web crawler downloads the files to your local
directory and creates a document list. The document list and the files can
then be used to create a search index. The search results link to the
actual URL that was found during the crawl.</p>
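<p>The crawl described above can be sketched in a few lines of Python. This is
only an illustration of the general technique (follow links, download pages,
build a document list), not the implementation used by HTTP Server. The names
<tt>LinkCollector</tt>, <tt>crawl</tt>, <tt>fetch</tt>, and <tt>max_documents</tt>
are assumptions made for this example; <tt>fetch</tt> stands in for whatever
routine downloads a page and returns its text.</p>

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_documents=10):
    """Breadth-first crawl; returns the document list as (url, text) pairs.

    fetch is a hypothetical callable: it takes a URL and returns the page
    text, or None when the page cannot be downloaded.
    """
    queue = deque([start_url])
    seen = {start_url}
    documents = []
    while queue and max_documents > len(documents):
        url = queue.popleft()
        text = fetch(url)
        if text is None:
            continue
        documents.append((url, text))
        collector = LinkCollector()
        collector.feed(text)
        for href in collector.links:
            # Resolve relative links and drop any #fragment part.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return documents
```

<p>Each entry in the returned list keeps the URL together with the downloaded
text, which mirrors how search results can link back to the URL found during
the crawl.</p>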
<div class="attention"><span class="attentiontitle">Attention:</span> The Web crawler downloads text and
HTML files to your iSeries™. The iSeries checks whether sufficient memory
is available for a successful Web crawl, but it does not check for available
storage.</div>
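<p>Because available storage is not checked for you, it can help to check it
yourself before starting a crawl. The following Python sketch shows one
generic way to do that with the standard library. The function name
<tt>has_free_storage</tt> and its parameters are assumptions made for this
example, and the amount of space a crawl needs is something you must estimate
yourself.</p>

```python
import shutil


def has_free_storage(directory, required_bytes):
    """Return True when the file system that holds directory has at
    least required_bytes of free space."""
    usage = shutil.disk_usage(directory)
    return usage.free >= required_bytes
```

<p>For example, <tt>has_free_storage(".", 500 * 1024 * 1024)</tt> reports
whether about 500 MB is free in the file system that holds the current
directory.</p>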
<p>To crawl a Web site, you must specify attributes such as the document storage
directory and the URL to crawl. Alternatively, you can start a crawl
using a URL object and an options object that you have already created using other forms.
A URL object contains a list of URLs. An options object contains crawling
attributes, such as the proxy server to use for each crawling session.</p>
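<p>As a rough illustration of how a URL object and an options object divide
the crawl settings, the following Python sketch models them as two small data
structures. Every class, field, and function name here (<tt>UrlObject</tt>,
<tt>OptionsObject</tt>, <tt>proxy_server</tt>, <tt>max_minutes</tt>,
<tt>describe_crawl</tt>) is hypothetical and chosen for this example only; the
sketch does not describe the actual object layout used by HTTP Server.</p>

```python
from dataclasses import dataclass, field


@dataclass
class UrlObject:
    """A named list of URLs to crawl (hypothetical structure)."""
    name: str
    urls: list = field(default_factory=list)


@dataclass
class OptionsObject:
    """Crawling attributes reused across sessions (hypothetical fields)."""
    name: str
    proxy_server: str = ""
    max_minutes: int = 30


def describe_crawl(url_object, options, storage_directory):
    """Summarize the settings a crawl started from the two objects would use."""
    return {
        "urls": list(url_object.urls),
        "storage_directory": storage_directory,
        "proxy_server": options.proxy_server or None,
        "max_minutes": options.max_minutes,
    }
```

<p>Keeping the URL list separate from the crawling attributes means that the
same attributes can be reused with different URL lists.</p>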
<p>Some sites cannot be entered without some sort of authentication, such
as a user ID and password, or certificate authentication. The Web crawler can
handle either case as long as you complete the required setup.
</p>
<p>For a site requiring a user ID and password, you must create a validation
list object that contains the URL, user ID, and password. See <a href="rzaievallstsrch.htm">Set up validation lists for the Webserver search engine on HTTP Server</a> for more information. Then be sure to specify the validation
list object when you start crawling. For a site requiring certificate authentication, see the <a href="../rzahu/rzahurzahu4abunderstanddc.htm">digital server certificate</a> information on
how to obtain a certificate. The digital certificate manager
can be used to obtain a new certificate, or register an existing certificate, for any
secure server instance of the IBM<sup>®</sup> HTTP Server. </p>
<p>Building a document list by crawling Web sites always runs as a background
task and takes at least several minutes. The run time depends on the
maximum time you selected for the session and on the other attributes
you specified. </p>
<p>See <a href="rzaiedoclstsrch.htm#crawl">Build the document list by crawling a URL</a> for information
on how to use the Web crawler with the <span>IBM Web Administration for i5/OS™ interface</span>.</p>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="rzaieconcepts.htm" title="This topic provides concepts of functions on HTTP Server and IBM Web Administration for i5/OS interface.">Concepts of functions of HTTP Server</a></div>
</div>
</div>
</body>
</html>