<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-us" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow" />
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="DC.Type" content="topic" />
<meta name="DC.Title" content="Web crawling on HTTP Server" />
<meta name="abstract" content="This topic provides information about Web crawling and Web crawlers." />
<meta name="description" content="This topic provides information about Web crawling and Web crawlers." />
<meta name="DC.Relation" scheme="URI" content="rzaieconcepts.htm" />
<meta name="copyright" content="(C) Copyright IBM Corporation 2002,2006" />
<meta name="DC.Rights.Owner" content="(C) Copyright IBM Corporation 2002,2006" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="rzaiewebcrwl" />
<meta name="DC.Language" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="./ibmdita.css" />
<link rel="stylesheet" type="text/css" href="./ic.css" />
<title>Web crawling on HTTP Server </title>
</head>
<body id="rzaiewebcrwl"><a name="rzaiewebcrwl"><!-- --></a>
<!-- Java sync-link --><script language="Javascript" src="../rzahg/synch.js" type="text/javascript"></script>
<h1 class="topictitle1">Web crawling on HTTP Server </h1>
<div><p>This topic provides information about Web crawling and Web crawlers.</p>
<div class="important"><span class="importanttitle">Important:</span> Information
for this topic supports the latest PTF levels for HTTP Server for i5/OS.
Install the latest PTFs to upgrade to the latest
level of the HTTP Server for i5/OS. Some of the topics documented here are
not available prior to this update. See <a href="http://www-03.ibm.com/servers/eserver/iseries/software/http/services/service.html" target="_blank">http://www.ibm.com/servers/eserver/iseries/software/http/services/service.htm</a> <img src="www.gif" alt="Link outside Information Center" /> for more information.</div>
<p>A Web crawler is a program that locates and retrieves pages at URLs on other
Web servers. A "crawl" is the process in which the Web crawler follows links
within Web pages and downloads the HTML and text pages it finds. The Web crawler
downloads the files to your local directory and creates a document list. The
document list and the files can then be used to create a search index. The
search results link to the actual URL that was found during the crawl.</p>
<div class="attention"><span class="attentiontitle">Attention:</span> The Web crawler downloads text and
HTML files to your iSeries™. The iSeries checks whether sufficient memory
is available for a successful Web crawl, but it does not check for available
storage.</div>
<p>To crawl a Web site, you must specify attributes such as the document storage
directory and the URL to crawl. Alternatively, you can start a crawl
using a URL object and an options object that you have already created using other forms.
A URL object contains a list of URLs. An options object contains crawling
attributes, such as the proxy server to use for each crawling session.</p>
<p>Some sites cannot be entered without authentication, such
as a user ID and password, or certificate authentication. The Web crawler can
handle either case as long as you complete the required setup.</p>
<p>For a site requiring a user ID and password, you must create a validation
list object that contains the URL, user ID, and password. See <a href="rzaievallstsrch.htm">Set up validation lists for the Webserver search engine on HTTP Server</a> for more information. Then specify the validation
list object when you start crawling. For certificate authentication, see the <a href="../rzahu/rzahurzahu4abunderstanddc.htm">digital server certificate</a> information on
how to obtain a certificate. You can use the Digital Certificate Manager
to obtain a new certificate, or register an existing one, for any
secure server instance of the IBM<sup>®</sup> HTTP Server.</p>
<p>Building a document list by crawling Web sites always runs as a background
task and takes at least several minutes, depending on the
maximum time you selected for the session to run and the other attributes
you have specified.</p>
<p>See <a href="rzaiedoclstsrch.htm#crawl">Build the document list by crawling a URL</a> for information
on how to use the Web crawler with the <span>IBM Web Administration for i5/OS™ interface</span>.</p>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="rzaieconcepts.htm" title="This topic provides concepts of functions on HTTP Server and IBM Web Administration for i5/OS interface.">Concepts of functions of HTTP Server</a></div>
</div>
</div>
</body>
</html>