Web crawling on HTTP Server

This topic provides information about Web crawling and Web crawlers.

Important: Information for this topic supports the latest PTF levels for HTTP Server for i5/OS. It is recommended that you install the latest PTFs to upgrade to the latest level of the HTTP Server for i5/OS. Some of the topics documented here are not available prior to this update. See http://www.ibm.com/servers/eserver/iseries/software/http/services/service.htm for more information.

A Web crawler is a program that retrieves documents from other Web servers. A "crawl" is the process of following links within Web pages and downloading the HTML and text pages found along the way. The Web crawler downloads the files to your local directory and creates a document list. The document list and the files can then be used to build a search index. Search results link to the actual URLs found during the crawl.
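The crawl loop described above can be sketched in a few lines. This is an illustrative model only, not the HTTP Server implementation: it follows links breadth-first, keeps only HTML and text pages, and records each kept page in a document list. The `fetch` callback is a hypothetical stand-in for the HTTP download step.

```python
# Minimal sketch of a crawl loop: follow links breadth-first, keep
# HTML/text pages, and build a document list (illustrative only).
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(start_url, max_pages=100, fetch=None):
    """Breadth-first crawl. `fetch(url)` is assumed to return a
    (content_type, body) pair; a real crawler would issue an HTTP GET
    here and save the body to the document storage directory."""
    seen, queue, document_list = set(), [start_url], []
    while queue and len(document_list) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        content_type, body = fetch(url)
        if "text/html" in content_type or "text/plain" in content_type:
            document_list.append(url)  # record the page for indexing
            if "text/html" in content_type:
                queue.extend(extract_links(body, url))
    return document_list
```

The `seen` set prevents the crawler from revisiting pages that link to each other, and `max_pages` bounds the crawl much as the session attributes described later bound the real one.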

Attention: The Web crawler downloads text and HTML files to your iSeries™. The iSeries checks whether sufficient memory is available for a successful Web crawl, but it does not check for available storage.

To crawl a Web site, you must specify attributes such as the document storage directory and the URL to crawl. Alternatively, you may start a crawl using a URL object and an options object that you have already created using other forms. A URL object contains a list of URLs. An options object contains crawling attributes, such as the proxy server to use for each crawling session.
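The two objects can be pictured as simple records. The field names below are hypothetical stand-ins for illustration; the real URL and options objects are created and managed through the administration forms.

```python
# Illustrative stand-ins for the URL object and options object;
# field names are assumptions, not the server's actual attributes.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CrawlOptions:
    document_dir: str                 # document storage directory
    proxy: Optional[str] = None       # proxy server for the session
    max_minutes: int = 60             # maximum time the crawl may run


@dataclass
class UrlObject:
    urls: List[str] = field(default_factory=list)  # URLs to crawl
```

Separating the URL list from the options lets you reuse one set of crawl attributes, such as a proxy setting, across several crawling sessions.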

Some sites cannot be entered without some form of authentication, such as a user ID and password or certificate authentication. The Web crawler can handle either case, provided you complete the required setup.

For a site requiring a user ID and password, you must create a validation list object containing the URL, user ID, and password. See Set up validation lists for the Webserver search engine on HTTP Server for more information. Be sure to specify the validation list object when you start crawling. For certificate authentication, see the digital server certificate information. The digital certificate manager can be used to obtain a new certificate, or register an existing one, for any secure server instance of the IBM® HTTP Server.

Building a document list by crawling Web sites always runs as a background task. It takes several minutes at a minimum, depending on the maximum time you selected for the session to run and the other attributes you specified.

See Build the document list by crawling a URL for information on how to use the Web crawler with the IBM Web Administration for i5/OS™ interface.