Start HTTP Crawler (STRHTTPCRL)

The Start HTTP Crawler (STRHTTPCRL) command allows you to create or append to a document list by crawling remote web sites, downloading the files found, and saving their path names in the specified document list.

To create a document list, specify *CRTDOCL for the Option (OPTION) parameter.

To update a document list, specify *UPDDOCL for the OPTION parameter.

Parameters

Keyword	Description	Choices	Notes
OPTION	Option	*CRTDOCL, *UPDDOCL	Required, Positional 1
METHOD	Crawling method	*OBJECTS, *DETAIL	Optional
OBJECTS	URL and options objects	Element list	Optional
	Element 1: URL object	Character value
	Element 2: Options object	Character value
DOCLIST	Document list file	Path name	Optional
DOCDIR	Document storage directory	Path name, '/QIBM/USERDATA/HTTPSVR/INDEX/DOC'	Optional
LANG	Language of documents	*ARABIC, *BALTIC, *CENTEUROPE, *CYRILLIC, *ESTONIAN, *GREEK, *HEBREW, *JAPANESE, *KOREAN, *SIMPCHINESE, *TRADCHINESE, *THAI, *TURKISH, *WESTERN	Optional
URL	URL	Character value	Optional
URLFTR	URL filter	Character value, *NONE	Optional
MAXDEPTH	Maximum crawling depth	0-100, 3, *NOMAX	Optional
ENBROBOT	Enable robots	*YES, *NO	Optional
PRXSVR	Proxy server for HTTP	Character value, *NONE	Optional
PRXPORT	Proxy port for HTTP	1-65535	Optional
PRXSVRSSL	Proxy server for HTTPS	Character value, *NONE	Optional
PRXPORTSSL	Proxy port for HTTPS	1-65535	Optional
MAXSIZE	Maximum file size	1-6000, 1000, *NOMAX	Optional
MAXSTGSIZE	Maximum storage size	1-65535, 100, *NOMAX	Optional
MAXTHD	Maximum threads	1-50, 20	Optional
MAXRUNTIME	Maximum run time	Single values: *NOMAX; Other values: Element list	Optional
	Element 1: Hours	0-1000, 2
	Element 2: Minutes	0-59, 0
LOGFILE	Logging file	Path name, *NONE	Optional
CLRLOG	Clear logging file	*YES, *NO	Optional
VLDL	Validation list	Name, *NONE	Optional

Option (OPTION)

Specifies the document list task to perform.

This is a required parameter.

*CRTDOCL: Create a document list. If the file already exists, it will be replaced.
*UPDDOCL: Append additional document paths to a document list.
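
As a sketch, assuming the document list '/mydir/my.doclist' already exists from an earlier *CRTDOCL run and that http://www.example.com is a placeholder site, the following command appends the newly found document paths to that list instead of replacing it:

 STRHTTPCRL OPTION(*UPDDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') MAXDEPTH(1)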

Crawling method (METHOD)

Specifies the crawling method to use.

*DETAIL: Use specific values for crawling remote web sites such as the document storage directory, a URL, and a URL filter. These are the same values that are contained in a URL object and an options object.
*OBJECTS: Use a URL object and an options object for crawling web sites. These objects contain specific values used in the crawling process.

URL and options objects (OBJECTS)

Specifies the objects to use for crawling. Both must be specified. Use the Configure HTTP Search (CFGHTTPSCH) command to create the objects.

Element 1: URL object

character-value: Specify the name of the URL object to use.

Element 2: Options object

character-value: Specify the name of the options object to use.
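
For illustration, a crawl driven by objects might look like the following. The names MYURLOBJ and MYOPTOBJ are hypothetical and stand for a URL object and an options object created beforehand with the CFGHTTPSCH command. The sketch also assumes that DOCLIST can be specified together with OBJECTS.

 STRHTTPCRL OPTION(*CRTDOCL) METHOD(*OBJECTS)
    OBJECTS(MYURLOBJ MYOPTOBJ) DOCLIST('/mydir/my.doclist')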

Document storage directory (DOCDIR)

Specifies the directory to use to store the documents that are downloaded.

'/QIBM/USERDATA/HTTPSVR/INDEX/DOC': This directory is used to store the downloaded documents.
path-name: Specify the document storage directory path name.

Language of documents (LANG)

Specifies the language of the documents that are to be downloaded. These language choices are similar to the character sets or encodings that can be selected on a browser.

*WESTERN: The documents are in a Western language such as English, Finnish, French, Spanish, or German.
*ARABIC: The documents are in Arabic.
*BALTIC: The documents are in a Baltic language such as Latvian or Lithuanian.
*CENTEUROPE: The documents are in a Central European language such as Czech, Hungarian, Polish, Slovakian, or Slovenian.
*CYRILLIC: The documents are in a Cyrillic language such as Russian, Ukrainian, or Macedonian.
*ESTONIAN: The documents are in Estonian.
*GREEK: The documents are in Greek.
*HEBREW: The documents are in Hebrew.
*JAPANESE: The documents are in Japanese.
*KOREAN: The documents are in Korean.
*SIMPCHINESE: The documents are in Simplified Chinese.
*TRADCHINESE: The documents are in Traditional Chinese.
*THAI: The documents are in Thai.
*TURKISH: The documents are in Turkish.
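
A minimal sketch of crawling a Japanese-language site into its own storage directory follows. The URL, the document list path, and the directory '/mydir/docs/ja' are placeholders chosen for this example.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/ja.doclist')
    DOCDIR('/mydir/docs/ja') URL('http://www.example.com/ja/')
    LANG(*JAPANESE)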

Maximum crawling depth (MAXDEPTH)

Specifies the maximum depth to crawl from the starting URL. A value of zero stops crawling at the starting URL site. Each additional layer follows the links referenced within the current URL.

3: Referenced links will be crawled three layers deep.
*NOMAX: Referenced links will be crawled regardless of depth.
0-100: Specify the maximum crawling depth.
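
For instance, to download only what is found at the starting URL without following any referenced links, a depth of 0 can be specified. The URL and path shown are placeholders.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') MAXDEPTH(0)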

Enable robots (ENBROBOT)

Specifies whether to enable support for robot exclusion. If robot exclusion is enabled, any sites or pages that contain robot exclusion META tags or files will not be downloaded.

*YES: Enable support for robot exclusion.
*NO: Do not enable support for robot exclusion.
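
A minimal sketch of a crawl that honors robot exclusion, so that sites or pages marked with robot exclusion META tags or files are not downloaded (the URL and path are placeholders):

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') ENBROBOT(*YES)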

Proxy port for HTTPS (PRXPORTSSL)

Specifies the HTTPS proxy server port for SSL support.

1-65535: Specify the number of the HTTPS proxy server port for SSL support. This parameter is required if a proxy server name is specified for the Proxy server for HTTPS (PRXSVRSSL) parameter.
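
Because the port is required whenever an HTTPS proxy server is named, the two parameters are specified together. In this sketch, the host name proxy.example.com and port 8443 are hypothetical values, not defaults.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('https://www.example.com')
    PRXSVRSSL('proxy.example.com') PRXPORTSSL(8443)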

Maximum file size (MAXSIZE)

Specifies the maximum file size, in kilobytes, to download.

1000: Download files that are no greater than 1000 kilobytes.
*NOMAX: Files will be downloaded regardless of size.
1-6000: Specify the maximum file size to download, in kilobytes.

Maximum storage size (MAXSTGSIZE)

Specifies the maximum storage size, in megabytes, to allocate for downloaded files. Crawling will end when this limit is reached.

100: Up to 100 megabytes of storage will be used for downloaded files.
*NOMAX: No maximum storage size for downloaded files.
1-65535: Specify the maximum storage size, in megabytes, for downloaded files.

Maximum threads (MAXTHD)

Specifies the maximum number of threads to start for crawling web sites. Set this value based on the system resources that are available.

20: Start up to 20 threads for web crawling.
1-50: Specify the maximum number of threads to start.
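
The size and thread limits can be combined to keep a crawl small on a system with limited resources. The values below are illustrative only. With them, files larger than 500 kilobytes are not downloaded, crawling ends once 50 megabytes of storage have been used, and at most 5 threads are started.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') MAXSIZE(500)
    MAXSTGSIZE(50) MAXTHD(5)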

Maximum run time (MAXRUNTIME)

Specifies the maximum time for crawling to run, in hours and minutes.

Single values

*NOMAX: Run the crawling session until it completes normally or is ended by using the End HTTP Crawler (ENDHTTPCRL) command.

Element 1: Hours

2: Run the crawling session for 2 hours plus the number of minutes specified.
0-1000: Specify the number of hours to run the crawling session.

Element 2: Minutes

0: Run the crawling session for the number of hours specified.
*SAME: Use this value when you are updating the options object, but want to use the same maximum number of minutes to run.
0-59: Specify the number of minutes to run the crawling session. The crawling session will run for the number of hours specified in the first element of this parameter plus the number of minutes specified.
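
As an example of the two-element form, the following sketch limits a crawling session to 1 hour plus 30 minutes, or 90 minutes in total. The URL and path are placeholders.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') MAXRUNTIME(1 30)

Specifying MAXRUNTIME(*NOMAX) instead lets the session run until it completes normally or is ended with the End HTTP Crawler (ENDHTTPCRL) command.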

Logging file (LOGFILE)

Specifies the activity logging file to be used. This file contains information about the crawling session plus any errors that occur during the crawling session. This file must be in a directory.

*NONE: Do not use an activity log file.
path-name: Specify the path name of the logging file.
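
A sketch of a crawl that records its activity in a log file follows. The path '/mydir/logs/crawl.log' is a placeholder. CLRLOG(*YES) is assumed here to clear any existing content of that file for the new session, based only on the parameter's name, Clear logging file.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com')
    LOGFILE('/mydir/logs/crawl.log') CLRLOG(*YES)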

Examples

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.ibm.com') MAXDEPTH(2)

This command starts a new crawling session that follows referenced links up to 2 layers deep from the starting URL, http://www.ibm.com. The document list is created in '/mydir/my.doclist' and contains pairs consisting of a local path, for example '/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/us/index.html', and the actual URL of the page, for example 'http://www.ibm.com/us/'. Use the Configure HTTP Search (CFGHTTPSCH) command to create an index using this document list.