Start HTTP Crawler (STRHTTPCRL)

Where allowed to run: All environments (*ALL)
Threadsafe: No

The Start HTTP Crawler (STRHTTPCRL) command creates or appends to a document list by crawling remote web sites, downloading the files found, and saving their path names in the specified document list.

To create a document list, specify *CRTDOCL for the Option (OPTION) parameter.

To update a document list, specify *UPDDOCL for the OPTION parameter.

Top

Parameters

Keyword | Description | Choices | Notes
OPTION | Option | *CRTDOCL, *UPDDOCL | Required, Positional 1
METHOD | Crawling method | *OBJECTS, *DETAIL | Optional
OBJECTS | URL and options objects | Element list | Optional
  | Element 1: URL object | Character value |
  | Element 2: Options object | Character value |
DOCLIST | Document list file | Path name | Optional
DOCDIR | Document storage directory | Path name, '/QIBM/USERDATA/HTTPSVR/INDEX/DOC' | Optional
LANG | Language of documents | *ARABIC, *BALTIC, *CENTEUROPE, *CYRILLIC, *ESTONIAN, *GREEK, *HEBREW, *JAPANESE, *KOREAN, *SIMPCHINESE, *TRADCHINESE, *THAI, *TURKISH, *WESTERN | Optional
URL | URL | Character value | Optional
URLFTR | URL filter | Character value, *NONE | Optional
MAXDEPTH | Maximum crawling depth | 0-100, 3, *NOMAX | Optional
ENBROBOT | Enable robots | *YES, *NO | Optional
PRXSVR | Proxy server for HTTP | Character value, *NONE | Optional
PRXPORT | Proxy port for HTTP | 1-65535 | Optional
PRXSVRSSL | Proxy server for HTTPS | Character value, *NONE | Optional
PRXPORTSSL | Proxy port for HTTPS | 1-65535 | Optional
MAXSIZE | Maximum file size | 1-6000, 1000 | Optional
MAXSTGSIZE | Maximum storage size | 1-65535, 100, *NOMAX | Optional
MAXTHD | Maximum threads | 1-50, 20 | Optional
MAXRUNTIME | Maximum run time | Single values: *NOMAX; Other values: Element list | Optional
  | Element 1: Hours | 0-1000, 2 |
  | Element 2: Minutes | 0-59, 0 |
LOGFILE | Logging file | Path name, *NONE | Optional
CLRLOG | Clear logging file | *YES, *NO | Optional
VLDL | Validation list | Name, *NONE | Optional
Top

Option (OPTION)

Specifies the document list task to perform.

This is a required parameter.

*CRTDOCL
Create a document list. If the file already exists, it will be replaced.
*UPDDOCL
Append additional document paths to a document list.
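
A minimal sketch of appending to an existing document list. The document list path is the one used in the Examples section; the URL 'http://www.example.com' is a placeholder and is not taken from this documentation.

 STRHTTPCRL OPTION(*UPDDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com')

Under these assumptions, the path names of newly downloaded documents are added to '/mydir/my.doclist' instead of replacing the file, which is what *CRTDOCL would do.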
Top

Crawling method (METHOD)

Specifies the crawling method to use.

*DETAIL
Use specific values for crawling remote web sites such as the document storage directory, a URL, and a URL filter. These are the same values that are contained in a URL object and an options object.
*OBJECTS
Use a URL object and an options object for crawling web sites. These objects contain specific values used in the crawling process.
Top

URL and options objects (OBJECTS)

Specifies the objects to use for crawling. Both must be specified. Use the Configure HTTP Search (CFGHTTPSCH) command to create the objects.

Element 1: URL object

character-value
Specify the name of the URL object to use.

Element 2: Options object

character-value
Specify the name of the options object to use.
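
As an illustration, the following sketch crawls by using a URL object and an options object. The names MYURLS and MYOPTS are hypothetical and assume that both objects were already created with the CFGHTTPSCH command. Specifying DOCLIST together with METHOD(*OBJECTS) is also an assumption.

 STRHTTPCRL OPTION(*CRTDOCL) METHOD(*OBJECTS)
    OBJECTS(MYURLS MYOPTS) DOCLIST('/mydir/my.doclist')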
Top

Document list file (DOCLIST)

Specifies the document list file to hold the path names of the documents found by crawling remote web sites.

path-name
Specify the document list file path name.
Top

Document storage directory (DOCDIR)

Specifies the directory to use to store the documents that are downloaded.

'/QIBM/USERDATA/HTTPSVR/INDEX/DOC'
This directory is used to store the downloaded documents.
path-name
Specify the document storage directory path name.
Top

Language of documents (LANG)

Specifies the language of the documents that are to be downloaded. These language choices are similar to the character sets or encodings that can be selected on a browser.

*WESTERN
The documents are in a Western language such as English, Finnish, French, Spanish, or German.
*ARABIC
The documents are in Arabic.
*BALTIC
The documents are in a Baltic language such as Latvian or Lithuanian.
*CENTEUROPE
The documents are in a Central European language such as Czech, Hungarian, Polish, Slovakian, or Slovenian.
*CYRILLIC
The documents are in a Cyrillic language such as Russian, Ukrainian, or Macedonian.
*ESTONIAN
The documents are in Estonian.
*GREEK
The documents are in Greek.
*HEBREW
The documents are in Hebrew.
*JAPANESE
The documents are in Japanese.
*KOREAN
The documents are in Korean.
*SIMPCHINESE
The documents are in Simplified Chinese.
*TRADCHINESE
The documents are in Traditional Chinese.
*THAI
The documents are in Thai.
*TURKISH
The documents are in Turkish.
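
For example, a crawl of a site whose pages are in Japanese might be sketched as follows. The URL and the document list path are placeholders.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') LANG(*JAPANESE)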
Top

URL (URL)

Specifies the URL (Uniform Resource Locator) to crawl.

character-value
Specify the URL to crawl.
Top

URL filter (URLFTR)

Specifies the domain filter that limits crawling to sites within the specified domain.

*NONE
No filtering is done based on domain.
character-value
Specify the domain filter to limit crawling.
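
A sketch of restricting a crawl to a single domain. The filter value 'example.com' is a placeholder, and its exact format (a bare domain name) is an assumption that this documentation does not confirm.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') URLFTR('example.com')

With a filter like this, links that lead outside the specified domain would be expected to be skipped.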
Top

Maximum crawling depth (MAXDEPTH)

Specifies the maximum depth to crawl from the starting URL. A value of 0 stops crawling at the starting URL site. Each additional layer follows the links referenced within the current URL.

3
Referenced links will be crawled three layers deep.
*NOMAX
Referenced links will be crawled regardless of depth.
0-100
Specify the maximum crawling depth.
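
To illustrate the lower end of the range, the following sketch sets the depth to 0. The URL and the document list path are placeholders.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') MAXDEPTH(0)

Based on the description above, this session stops at the starting URL site and does not follow referenced links any further. Omitting MAXDEPTH uses the default of 3 layers.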
Top

Enable robots (ENBROBOT)

Specifies whether to enable support for robot exclusion. If robot exclusion is enabled, sites or pages that contain robot exclusion META tags or files are not downloaded.

*YES
Enable support for robot exclusion.
*NO
Do not enable support for robot exclusion.
Top

Proxy server for HTTP (PRXSVR)

Specifies the HTTP proxy server to be used.

*NONE
Do not use an HTTP proxy server.
character-value
Specify the name of the HTTP proxy server.
Top

Proxy port for HTTP (PRXPORT)

Specifies the HTTP proxy server port.

1-65535
Specify the number of the HTTP proxy server port. This parameter is required if a proxy server name is specified for the Proxy server for HTTP (PRXSVR) parameter.
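
A sketch of crawling through an HTTP proxy. The host name 'proxy.example.com' and port 8080 are hypothetical values chosen for illustration.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com')
    PRXSVR('proxy.example.com') PRXPORT(8080)

PRXSVR and PRXPORT are specified together because the port is required whenever a proxy server name is given.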
Top

Proxy server for HTTPS (PRXSVRSSL)

Specifies the HTTPS proxy server for using SSL support.

*NONE
Do not use an HTTPS proxy server.
character-value
Specify the name of the HTTPS proxy server for SSL support.
Top

Proxy port for HTTPS (PRXPORTSSL)

Specifies the HTTPS proxy server port for SSL support.

1-65535
Specify the number of the HTTPS proxy server port for SSL support. This parameter is required if a proxy server name is specified for the Proxy server for HTTPS (PRXSVRSSL) parameter.
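
The HTTPS proxy parameters follow the same pattern as the HTTP ones. In this sketch the host name 'proxy.example.com', port 8443, and the starting URL are all hypothetical.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('https://www.example.com')
    PRXSVRSSL('proxy.example.com') PRXPORTSSL(8443)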
Top

Maximum file size (MAXSIZE)

Specifies the maximum file size, in kilobytes, to download.

1000
Download files that are no greater than 1000 kilobytes.
*NOMAX
Files will be downloaded regardless of size.
1-6000
Specify the maximum file size to download, in kilobytes.
Top

Maximum storage size (MAXSTGSIZE)

Specifies the maximum storage size, in megabytes, to allocate for downloaded files. Crawling will end when this limit is reached.

100
Up to 100 megabytes of storage will be used for downloaded files.
*NOMAX
No maximum storage size for downloaded files.
1-65535
Specify the maximum storage size, in megabytes, for downloaded files.
Top

Maximum threads (MAXTHD)

Specifies the maximum number of threads to start for crawling web sites. Set this value based on the system resources that are available.

20
Start up to 20 threads for web crawling.
1-50
Specify the maximum number of threads to start.
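
The size and thread limits can be combined in one command. The following sketch uses illustrative values within the documented ranges; the URL and the document list path are placeholders.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com')
    MAXSIZE(2000) MAXSTGSIZE(500) MAXTHD(10)

Under these settings, files larger than 2000 kilobytes are not downloaded, crawling ends once 500 megabytes of downloaded files are stored, and at most 10 threads are started.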
Top

Maximum run time (MAXRUNTIME)

Specifies the maximum time for crawling to run, in hours and minutes.

Single values

*NOMAX
Run the crawling session until it completes normally or is ended by using the End HTTP Crawler (ENDHTTPCRL) command.

Element 1: Hours

2
Run the crawling session for 2 hours plus the number of minutes specified.
0-1000
Specify the number of hours to run the crawling session.

Element 2: Minutes

0
Run the crawling session for the number of hours specified.
*SAME
Use this value when you are updating the options object, but want to use the same maximum number of minutes to run.
0-59
Specify the number of minutes to run the crawling session. The crawling session will run for the number of hours specified in the first element of this parameter plus the number of minutes specified.
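
As a worked example of the two elements, the following sketch limits a session to 1 hour and 30 minutes. The URL and the document list path are placeholders.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') MAXRUNTIME(1 30)

The first element supplies the hours and the second the minutes, so the limit here is 90 minutes in total. Specifying MAXRUNTIME(*NOMAX) instead removes the time limit.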
Top

Logging file (LOGFILE)

Specifies the activity logging file to be used. This file contains information about the crawling session plus any errors that occur during the crawling session. This file must be in a directory.

*NONE
Do not use an activity log file.
path-name
Specify the path name of the logging file.
Top

Clear logging file (CLRLOG)

Specifies whether to clear the activity log file before starting the crawling session.

*YES
Always clear the activity log file before each crawling session.
*NO
Do not clear the activity log file.
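
A sketch of logging a crawling session to a file that is cleared at the start of each run. The log path '/mydir/crawl.log' is hypothetical, as are the URL and the document list path.

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com')
    LOGFILE('/mydir/crawl.log') CLRLOG(*YES)

Specifying CLRLOG(*NO) instead leaves the existing contents of the log file in place.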
Top

Validation list (VLDL)

Specifies the validation list to use for SSL sessions. Use the Configure HTTP Search (CFGHTTPSCH) command to create a validation list object.

*NONE
Do not use a validation list object.
name
Specify the name of the validation list.
Top

Examples

 STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.ibm.com') MAXDEPTH(2)

This command starts a new crawling session that follows referenced links up to 2 layers from the starting URL, http://www.ibm.com. The document list is created in '/mydir/my.doclist'. Each entry pairs a local path, for example '/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/us/index.html', with the actual URL of the page, 'http://www.ibm.com/us/'. Use the Configure HTTP Search (CFGHTTPSCH) command to create an index using this document list.

Top

Error messages

*ESCAPE Messages

HTP160C
Request to create or append to a document list failed. Reason &1.
HTP166E
Request to print the status of a document list failed. Reason &1.
Top