Webserver search engine on HTTP Server

This topic provides information about the Webserver search engine and national language considerations.

Important: Information for this topic supports the latest PTF levels for HTTP Server for i5/OS. It is recommended that you install the latest PTFs to upgrade to the latest level of the HTTP Server for i5/OS. Some of the topics documented here are not available prior to this update. See http://www.ibm.com/servers/eserver/iseries/software/http/services/service.htm for more information.

The Webserver search engine allows you to perform full text searches on HTML and text files. You can control what options are available to the user and how the search results are displayed through customized Net.Data® macros. You can enhance search results by using the thesaurus support. For information on configuring the search engine with the HTTP Server (powered by Apache), see Set up the Webserver search engine on HTTP Server (powered by Apache).

How it works

Before you can search, you must have an index. The index is a set of files that hold the contents of the documents to be searched, stored in a searchable form. The search engine queries the index instead of scanning the documents themselves.

A search index is created based upon a document list. A document list contains a list of fully qualified path names of all the documents that you want to index.
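As a rough illustration of what a document list holds, the following Python sketch collects fully qualified path names of HTML and text files under a directory. The function names, the extension filter, and the one-path-per-line layout are assumptions made for this example; they are not the product's documented file format, and the product builds document lists through its own forms and commands.

```python
import os
import tempfile


def build_document_list(root, extensions=(".html", ".htm", ".txt")):
    """Return sorted absolute paths of files under root with a matching extension."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Keep only file types the search engine is described as indexing.
            if name.lower().endswith(extensions):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)


def write_document_list(paths, list_file):
    """Write one path per line (assumed layout, for illustration only)."""
    with open(list_file, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")


if __name__ == "__main__":
    # Build a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        for name in ("index.html", "notes.txt", "logo.gif"):
            open(os.path.join(root, name), "w").close()
        for path in build_document_list(root):
            print(path)
```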

Documents that satisfy a search request are returned, by default, in order of ranking. A document's ranking indicates its relevance to the specified search conditions. The following factors determine a document's ranking:

It is possible that a document with one search term appearing toward the beginning of the document can have a higher ranking than a document with multiple search terms appearing near the end of the document. The search function assumes that words indicating the subject or topic of the document usually appear near the beginning of the document. The highest ranking a document can have is 100%. A document can achieve a ranking of 100% if relatively few of the documents in the index contain the search terms. If many documents in the index contain the search terms, it is likely that none of the documents would achieve a ranking of 100%.
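The behavior described above can be imitated with a toy scoring function. This is only a sketch under stated assumptions: the weighting formula (a logarithmic rarity factor multiplied by a position factor) and the name rank_documents are invented for illustration, and the search engine's actual ranking algorithm is not documented here. The sketch shows how a document with one search term near the beginning can outrank a document with several terms near the end.

```python
import math


def rank_documents(documents, terms):
    """Return document names ordered from highest to lowest toy score."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    total = len(tokenized)
    scores = {}
    for name, tokens in tokenized.items():
        score = 0.0
        for term in terms:
            term = term.lower()
            if term not in tokens:
                continue
            # Assumed rarity factor: terms found in fewer documents weigh more.
            holders = sum(1 for other in tokenized.values() if term in other)
            rarity = math.log(1 + total / holders)
            # Assumed position factor: 1.0 at the start, approaching 0 at the end.
            position = 1 - tokens.index(term) / len(tokens)
            score += rarity * position
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    docs = {
        "early": "tennis is played on a court with a net",
        "late": "this long page about sports finally mentions tennis racket",
    }
    print(rank_documents(docs, ["tennis", "racket"]))
```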

You can provide the following search functions through the customized Net.Data macros:

You can enhance search results through the use of the thesaurus support. A thesaurus contains words that are synonyms or related terms of a search word. For example, searching for Ping-Pong without thesaurus support results only in documents containing the string Ping-Pong. Using thesaurus support that includes synonyms for Ping-Pong, such as table tennis, results in documents containing either the string Ping-Pong or table tennis.
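The Ping-Pong example can be sketched in a few lines of Python. The dictionary-based thesaurus and the function names below are hypothetical stand-ins chosen for this illustration; they do not reflect the format of the product's thesaurus definition file or dictionary.

```python
def expand_terms(term, thesaurus):
    """Return the search term plus any synonyms listed for it."""
    key = term.lower()
    return [key] + [synonym.lower() for synonym in thesaurus.get(key, [])]


def search(documents, term, thesaurus=None):
    """Return names of documents containing the term or one of its synonyms."""
    needles = expand_terms(term, thesaurus or {})
    return [
        name
        for name, text in documents.items()
        if any(needle in text.lower() for needle in needles)
    ]


if __name__ == "__main__":
    docs = {
        "club.html": "Ping-Pong club schedule",
        "rules.html": "Table tennis rules",
        "chess.html": "Chess openings",
    }
    synonyms = {"ping-pong": ["table tennis"]}
    print(search(docs, "Ping-Pong"))            # without thesaurus support
    print(search(docs, "Ping-Pong", synonyms))  # with thesaurus support
```

Without the thesaurus, only the document containing the literal string matches; with it, the table tennis document is returned as well.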

The URL mapping rules file, built from your selected HTTP Server, is used to set the URL for each document returned by a search. It can specify the server port number (or instance) to use and can also map the resulting file path names to external path names.
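To make the idea of mapping file path names to external path names concrete, here is a minimal Python sketch. The rule representation (pairs of file path prefix and URL path prefix), the longest-prefix choice, the host name, and the function name are all assumptions for illustration; the actual layout of the URL mapping rules file is not described in this topic.

```python
def map_path_to_url(path, rules, host, port=80):
    """Map a file path to a URL using the longest matching prefix rule."""
    best = None
    for file_prefix, url_prefix in rules:
        if path.startswith(file_prefix):
            # Prefer the most specific (longest) matching file prefix.
            if best is None or len(file_prefix) > len(best[0]):
                best = (file_prefix, url_prefix)
    if best is None:
        return None  # no rule covers this document
    file_prefix, url_prefix = best
    authority = host if port == 80 else f"{host}:{port}"
    return f"http://{authority}{url_prefix}{path[len(file_prefix):]}"


if __name__ == "__main__":
    # Hypothetical rule: files under /www/docs/ are served under /library/.
    rules = [("/www/docs/", "/library/")]
    print(map_path_to_url("/www/docs/guide/intro.html", rules, "www.example.com", 8080))
```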

Sample files

Several sample files are shipped with the product to help you customize your own Web search function:

File | Description
/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_search.ndm | Sample Net.Data macro that you can customize.
/QIBM/ProdData/HTTP/Public/HTTPSVR/thesaurus_sample_search.ndm | Sample Net.Data macro with thesaurus support that you can customize.
/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_search.html | Sample search HTML file.
/QIBM/ProdData/HTTP/Public/HTTPSVR/HTML/ | Directory of sample HTML files that you can use to build a test search index.
/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_thesaurus.txt | Sample thesaurus definition file.

National language considerations

Documents that you are indexing can be encoded in most ASCII codepages and EBCDIC CCSIDs. Because the search engine does not support all CCSIDs, your documents might be converted to one of the supported CCSIDs during the indexing process. To see the CCSID used to index your documents, view the status of the search index.

Wildcard characters in search strings are not allowed for double-byte languages; a wildcard search is implied for these languages. Both the index name and the index directory name must be specified in single-byte characters. The contents of documents are often converted to one of the index CCSIDs listed below.

Documents in languages from the same set of included character sets can all be contained in the same index, as long as the documents are indexed separately. For example, an index can contain English and French documents: create the index with just the English documents, then update the index with the French documents. If you attempt to index Italian and Russian documents in the same index, an error occurs because the two languages cannot be converted to a common index CCSID. In this case you need to create two separate indexes. The following table describes the supported CCSIDs for indexes.

Index CCSID | Code page name | Included character sets (CCSIDs)
500 | Latin 1 | International Albanian, Belgian English, Belgian French, Canadian French MNCS, Danish, Dutch, Dutch MNCS, English International, English US, Finnish, French (France), French MNCS, German (Germany), German MNCS, Icelandic, Italian, Latin 1/Open Systems, Norwegian, Portuguese (Brazil), Portuguese (Portugal), Swedish
838 | Thai | Thai
870 | Latin 2 | Croatian, Czech, Hungarian, Polish, Romanian, Serbian (Latin), Slovak, Slovenian
1025 | Cyrillic | Bulgarian, Macedonian, Russian, Serbian (Cyrillic)
1026 | Latin 5 | Turkish
875 | Greek | Greek
424 | Hebrew | Hebrew
420 | Arabic | Arabic
1112 | Baltic | Latvian, Lithuanian
1122 | Estonian | Estonian
935 | Simplified Chinese (GB) | Simplified Chinese (GB)
1388 | Simplified Chinese (GBK) | Simplified Chinese (GBK)
937 | Traditional Chinese | Traditional Chinese
5026 (930) | Japanese Katakana | Japanese Katakana
5035 (939) | Japanese Latin | Japanese Latin
1364 (933) | Korean | Korean
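The rule that documents can share an index only if their languages convert to a common index CCSID can be expressed as a simple lookup. The Python sketch below copies a few rows from the table above; the dictionary name, the function name, and the choice of rows are illustrative assumptions, not part of the product.

```python
# A few rows copied from the table of supported index CCSIDs (not the full table).
INDEX_CCSIDS = {
    500: {"English US", "French (France)", "German (Germany)", "Italian"},
    870: {"Czech", "Hungarian", "Polish"},
    1025: {"Bulgarian", "Russian"},
}


def common_index_ccsid(languages):
    """Return an index CCSID that includes every language, or None if there is none."""
    wanted = set(languages)
    for ccsid, included in INDEX_CCSIDS.items():
        if wanted <= included:
            return ccsid
    return None


if __name__ == "__main__":
    # English and French share index CCSID 500, so one index is enough.
    print(common_index_ccsid(["English US", "French (France)"]))
    # Italian and Russian have no common index CCSID, so two indexes are needed.
    print(common_index_ccsid(["Italian", "Russian"]))
```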

Browser and CL command interface for the Webserver search engine and Web crawler

This table lists the browser form and CL command for each of the search engine and Web crawler tasks.

Task | Browser form | CL command
Create an index | Create search index | CFGHTTPSCH OPTION(*CRTIDX)
Update an index | Update search index | CFGHTTPSCH OPTION(*ADDDOC); CFGHTTPSCH OPTION(*RMVDOC)
Merge an index | Merge search index | CFGHTTPSCH OPTION(*MRGIDX)
Delete an index | Delete search index | CFGHTTPSCH OPTION(*DLTIDX)
V4R4 View the status of an index | View status of search index | CFGHTTPSCH OPTION(*PRTIDXSTS)
View the status of an index | View status of search index | CFGHTTPSCH OPTION(*PRTIDXSTS); see spoolfile QPZHASRCH
Create a document list | Start the web crawler; Build a document list | CFGHTTPSCH OPTION(*CRTDOCL) - local; STRHTTPCRL OPTION(*CRTDOCL) - web crawler
Add documents to a document list | Build a document list | CFGHTTPSCH OPTION(*UPDDOCL) - use for local documents; STRHTTPCRL OPTION(*UPDDOCL) - use for documents found with the web crawler
Stop a web crawling session | Work with document list status | ENDHTTPCRL
Pause a web crawling session | Work with document list status | ENDHTTPCRL
Resume a web crawling session | Work with document list status | RSMHTTPCRL
Register a document list created before V4R5 | Register document list | CFGHTTPSCH OPTION(*REGDOCL)
Delete a document list | Delete document list | CFGHTTPSCH OPTION(*DLTDOCL)
Display information about a document list | Work with document list status | CFGHTTPSCH OPTION(*PRTDOCLSTS); see spoolfile QPZHASRCH
Create a URL mapping rules file | Build URL mapping rules file | CFGHTTPSCH OPTION(*CRTMAPF)
Append a URL mapping rules file | Build URL mapping rules file | CFGHTTPSCH OPTION(*UPDMAPF)
Build a thesaurus dictionary | Build thesaurus dictionary | CFGHTTPSCH OPTION(*CRTTHSDCT)
Test a thesaurus dictionary | Test thesaurus dictionary | None
Retrieve a thesaurus definition from a dictionary | Retrieve thesaurus definition | CFGHTTPSCH OPTION(*RTVTHSDFNF)
Delete a thesaurus dictionary | Delete thesaurus dictionary | CFGHTTPSCH OPTION(*DLTTHSDCT)
Create a list of URLs to crawl | Build URL object | CFGHTTPSCH OPTION(*CRTURLOBJ)
Update a list of URLs to crawl | Build URL object | CFGHTTPSCH OPTION(*UPDURLOBJ)
Delete a list of URLs to crawl | Delete URL object | CFGHTTPSCH OPTION(*DLTURLOBJ)
Create an object containing crawling attributes | Build options object | CFGHTTPSCH OPTION(*CRTOPTOBJ)
Update an object containing crawling attributes | Build options object | CFGHTTPSCH OPTION(*UPDOPTOBJ)
Build an object with userid and passwords for authentication | Build validation list | CFGHTTPSCH OPTION(*CRTVLDL)
Add userids and passwords for authentication | Build validation list | CFGHTTPSCH OPTION(*ADDVLDLDTA)
Remove userids and passwords for authentication | Build validation list | CFGHTTPSCH OPTION(*RMVVLDLDTA)
Search an index | Search index | None