Webserver search engine on HTTP Server

How it works

Before you can search, you must have an index. The index is a set of files that hold the contents of the documents to be searched, in a searchable form. The search engine queries the index instead of scanning the actual documents.

A search index is created based upon a document list. A document list contains a list of fully qualified path names of all the documents that you want to index.
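As a rough illustration of what a document list holds, the following sketch walks a directory tree and collects fully qualified path names. The function names, the extension filter, and the one-path-per-line output format are assumptions made for this example; they are not the format the product itself requires.

```python
import os

def build_document_list(root_dir, extensions=(".html", ".htm", ".txt")):
    """Return the sorted, fully qualified path names of matching files under root_dir."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # str.endswith accepts a tuple of suffixes.
            if name.lower().endswith(extensions):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)

def write_document_list(paths, list_file):
    """Write one path per line (an assumed layout, for illustration only)."""
    with open(list_file, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
```

For example, `build_document_list("/QIBM/ProdData/HTTP/Public/HTTPSVR/HTML")` would return the paths of the sample HTML files if that directory exists on the system where the script runs.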

Documents satisfying a search request are returned by default in their order of ranking. A document's ranking indicates its relevance to the specified search conditions. The following factors determine a document's ranking:

Frequency of search terms in the document - As the search words appear more frequently in the document, the ranking gets higher.
Position of search terms in the document - As the search words appear closer to the beginning of the document, the ranking gets higher.
Frequency of search terms in the whole set of documents - As the search words appear less frequently within the documents in the entire index, the ranking for documents that have search words gets higher.

It is possible that a document with one search term appearing toward the beginning of the document can have a higher ranking than a document with multiple search terms appearing near the end of the document. The search function assumes that words indicating the subject or topic of the document usually appear near the beginning of the document. The highest ranking a document can have is 100%. A document can achieve a ranking of 100% if relatively few of the documents in the index contain the search terms. If many documents in the index contain the search terms, it is likely that none of the documents would achieve a ranking of 100%.
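The interplay of the three ranking factors can be sketched with a toy scoring function. This is not the formula the search engine uses, which is not documented here; the weights, the `rank_documents` name, and the whitespace tokenizer are all assumptions chosen to show how frequency, position, and rarity might combine.

```python
import math

def rank_documents(documents, terms):
    """documents maps a name to its text. Returns (name, score) pairs, best first."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    total = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        score = 0.0
        for term in terms:
            term = term.lower()
            positions = [i for i, word in enumerate(words) if word == term]
            if not positions:
                continue
            # Factor 1: how often the term appears in this document.
            frequency = len(positions) / len(words)
            # Factor 2: how close the first occurrence is to the beginning.
            earliness = 1.0 - positions[0] / len(words)
            # Factor 3: terms found in fewer documents of the index count more.
            containing = sum(1 for ws in tokenized.values() if term in ws)
            rarity = math.log(1 + total / containing)
            score += frequency * earliness * rarity
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With this toy formula, a ten-word document that starts with the search term outranks a ten-word document that contains the term twice at the very end, which mirrors the behavior described above.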

You can provide the following search functions through the customized Net.Data macros:

Exact search - 100% of the letters match. For example, street returns street, Street, and STREET.
Fuzzy search - 60% of the letters match. For example, street returns street, streets, treat, and Tree.
Wildcard search - an asterisk (*) is replaced by zero or more letters and a question mark (?) is replaced by exactly one letter. For example, jump* returns jump, jumps, Jumping, and jumper.
Proximity search - two or more words in the same sentence.
English word stemming - for example, knife returns knife and knives.
Case sensitive search - for example, Street returns Street, not street.
Boolean search (simple) - for example, A and B and C.
Boolean search (advanced) - for example, (A and (B or C) not D).
Document ranking - documents are automatically sorted according to ranking.
Thesaurus support - finds synonyms or related terms of a search word.
Search within results - search within returned search results only.
Simple and advanced search
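The exact, case-sensitive, and wildcard behaviors in the list above can be approximated in a few lines. This is a sketch of the matching semantics only, not the Net.Data macro interface or the engine's implementation; the function names are hypothetical, and `\w` is used as an approximation of "letter" (it also matches digits and underscores).

```python
import re

def wildcard_to_regex(pattern):
    """Translate * (zero or more letters) and ? (exactly one letter) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    # Wildcard matching is case-insensitive in the example from the list above.
    return re.compile("".join(parts), re.IGNORECASE)

def wildcard_search(pattern, words):
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]

def exact_search(term, words, case_sensitive=False):
    """Return the words equal to term, ignoring case unless case_sensitive is set."""
    if case_sensitive:
        return [word for word in words if word == term]
    return [word for word in words if word.lower() == term.lower()]
```

Calling `wildcard_search("jump*", ["jump", "jumps", "Jumping", "jumper", "bump"])` keeps the first four words and drops bump, and `exact_search("Street", words, case_sensitive=True)` keeps only Street.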

You can enhance search results through the use of the thesaurus support. A thesaurus contains words that are synonyms or related terms of a search word. For example, searching for Ping-Pong without thesaurus support results only in documents containing the string Ping-Pong. Using thesaurus support that includes synonyms for Ping-Pong, such as table tennis, results in documents containing either the string Ping-Pong or table tennis.
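The Ping-Pong example can be sketched as a simple term expansion step before matching. The dictionary below is a hypothetical in-memory stand-in for a thesaurus; it does not reflect the format of the thesaurus definition file or dictionary that the product builds.

```python
def expand_terms(term, thesaurus):
    """Return the search term followed by its synonyms or related terms."""
    key = term.lower()
    related = thesaurus.get(key, [])
    return [term] + [r for r in related if r.lower() != key]

def search_with_thesaurus(term, documents, thesaurus):
    """Return names of documents containing the term or any related term."""
    candidates = [t.lower() for t in expand_terms(term, thesaurus)]
    return [
        name
        for name, text in documents.items()
        if any(candidate in text.lower() for candidate in candidates)
    ]

# Hypothetical thesaurus entry matching the example in the text.
SAMPLE_THESAURUS = {"ping-pong": ["table tennis"]}
```

Searching for Ping-Pong with `SAMPLE_THESAURUS` returns documents that mention either Ping-Pong or table tennis; passing an empty dictionary returns only documents that contain Ping-Pong itself.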

The URL mapping rules file, built from your selected HTTP Server, is used to set the URL for each document found on a search. It can specify the server port number (or instance) to use and can also map resulting file path names to external path names.
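Conceptually, URL mapping turns the file path stored in the index into a URL that a browser can request. The sketch below assumes a rule is a pair of prefixes and that the longest matching file prefix wins; the rule layout, host name, and paths are hypothetical and are not the syntax of the URL mapping rules file.

```python
def map_path_to_url(path, rules, host="www.example.com", port=80):
    """Map a file path to a URL using (file_prefix, url_prefix) rules.

    The longest matching file_prefix wins. Returns None if no rule matches.
    """
    for file_prefix, url_prefix in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if path.startswith(file_prefix):
            remainder = path[len(file_prefix):]
            # Only show the port when it is not the HTTP default.
            authority = host if port == 80 else f"{host}:{port}"
            return f"http://{authority}{url_prefix}{remainder}"
    return None

# Hypothetical rules: internal directories mapped to external URL paths.
SAMPLE_RULES = [
    ("/www/docs/", "/"),
    ("/www/docs/manuals/", "/help/"),
]
```

With these rules, `/www/docs/manuals/a.html` served on port 8080 maps to `http://www.example.com:8080/help/a.html`, while a path outside `/www/docs/` has no mapping.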

Sample files

Several sample files are shipped with the product that you can use to customize your own Web search function:

File	Description
`/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_search.ndm`	Sample Net.Data macro that you can customize.
`/QIBM/ProdData/HTTP/Public/HTTPSVR/thesaurus_sample_search.ndm`	Sample Net.Data macro with thesaurus support that you can customize.
`/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_search.html`	Sample search HTML file.
`/QIBM/ProdData/HTTP/Public/HTTPSVR/HTML/`	Directory of sample HTML files that you can use to build a test search index.
`/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_thesaurus.txt`	Sample thesaurus definition file.

National language considerations

Documents that you are indexing can be encoded in most ASCII codepages and EBCDIC CCSIDs. Because the search engine does not support all CCSIDs, your documents might be converted to one of the supported CCSIDs during the indexing process. To see the CCSID used to index your documents, view the status of the search index.

Wildcard characters are not allowed in search strings for double-byte languages, because a wildcard search is implied for those languages. Both the index name and the index directory name must be specified in single-byte characters. The contents of documents are often converted to one of the index CCSIDs listed below.

Documents in languages from the included character sets can all be contained in the same index, as long as the documents are indexed separately. For example, an index can contain English and French documents. Create the index including just the English documents, then update the index with the French documents. If you attempt to index Italian and Russian documents in the same index, an error will occur since the two languages cannot be converted to a common index CCSID. In this case you would need to create two separate indexes. The following table describes the supported CCSIDs for indexes.

Index CCSID	Code page name	Included character sets (CCSIDs)
`500`	Latin 1	International Albanian, Belgian English, Belgian French, Canadian French MNCS, Danish, Dutch, Dutch MNCS, English International, English US, Finnish, French (France), French MNCS, German (Germany), German MNCS, Icelandic, Italian, Latin 1/Open Systems, Norwegian, Portuguese (Brazil), Portuguese (Portugal), Swedish
`838`	Thai	Thai
`870`	Latin 2	Croatian, Czech, Hungarian, Polish, Romanian, Serbian (Latin), Slovak, Slovenian
`1025`	Cyrillic	Bulgarian, Macedonian, Russian, Serbian (Cyrillic)
`1026`	Latin 5	Turkish
`875`	Greek	Greek
`424`	Hebrew	Hebrew
`420`	Arabic	Arabic
`1112`	Baltic	Latvian, Lithuanian
`1122`	Estonian	Estonian
`935`	Simplified Chinese (GB)	Simplified Chinese (GB)
`1388`	Simplified Chinese (GBK)	Simplified Chinese (GBK)
`937`	Traditional Chinese	Traditional Chinese
`5026 (930)`	Japanese Katakana	Japanese Katakana
`5035 (939)`	Japanese Latin	Japanese Latin
`1364 (933)`	Korean	Korean
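The rule that languages can share an index only when they convert to the same index CCSID can be checked before indexing. The following sketch copies a few rows from the table above into a lookup; the `plan_indexes` and `can_share_index` helpers are hypothetical and only illustrate the grouping, they are not part of the product.

```python
# A partial lookup taken from the table above: language -> index CCSID.
INDEX_CCSID_BY_LANGUAGE = {
    "English US": 500,
    "French (France)": 500,
    "German (Germany)": 500,
    "Italian": 500,
    "Czech": 870,
    "Polish": 870,
    "Bulgarian": 1025,
    "Russian": 1025,
    "Turkish": 1026,
    "Greek": 875,
}

def plan_indexes(languages):
    """Group languages by index CCSID; each group needs its own index."""
    plan = {}
    for language in languages:
        ccsid = INDEX_CCSID_BY_LANGUAGE.get(language)
        if ccsid is None:
            raise KeyError(f"no index CCSID known for {language!r}")
        plan.setdefault(ccsid, []).append(language)
    return plan

def can_share_index(languages):
    """True if all languages convert to one common index CCSID."""
    return len(plan_indexes(languages)) <= 1
```

English and French both map to CCSID 500, so `can_share_index(["English US", "French (France)"])` is true. Italian and Russian map to 500 and 1025, so `plan_indexes(["Italian", "Russian"])` yields two groups, one per required index.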

Browser and CL command interface for the Webserver search engine and Web crawler

This table shows the browser and CL command interface to all of the search engine and web crawling tasks.

Task	Browser form	CL command
Create an index	Create search index	CFGHTTPSCH OPTION(*CRTIDX)
Update an index	Update search index	CFGHTTPSCH OPTION(*ADDDOC) CFGHTTPSCH OPTION(*RMVDOC)
Merge an index	Merge search index	CFGHTTPSCH OPTION(*MRGIDX)
Delete an index	Delete search index	CFGHTTPSCH OPTION(*DLTIDX)
View the status of an index	View status of search index	CFGHTTPSCH OPTION(*PRTIDXSTS) See spoolfile QPZHASRCH
Create a document list or start the web crawler	Build a document list	CFGHTTPSCH OPTION(*CRTDOCL) Use for local documents. STRHTTPCRL OPTION(*CRTDOCL) Use for documents found with the web crawler.
Add documents to a document list	Build a document list	CFGHTTPSCH OPTION(*UPDDOCL) Use for local documents. STRHTTPCRL OPTION(*UPDDOCL) Use for documents found with the web crawler.
Stop a web crawling session	Work with document list status	ENDHTTPCRL
Pause a web crawling session	Work with document list status	ENDHTTPCRL
Resume a web crawling session	Work with document list status	RSMHTTPCRL
Register a document list created before V4R5	Register document list	CFGHTTPSCH OPTION(*REGDOCL)
Delete a document list	Delete document list	CFGHTTPSCH OPTION(*DLTDOCL)
Display information about a document list	Work with document list status	CFGHTTPSCH OPTION(*PRTDOCLSTS) See spoolfile QPZHASRCH
Create a URL mapping rules file	Build URL mapping rules file	CFGHTTPSCH OPTION(*CRTMAPF)
Append a URL mapping rules file	Build URL mapping rules file	CFGHTTPSCH OPTION(*UPDMAPF)
Build a thesaurus dictionary	Build thesaurus dictionary	CFGHTTPSCH OPTION(*CRTTHSDCT)
Test a thesaurus dictionary	Test thesaurus dictionary	None.
Retrieve a thesaurus definition from a dictionary	Retrieve thesaurus definition	CFGHTTPSCH OPTION(*RTVTHSDFNF)
Delete a thesaurus dictionary	Delete thesaurus dictionary	CFGHTTPSCH OPTION(*DLTTHSDCT)
Create a list of URLs to crawl	Build URL object	CFGHTTPSCH OPTION(*CRTURLOBJ)
Update a list of URLs to crawl	Build URL object	CFGHTTPSCH OPTION(*UPDURLOBJ)
Delete a list of URLs to crawl	Delete URL object	CFGHTTPSCH OPTION(*DLTURLOBJ)
Create an object containing crawling attributes	Build options object	CFGHTTPSCH OPTION(*CRTOPTOBJ)
Update an object containing crawling attributes	Build options object	CFGHTTPSCH OPTION(*UPDOPTOBJ)
Build an object with user IDs and passwords for authentication	Build validation list	CFGHTTPSCH OPTION(*CRTVLDL)
Add user IDs and passwords for authentication	Build validation list	CFGHTTPSCH OPTION(*ADDVLDLDTA)
Remove user IDs and passwords for authentication	Build validation list	CFGHTTPSCH OPTION(*RMVVLDLDTA)
Search an index	Search index	None