Hints and tips for Portal Search crawls

View some useful tips about Portal Search crawls. For example, crawling can require extended memory and time, depending on your Portal Search environment and configuration.

HTTP crawler does not support JavaScript

The HTTP crawler of the Portal Search Service does not support JavaScript. Therefore, some text of web documents might not be accessible for search by users. Accessibility depends on how the text is prepared for presentation in the browser. Specifically, text that is generated by JavaScript might or might not be available for search.
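As a rough illustration of why script-generated text can be missed, the following sketch uses Python's standard html.parser module to collect only the text that is present in the static HTML, which is roughly what a crawler that does not run JavaScript would see. The sample page, the class name StaticTextExtractor, and the function static_text are hypothetical and are not part of Portal Search.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects text outside <script> and <style>; no JavaScript is executed."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style source.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text a non-JavaScript crawler could see in this HTML."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the "Warranty terms" text exists only after the script runs.
SAMPLE_PAGE = """
<html><body>
<h1>Product overview</h1>
<div id="details"></div>
<script>
  document.getElementById("details").textContent = "Warranty terms";
</script>
</body></html>
"""

extracted = static_text(SAMPLE_PAGE)
print(extracted)
```

In this sample only the heading text is extracted. The text that the script would insert into the page in a browser never appears in the static HTML, so it would not be searchable.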

Crawling a portal site for the first time can result in a message

Starting a crawl on a portal site for the first time can result in the following message:
     EJPJP0009E: Wrong root url for Portal site crawler: https://root_url
You can ignore this message. The crawl runs correctly.

If documents time out while they are being fetched during a crawl, edit the content source, select the General Parameters tab, and then set the parameter Stop fetching documents after (seconds): to a value of 90 seconds.

Memory required for crawls

Depending on your Portal Search environment, crawling can require large amounts of memory. Therefore, before you start a crawl, make sure that HCL Portal has enough free memory. A memory shortage can corrupt the search collection and eventually lead to a system freeze.

If a crawl fails because too many files are open, raise the limit on the number of open files by using the ulimit command as the root user.
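Before you change the limit, it can help to see what the current limit is. The following minimal sketch reads the open-file limit through Python's standard resource module, which is available on UNIX-like systems only. It reports the limit of the Python process that runs it, which matches the portal server process only if both are started by the same user in the same environment. The helper name describe is hypothetical.

```python
import resource

# The soft limit is the value currently enforced for this process. The hard
# limit is the ceiling up to which the soft limit can be raised; raising the
# hard limit itself normally requires root privileges.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping the 'no limit' constant to a readable word."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


print("open files (soft):", describe(soft_limit))
print("open files (hard):", describe(hard_limit))
```

The soft value corresponds to what ulimit -n reports in a shell that is started under the same conditions.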

Because crawling and indexing consume considerable resources, schedule crawls for times when user activity is relatively low.

Time required for crawls and imports and availability of documents

The following search administration tasks can require extended periods of time:

  • Crawling a content source. Documents might not be immediately available for searching or browsing during the crawl.
  • Indexing the documents fetched by a crawl. When a crawl is complete and all documents are collected, building the index takes some more time.
  • Importing a search collection. When you import data to a collection, it can take some time until the content sources for the collection are shown in the Content Sources in Collection box and the documents of the imported collection are available for crawling.

These tasks are put in a queue. Therefore, it might take several minutes until they run and the respective time counters start, for example, the crawl Run time and the crawl timeout that is set by the option Stop collecting after (minutes):. The time that is required for these tasks is further influenced by the following factors:

  • The number of documents in the content source that is being crawled
  • The size of the documents in the content source that is being crawled
  • Speed and availability of your processors, hard drive storage systems, and network connection.
  • The value that you selected from the Stop collecting after (minutes): drop-down menu when you created or edited the content source.

Therefore, both the time limits that you can specify and the times that are shown for these processes work as approximate time limits. For example, these time limits apply to the following scenarios:

  • When you start a crawl by selecting a content source in the Content Sources in Collection box and clicking Start collecting.
  • When you import a search collection and when you start a crawl on the imported search collection.
  • When an installation is complete and you initialize the pre-configured portal site collection by selecting the portal site content source and clicking Start collecting.
  • When you check the time that is shown under Last update completed in the collection status information. This time can be later than you might expect because building the index takes additional time.

Furthermore, these time limits influence other status indicators given in the Manage Search portlet. For example, the number of documents that are shown for a content source could be unexpectedly low or even zero until the crawl on that content source is complete.

Refreshing different types of content sources

Clicking Start Crawler updates the contents of the content source through a new run of the crawler. During the run, the icon changes to Stop Crawler, which you can click to end the run. Portal Search refreshes different content sources as follows:
  • For website content sources, documents that were indexed before and still exist in the content source are updated. Documents that were indexed before, but no longer exist are retained in the search collection. Documents that are new in the content source are indexed and added to the collection.
  • For HCL Portal sites, the crawl adds all pages and portlets to the content source. It deletes portlets and static pages from the content source that were removed from the portal. The crawl works similarly to the option Regather documents from Content Source.
  • For HCL Web Content Manager sites, Portal Search uses an incremental crawling method. In addition to added and updated content, the seedlist explicitly specifies deleted content. In contrast, clicking Regather documents from Content Source starts a full crawl. It does not continue from the last session, and it is not incremental.
  • For content sources created with the seedlist provider option, a crawl on a remote system that supports incremental crawling, such as HCL Connections, behaves like a crawl on a Web Content Manager site.
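The differences between these refresh behaviors can be sketched with plain dictionaries that map a document ID to its content. This is a simplified model of the rules described above, not the Portal Search implementation. All function names and the sample data are hypothetical.

```python
def refresh_website(collection, source):
    """Website source: update existing documents, add new ones,
    and retain documents that no longer exist in the source."""
    refreshed = dict(collection)   # start from everything indexed before
    refreshed.update(source)       # overwrite updated documents, add new ones
    return refreshed


def refresh_portal_site(collection, source):
    """Portal site: the result mirrors the current pages and portlets,
    so items that were removed from the portal are deleted."""
    return dict(source)


def refresh_incremental(collection, changes, deleted):
    """Seedlist-based source: apply only the added or updated entries
    and the deletions that the seedlist explicitly lists."""
    refreshed = dict(collection)
    refreshed.update(changes)
    for doc_id in deleted:
        refreshed.pop(doc_id, None)
    return refreshed


# Hypothetical state: "b" was removed from the source, "a" changed, "c" is new.
indexed = {"a": "v1", "b": "v1"}
current = {"a": "v2", "c": "v1"}

website_result = refresh_website(indexed, current)
portal_result = refresh_portal_site(indexed, current)
incremental_result = refresh_incremental(indexed, {"a": "v2", "c": "v1"}, ["b"])

print(website_result)
print(portal_result)
print(incremental_result)
```

In this model the website refresh still contains the removed document "b", while the portal site refresh and the incremental refresh both end up without it. The incremental refresh reaches that state by processing only the reported changes instead of the whole source.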

Defining a dedicated crawler user ID

It is beneficial to define a dedicated crawler user ID. The pre-configured default portal site search uses the default administrator user ID wpsadmin with the default password of that user ID for the crawler. If you changed the default administrator user ID during your portal installation, the crawler uses that user ID instead. If you later changed the user ID or password of the administrative user and still want to use it for the Portal Search crawler, you need to adapt the crawler settings accordingly.

To define a crawler user ID, select the Security tab, and update the user ID and password. Click Save.

Changing the source scope

If you modify a content source that belongs to a search scope, update the scope manually to make sure that it still covers that content source. If you changed the name of the content source, edit the scope and make sure that the content source is still listed there. If not, you must add it again.