Manage Search

Use the Manage Search portlet to administer portal search.

To manage Portal Search, click the Administration menu icon in the toolbar. Then, click Portal User Interface > Manage Search from the portal menu. The portal displays the administration portlet Manage Search.

Note: This portlet help gives instructions for using the Manage Search portlet only. For more information about search services, collections and scopes, planning considerations and configuring search, see Portal Search in the product documentation.

Search Services

From Search Services, you can view and manage the HCL Digital Experience search services. Search Services represent separate instances of the search engine that is provided and can be used for searching content by using the Search Center. When you create a search collection, you must select a search service so that users can request searches on that collection. A search service can be used for searching multiple search collections. You can set parameters to configure a search service, and you can set up separate instances of search services with different configurations. You can also set up multiple portal search services and distribute the search load over several nodes. The following Search Service is provided by default:
Portal Search Service
Select the Portal Search Service to manage search collections that contain portal pages, content that is managed by HCL Web Content Manager, or indexed web pages. In a clustered environment, you need to set up a remote search service.
Note: The HTTP crawler of the Portal Search Service does not support JavaScript. Text that is generated by JavaScript might not be available for search.

You can also create more custom search services and add them to your portal.

Creating a search service
To create a new search service, click New Search Service. Manage Search displays the New Search Service page. Specify a Service name that is unique within the current portal or virtual portal.

Search Collections

From Search Collections, you can view and manage the search collections and their content sources in the portal. You can build and maintain search collections of web content, Web Content Manager content, and portal content. Users can then search these collections by using the portal Search Center.

A search collection can have one or more content sources such as web pages, Web Content Manager content, or portal pages and portlets. The portal default search collection combines two content sources and their related crawlers:

Portal Content Source
The Portal Content Source contains the local portal site, where users can search for portal pages and portlets.
Web Content Manager Content Source
The Web Content Manager Content Source allows users to search for web content.

During the search collection build process, content is retrieved for indexing through a crawler (robot) from the content sources. The search collection stores keywords and metadata and maps them to their original source, which allows fast processing of requests from the Search Center portlet.

Resources can be stored on the local portal server or on remote content sources for searching. Content can be processed by the crawlers, if it is accessible through the HTTP protocol. For example, this content can be from portal pages, Web Content Manager, and documents that are hosted by web servers. The documents can be of different types, for example, editable text files, office suite documents, such as Microsoft and OpenOffice, or PDF files.

Managing Search Collections

From the Search Collections panel, use the following options or icons to run tasks on search collections:
  • Select Refresh to update the information and the available option icons for the collections. Examples:
    • If a crawl is running or was completed, the number of documents is updated.
    • If a crawl was completed on a collection since the last refresh, option icons can appear, such as Search and Browse the Collection.
    • If another administrator updated search collections, the information is refreshed.
  • From the Search Collections page, you can import and export search collections. You can also view the status of the search collection and manage the content sources by clicking the search collection name.
    Note: The icons for some tasks are only available if the current user can do the specific task on the search collection.

Creating a search collection

Some of the following entry fields and options are available when you create a search collection:

Note: The parameters that you select when you create the search collection cannot be changed later. Therefore, plan ahead to carefully create a new search collection. If you want to change the parameters, you must start over by creating a new search collection with new parameters. You can then export the data from the old collection and import it into the new collection. For information, refer to Exporting a search collection and Importing a search collection.
Location of Collection
Use this entry field to type the directory path where you want the new search collection to be created and the related data to be saved. You can insert a full path or a path relative to the Collections Locations search service parameter. The search collection is created in the following location:
  • If you type a name of your choice, the location for the new search collection is combined from the default directory and the name. Example: If you type my_collection_location, the new search collection is created under the directory wp_root/collections/my_collection_location. For details about the default directory for search collections and how you configure it, refer to Configuring the default location for search collection in product documentation under Portal Search.
  • If you type the full directory path, the location for the new search collection is different from the default search collection location. The new search collection is created under the directory location that you specify.
Name of Collection
Use this entry field to name the new search collection. If you do not enter a name, the location that you entered in the previous field is used for the search collection.
Specify Collection Language
Use this menu to select a language for the search collection. The search collection and its index are optimized for the language. This feature enhances the quality of search results for users, as it allows them to use spelling variants, including plurals and inflections, for the search keyword. Search uses this language for indexing if no language is defined for the document. Select one of the Unspecified options to index documents without any stemming of the words.
Note: This setting is not overwritten when you import a search collection, for example, during the migration of a search collection. If you create the search collection for migrating an existing collection, choose the selection to match the source collection.
Select Summarizer
Use this menu to select a summarizer for the search collection. Choose from the following options:
  • Choose None if you do not want a summary to be generated. If you select this option, the Search Center uses the description metadata from the document, if one is available.
  • Choose Automatic if you want a summary to be generated automatically.
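
The rules described under Location of Collection and Name of Collection can be illustrated with a minimal Python sketch. It is not the portlet's implementation: the function names are hypothetical, and the default directory wp_root/collections is taken from the example above rather than read from the search service configuration.

```python
from pathlib import PurePosixPath

# Assumed default directory, taken from the example in this help topic.
# The real value comes from the search service configuration.
DEFAULT_COLLECTIONS_DIR = "wp_root/collections"


def resolve_collection_location(entry: str,
                                default_dir: str = DEFAULT_COLLECTIONS_DIR) -> str:
    """Return the directory in which the collection would be created."""
    path = PurePosixPath(entry)
    if path.is_absolute():
        # A full directory path is used exactly as typed.
        return str(path)
    # A plain name or relative path is combined with the default directory.
    return str(PurePosixPath(default_dir) / path)


def effective_collection_name(name: str, location: str) -> str:
    """An empty name falls back to the value typed as the location."""
    return name.strip() or location


print(resolve_collection_location("my_collection_location"))
print(resolve_collection_location("/opt/search/my_collection"))
print(effective_collection_name("", "my_collection_location"))
```

The path /opt/search/my_collection is only an illustrative value.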

Viewing the status of a search collection

To view the status of the search collection, click the collection name in the list of search collections. Manage Search shows the Content Source Name and the Search collection status information of the selected search collection. The status fields show data that changes over the lifetime of the search collection. Some of the displayed data is as follows:
Last update completed:
Shows the date when a content source was last updated by a scheduled crawl and indexed.
Note: The timeout that you might set under Stop collecting after (minutes): is an approximation. It might be exceeded by some percentage, as indexing the documents after the crawl takes more time.

If you have a faulty search collection in your portal, the portlet shows a link to that faulty collection.

Migrating search collections

When you upgrade to a higher version of HCL, the data storage format is not necessarily compatible with the older version. To prevent loss of data, export all data of search collections to XML files before you upgrade. After the upgrade, you create a search collection and use the previously exported data to import the search collection data back into your upgraded portal.

Notes:
  1. If you do not complete these steps, the search collections are lost after you upgrade.
  2. When you create the search collection on the upgraded portal, type data and make selections as follows:
    • Enter the location, name, and description of the new collection. You can match the old settings or type new ones.
    • You do not need to select a summarizer. These settings are overwritten when you import the data from the source search collection.
  3. You cannot migrate a portal site collection between different versions of HCL. If you upgrade to another version, you need to re-create the portal site collection. Proceed as follows:
    1. Document the configuration data of your portal site content source.
    2. Delete the existing content source.
    3. Upgrade your portal.
    4. On the upgraded portal, create a new portal site content source. Use the documented configuration data.
    5. Run the new portal content source.

Portlets that were crawled before the upgrade, but do not exist in the upgraded portal, are not returned by a search.

For more information about these tasks, see the topics about migrating, importing, and exporting search collections in the product documentation.

For details about how to export and import search collections, refer to Exporting a search collection and Importing a search collection.

Exporting a search collection

To export a search collection and its data, proceed as follows:
  1. Before you export a collection, make sure that the user who is running the portal application process has write access to the target directory location. Otherwise, you might get an error message, such as File not found.
  2. Make sure that the target directory is empty or contains no files that you still need, as the export can overwrite files in that directory.
  3. Locate the search collection that you want to export.
  4. Click the Import or Export Collection icon next to the search collection in the list. Manage Search displays the Import and Export Search Collection panel.
  5. In the entry field Specify Location (full path with XML extension), type the full directory path and XML file name to which you want to export the search collection and its data. Document the names of the collections and the directory locations and target file names to which you export the collections for the import that follows.
    Note: When you specify the target directory location for the export, be aware that the export can overwrite files in that directory.
  6. Click Export to export the search collection data. Manage Search writes the complete search collection data to an XML file and stores it in the directory location that you specified. You can use this file later as the source of an import operation to import the search collection into another portal.
  7. To return to the previous panel without exporting the search collection, click the appropriate link in the breadcrumb trail.
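
The checks in steps 1, 2, and 5 can be sketched as a small pre-flight script. This is an illustrative helper only, not part of the product: the function name and the messages are hypothetical, and the write-access test is meaningful only when the script runs as the same operating system user as the portal application process.

```python
import os
import tempfile
from pathlib import Path


def check_export_target(target: str) -> list:
    """Return a list of problems found with an export target path."""
    problems = []
    path = Path(target)
    if not path.is_absolute():
        problems.append("path is not a full path")
    if path.suffix.lower() != ".xml":
        problems.append("file name does not end in .xml")
    parent = path.parent
    if not parent.is_dir():
        problems.append("target directory does not exist")
    elif not os.access(parent, os.W_OK):
        problems.append("target directory is not writable by this user")
    if path.exists():
        problems.append("file already exists and would be overwritten")
    return problems


# Usage with a temporary directory standing in for the export location.
with tempfile.TemporaryDirectory() as export_dir:
    good = os.path.join(export_dir, "my_collection.xml")
    bad = os.path.join(export_dir, "my_collection.txt")
    print(check_export_target(good))
    print(check_export_target(bad))
```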

Importing a search collection

To import the data of a search collection, proceed as follows:
  1. Before you can import the collection data, you need to create the empty shell for the search collection. To create the empty shell, follow the steps in Creating a search collection. You need to complete only the mandatory entry field Location of Collection. Do not add content sources or documents, as they are added by the import.
  2. On the search collection list, locate the search collection into which you want to import the search collection data.
  3. Click the Import or Export icon next to the search collection in the list. Manage Search displays the Import and Export Search Collection panel.
  4. In the entry field Specify Location (full path with XML extension), type the full directory path and XML file name of the search collection data that you want to import into the selected search collection.
  5. Click Import to import the complete search collection data from the specified XML file into the selected search collection.
  6. To return to the previous panel without importing a search collection, click the appropriate link in the breadcrumb trail.
  7. If required, you can now add content sources and documents to the search collection.
Note: When you import a collection, be aware of the following information:
  1. Import collection data only into an empty collection. Do not import collection data into a target collection that contains content sources or documents.
  2. When you import collection data into a collection, the settings of the target collection are overwritten by any settings that are contained in the imported data. For example, the language setting is overwritten, or a summarizer is added, if it was specified for the imported search collection.
  3. When you import a collection, a background process fetches, crawls, and indexes all documents that are listed by URL in the previously exported file. This process is asynchronous and can take considerable time until the documents become available.
  4. When you import a collection that contains a portal site content source that was created in a previous version, you need to complete the following actions:
    • Delete the existing site content source.
    • Create a site content source.
    • Start a crawl to regather the content.

Refreshing collection data

Refreshing the data of a search collection updates that collection by renewed crawling of all the content sources that are associated with it. To refresh a search collection, click Regather documents from Content Source. Manage Search then runs complete new crawls over all content sources of the collection. To verify progress and completion of the regathering, click the collection and view the Collection Status information.
Note: This action might require a considerable amount of system resources, as all content sources of the search collection are crawled at the same time.

Deleting a search collection

Note: If you delete the search collection before an upgrade to a higher version of HCL, make sure you export the search collection for later import before you delete it. For details, refer to Migrating search collections.

Managing the content sources of a search collection

To work with the content sources of a search collection, click the collection name in the list of search collections. Manage Search lists the Content Sources and the Search collection status information of the selected search collection. A search collection can be configured to cover more than one content source.

From the Content Sources panel, you can do the following tasks:
  • Click Refresh to refresh the status information. While a crawl on the content source is running, this option updates the information about the run time and the documents collected so far.
  • View the status information for the content source:
    Documents
    The number of documents in the content source. If you click Refresh during a crawl, this action shows how many documents the crawler fetched so far.
    Run Time
    The run time of the last crawl on the content source. If you click Refresh during a crawl, this field shows how much time the crawler used so far.
    Last Run
    The date and time when the last crawl of the content source started.
    Next Run
    The date and time of the next crawl of the content source, if one is scheduled.
    Status
    The Status of the content source, that is, whether the content source is Idle or a crawl is Running.
  • Select one of the icons for a specific content source and do one of the following tasks:
    • View Content Source Schedulers. This icon is displayed only if you defined scheduled crawls for this content source. If you click this icon, the portlet lists the scheduled crawls, together with the following information:
      • Start Date
      • Start Time
      • Repeat Interval
      • Next Run Date
      • Next Run Time
      • Status, an option that can be disabled or enabled.
    • Start Crawler. Click this icon to start a crawl on the content source. This action updates the contents of the content source by a new run of the crawler. During the run, the icon changes to Stop Crawler, which you can click to end the run. For details, refer to Starting to collect documents from a content source. Portal Search refreshes different content sources as follows:
      • For website content sources, documents that were indexed before and still exist in the content source are updated. Documents that were indexed before, but no longer exist are retained in the search collection. Documents that are new in the content source are indexed and added to the collection.
      • For HCL Portal sites, the crawl adds all pages and portlets to the content source. It deletes portlets and static pages from the content source that were removed from the portal. The crawl works similarly to the option Regather documents from Content Source.
      • For HCL Web Content Manager sites, Portal Search uses an incremental crawling method. In addition to added and updated content, the seedlist explicitly specifies deleted content. In contrast, clicking Regather documents from Content Source starts a full crawl; it does not continue from the last session, and it is therefore not incremental.
      • For content sources created with the seedlist provider option, a crawl on a remote system that supports incremental crawling, such as HCL Connections, behaves like a crawl on a Web Content Manager site.
    • Regather documents from Content Source. This option deletes existing documents in the content source from previous crawls. Then, it starts a full crawl on the content source. Documents that were indexed before and still exist in the content source are updated. Documents that were indexed before, but no longer exist in the content source are removed from the collection. Documents that are new in the content source are indexed and added to the collection.
    • Notes:
      • It is beneficial to define a dedicated crawler user ID. The pre-configured default portal site search uses the default administrator user ID wpsadmin with the default password of that user ID for the crawler. If you changed the default administrator user ID during your portal installation, the crawler uses that default user ID. If you changed the user ID or password for the administrative user ID and still want to use it for the Portal Search crawler, you need to adapt the settings.

        To define a crawler user ID, select the Security tab, and update the user ID and password. Click Save.

      • If you modify a content source that belongs to a search scope, update the scope manually to make sure that it still covers that content source. Especially if you changed the name of the content source, edit the scope and make sure that it is still listed there. If not, add it again.
      • If you delete a content source, the documents that were collected remain available for search by users under all scopes that included the content source before it was deleted. These documents are available until the expiration time. Under General Parameters, you can specify this expiration time from the Links expire after (days): menu.

New content source

When you create a new content source for a search collection, that content source is crawled and the search collection is populated with documents from that content source. You can determine where the crawler crawls and what information it fetches. Select an option from the Content source type menu. Entry fields and parameters that you can specify are as follows:

Web site
Select this option for all remote sites, which includes websites and remote portal sites. Only anonymous pages can be indexed and searched on remote portal sites.
Seedlist provider
Select this option if the crawler uses a seedlist as the content source for the collection.
Portal site
Select this option if the content source is your local portal site.
WCM (Managed Web Content) site
To make a content source of this type available to Portal Search, you need to create it in the Web Content Manager Authoring portlet. You select the appropriate option to make it available for search and specify the search collection to which it belongs. When you complete creating the Managed Web Content site, it is listed among the content sources for the search collection that you specified.
Your selection determines some of the entry fields and options that are available for creating the content source. For example, the option Obey robots.txt under the tab Advanced Parameters is available only if you select Web site as the content source type.

Setting the general parameters for a content source

Set the general parameters for the content source by completing the entry fields and making your selections in the Create a New Content Source box. The available fields and options depend on the type of content source that you select and are listed as follows:
  • Type the required web URL or portal URL in the mandatory Collect documents linked from this URL field. This action determines the root URL from which the crawler starts. For portal content sources, the value for this field is completed by Manage Search.
    Notes:
    • For websites, you need to type the full name that includes http://. For example: http://www.cnn.com. Typing only www.cnn.com results in an error.
    • A crawler failure can be caused by URL redirection problems. If this problem occurs, try editing this field, for example, by changing the URL to the redirected URL.
  • Make your selection from the following options by selecting from the lists. The available fields and options differ, depending on the type of content source that you selected.
    Levels of links to follow:
    For crawling websites, this option determines the crawling depth, that is, the maximum number of levels of nested links that the crawler follows from the root URL.
    Number of linked documents to collect:
    For crawling websites, this option determines the maximum number of documents that are indexed by the crawler during each session. The number of indexed documents includes documents that are reindexed as their content changed.
    Stop collecting after (min):
    This option sets the maximum number of minutes the crawler might run in a single session for websites.
    Note: The timeout works as an approximate time limit. It might be exceeded by some percentage.
    Stop fetching document after (sec):
    This option indicates the time that the crawler spends trying to fetch a document. It sets the maximum time limit in seconds for completing the initial phase of the HTTP connection, that is, for receiving the HTTP headers. This time limit must be finite, as it prevents the crawler from getting stuck indefinitely on a bad connection. However, it still allows the crawler to fetch large files, such as compressed files, which take a long time to fetch.
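
How the root URL, Levels of links to follow, and Number of linked documents to collect interact can be sketched with a small breadth-first walk over an in-memory link graph. This is a simplified model, not the Portal Search crawler: the LINKS graph and the function name are hypothetical, and the sketch assumes that the root URL counts as level 0 and as one of the collected documents.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph standing in for a website: URL -> linked URLs.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a/1"],
    "http://example.com/b": [],
    "http://example.com/a/1": ["http://example.com/a/1/x"],
    "http://example.com/a/1/x": [],
}


def crawl(root: str, max_levels: int, max_documents: int) -> list:
    """Breadth-first walk from root, bounded by link depth and document count."""
    if urlsplit(root).scheme not in ("http", "https"):
        raise ValueError("root URL must include the scheme, for example http://")
    collected = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(collected) < max_documents:
        url, level = queue.popleft()
        collected.append(url)
        if level == max_levels:
            continue  # do not follow links beyond the configured depth
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return collected


print(crawl("http://example.com/", max_levels=1, max_documents=10))
print(crawl("http://example.com/", max_levels=3, max_documents=2))
```

Calling crawl("www.example.com", 1, 10) raises ValueError in this sketch, which mirrors the note that a root URL without http:// results in an error.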

Setting the advanced parameters for a content source

When you create a new content source, click the Advanced Parameters tab and select from the following options, check the boxes, or enter data as follows:
Number of parallel processes:
This parameter determines the number of threads the crawler uses in a session.
Default character encoding:
This parameter sets the default character set that the crawler uses if it cannot determine the character set of a document.
Note: The entry field for the Default character encoding contains the initial default value windows-1252, regardless of the setting for the Default Portal Language under Administration menu > Portal Settings > Global Settings. Enter the required default character encoding, depending on your portal language. Otherwise, documents might be displayed incorrectly from Browse Documents.
Always use default character encoding:
If you check this option, the crawler always uses the default character set, regardless of the document character set. If you do not check this option, the crawler tries to determine the character sets of the documents.
Obey Robots.txt
If you select this option, the crawler observes the restrictions that are specified in the file robots.txt when it accesses URLs for documents. This option is only available for the website content source type, not for the Portal site or seedlist provider.
Proxy server:
If you leave this HTTP proxy server value empty, the crawler does not use a proxy server.
Port:
If you leave this Port value empty, the crawler does not use a proxy server.
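
The effect of Default character encoding and Always use default character encoding can be sketched as follows. This is an assumption-based illustration, not the crawler's code: the function and parameter names are hypothetical, and detected_charset stands for whatever character set the crawler was able to determine for a document, or None if it could not determine one.

```python
from typing import Optional


def decode_document(raw: bytes,
                    detected_charset: Optional[str],
                    default_charset: str = "windows-1252",
                    always_use_default: bool = False) -> str:
    """Decode document bytes the way the two encoding options describe."""
    if always_use_default or detected_charset is None:
        charset = default_charset
    else:
        charset = detected_charset
    # Undecodable bytes are replaced instead of raising an error.
    return raw.decode(charset, errors="replace")


utf8_bytes = "café".encode("utf-8")
print(decode_document(utf8_bytes, "utf-8"))
print(decode_document(b"caf\xe9", None))
print(decode_document(utf8_bytes, "utf-8", always_use_default=True))
```

The last call shows why a default that does not match the documents can lead to incorrectly displayed text: UTF-8 bytes that are decoded as windows-1252 produce garbled characters.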

Configuring the Scheduler

To configure the schedule, click the Scheduler tab to display the following options:
Define Schedule
Add new schedule from this box.
Scheduled Updates
This box shows when scheduled crawls are done.
Note: The time interval between the crawler runs must be more than the maximum crawler execution time. A crawler cannot be started if it is running. If a crawler job is started while the crawler is running, this execution is ignored and the crawler is only started at the next scheduled time.
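
The scheduling rule in the note can be modeled with a short simulation. It is a sketch under simplifying assumptions (a fixed repeat interval and a fixed crawl duration, both in minutes), and the function name is hypothetical.

```python
def scheduled_starts(interval: int, crawl_duration: int, triggers: int) -> list:
    """Return the trigger times (in minutes) at which a crawl actually starts."""
    starts = []
    busy_until = 0
    for n in range(triggers):
        trigger_time = n * interval
        if trigger_time < busy_until:
            continue  # crawler still running: this trigger is ignored
        starts.append(trigger_time)
        busy_until = trigger_time + crawl_duration
    return starts


# Interval shorter than the crawl duration: every second trigger is skipped.
print(scheduled_starts(interval=60, crawl_duration=90, triggers=5))
# Interval longer than the crawl duration: every trigger starts a crawl.
print(scheduled_starts(interval=60, crawl_duration=45, triggers=4))
```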

Configuring the Filters

The crawler filters control the crawler progress and the type of documents that are indexed and cataloged. To configure filters, click the Filters tab. You can define new filters in the Define Filter Rules box. The defined filters are listed in the Filtering Rules box.

Crawler filters are divided into the following two types:
URL filters
These filters control which documents are crawled and indexed, based on the URL where the documents are found.
Type filters
These filters control which documents are crawled and indexed, based on the document type.

If you define no filters, all documents from a content source are fetched and crawled. If you define include filters, only those documents that pass the include filters are crawled and indexed. If you define exclude filters, they override the include filters. If you define exclude filters but no include filters, the exclude filters limit the number of documents that are crawled and indexed. More specifically, if a document passes one of the include filters, but also passes one of the exclude filters, it is not crawled, indexed, or cataloged.
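
The precedence of include and exclude filters can be summarized in a few lines of Python. This is a sketch of the decision logic only. It uses shell-style wildcard matching from the standard fnmatch module as a stand-in, which is an assumption: the actual filter syntax and matching rules of Portal Search might differ.

```python
from fnmatch import fnmatchcase


def is_indexed(url: str, include: list, exclude: list) -> bool:
    """Decide whether a document passes the configured filter rules."""
    # Exclude filters always win, even if an include filter also matches.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    # With no include filters, everything that is not excluded passes.
    if not include:
        return True
    # Otherwise the document must pass at least one include filter.
    return any(fnmatchcase(url, pattern) for pattern in include)


print(is_indexed("http://example.com/docs/a.html", [], []))
print(is_indexed("http://example.com/news/a.html", ["*/docs/*"], []))
print(is_indexed("http://example.com/docs/a.pdf", ["*/docs/*"], ["*.pdf"]))
```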

You can do the following tasks with the Filters box:
Creating a filter
When you use the option Apply rule while: Collecting documents with Rule type: Include, make sure that the URL in the field Collect documents linked from this URL: fits the specified rule; otherwise, no documents are collected. For instance, crawling the URL http://www.ibm.com/products with the URL filter */products/* does not give any results because the rule has a trailing slash, but the URL does not. However, crawling http://www.ibm.com/products/ with the URL filter */products/* (both with a trailing slash) works, as does crawling http://www.ibm.com/products with the URL filter */products* (no trailing slash).
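
The trailing-slash behavior can be reproduced with shell-style wildcard matching. The following sketch uses Python's fnmatch module as an approximation; it is an assumption that the portal's URL filters behave the same way for these three cases.

```python
from fnmatch import fnmatchcase

# (root URL, URL filter) pairs from the example above.
cases = [
    ("http://www.ibm.com/products", "*/products/*"),
    ("http://www.ibm.com/products/", "*/products/*"),
    ("http://www.ibm.com/products", "*/products*"),
]

results = [fnmatchcase(url, rule) for url, rule in cases]
for (url, rule), matched in zip(cases, results):
    print(f"{url} with filter {rule}: {'match' if matched else 'no match'}")
```

Only the first pair fails to match, because the rule requires a slash after products that the URL does not contain.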

Configuring security for a content source

You can configure the security for indexing secured content sources and repositories that require authentication. Click the Security tab to display the following two boxes:
Define security realm
This box is used to add new secured content sources.
Security realms
This box displays a list of existing security realms.
In the Define security realm box, enter the following data entry fields:
User Name
Enter the user ID with which the crawler can access the secured content source or repository.
Password
Enter the password for the user ID that you entered for the user name.
Host name
Enter the name of the server. For Portal sites and seedlist providers, this entry is not required. If you leave it blank, the host name is inferred from the provided root URL.
Realm
Enter the realm of the secured content source or repository.
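
The fallback for an empty Host name field can be pictured as follows. This is a hypothetical sketch of the described behavior, not the product's code; the function name and the example URL are assumptions.

```python
from urllib.parse import urlsplit


def realm_host(host_name: str, root_url: str) -> str:
    """Use the typed host name, or infer it from the root URL if left blank."""
    return host_name.strip() or (urlsplit(root_url).hostname or "")


print(realm_host("", "http://portal.example.com:10039/wps/portal"))
print(realm_host("files.example.com", "http://portal.example.com:10039/wps/portal"))
```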

Starting to collect documents from a content source

To start an update from a content source manually, proceed as follows:
  1. Click Start Crawler for the content source for which you want to start a new run of the crawler. The crawler fetches the documents from the selected content source. If they are new or modified, they are updated in the search collection. While a crawl is running, the icon changes to Stop Crawler, which you can click to stop the crawl. Portal Search refreshes different content sources as follows:
    • For website content sources, documents that were indexed before and still exist in the content source are updated. Documents that were indexed before, but no longer exist are retained in the search collection. Documents that are new in the content source are indexed and added to the collection.
    • For HCL Portal sites, the crawl adds all pages and portlets to the content source. It deletes portlets and static pages from the content source that were removed from the portal. The crawl works similarly to the option Regather documents from Content Source.
    • For HCL Web Content Manager sites, Portal Search uses an incremental crawling method. In addition to added and updated content, the seedlist explicitly specifies deleted content. In contrast, clicking Regather documents from Content Source starts a full crawl; it does not continue from the last session, and it is therefore not incremental.
    • For content sources created with the seedlist provider option, a crawl on a remote system that supports incremental crawling, such as HCL Connections, behaves like a crawl on a Web Content Manager site.
  2. To view the updated status information about the progress of the crawl process, click Refresh. The following status information is updated:
    Documents
    Shows how many documents the crawler fetched so far from the selected content source.
    Run time
    Shows how much time the crawler used so far to crawl the content source.
    Status
    Shows whether the crawler for the content source is running or idle.
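
The difference between Start Crawler on a website content source and Regather documents from Content Source, as described in step 1, can be sketched with sets of document URLs. This is a deliberately simplified model with hypothetical function names; it ignores updates to document content and tracks only which documents remain in the collection.

```python
def website_crawl(index: set, source: set) -> set:
    """Start Crawler on a website: add and update, but keep vanished documents."""
    return index | source


def regather(index: set, source: set) -> set:
    """Regather: drop documents from previous crawls, then run a full crawl."""
    return set(source)


indexed_before = {"http://example.com/a", "http://example.com/b"}
on_site_now = {"http://example.com/b", "http://example.com/c"}

print(sorted(website_crawl(indexed_before, on_site_now)))
print(sorted(regather(indexed_before, on_site_now)))
```

In this model, http://example.com/a no longer exists on the site. It stays in the collection after a website crawl and is removed only by the regather.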

Verifying the address of a content source

To verify the URL address of a content source, locate the content source and click the Verify Address icon.

If the web content source is available and not blocked by a robots.txt file, Manage Search returns the message Content Source is OK. If the content source is invalid, inaccessible, or blocked, Manage Search returns an error message.

When you create a new content source, Manage Search starts the Verify Address feature.
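
Whether a URL is blocked by a robots.txt file can also be checked outside the portal. The following sketch uses Python's standard urllib.robotparser module with an inline robots.txt; it is not the Verify Address implementation, and the file content, URLs, and function name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content of the site to be crawled.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(url: str, robots_txt: str = ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed_by_robots("http://example.com/public/page.html"))
print(allowed_by_robots("http://example.com/private/page.html"))
```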

Search Scopes and Custom Links

From Search Scopes, you can view and manage search scopes and custom links. The search scopes are displayed to users as search options in the drop-down list of the search box in the banner and in the Search Center portlet. Users can select the scope relevant for their search queries. You can define a scope by one or both of the following criteria:
  • One or more search locations, or content sources.
  • Document features or characteristics, such as the document type.
HCL includes these scopes:
All Sources
This scope includes documents with all features from all content sources in the search.
Managed Web Content
This scope restricts the search to sites that were created by Web Content Manager.

You can add your own custom search scopes and an icon for each scope. Your icons are placed in the list of scopes.

You can also add new custom links to search locations. Custom links can include links to external web locations, such as Google or Yahoo. The Search Center global search lists the custom links in the selection menu of search options.

Managing Search Scopes and Custom Links

From the Search Scopes and Custom Links panel, select the following options or icons to complete tasks on search scopes and custom links:
New Scope
Click this option to create a new search scope. For details, refer to Creating a search scope.
Refresh
Click this option to update the information for the scopes, for example, the status of scopes, or updates that another administrator made to scopes.
Move Down and Move Up arrows
Click these arrows to move search scopes up and down in the list. This action determines the sequence by which the scopes are listed in the menu from which users select search options for searches with the Search Center portlet.
Edit Search Scope
Click this icon to work with or modify a search scope. For details, refer to Editing a search scope.
Delete Search Scope
Click this icon to delete a search scope.
New Custom Link
Click this option to add a new custom link. For details, refer to Adding a new custom link.
Edit Custom Link
Click this icon to work with or modify a custom link.
Delete Custom Link
Click this icon to delete a custom link.
Note: You must clear the browser cache to see changes, for example, a new scope, or the new default scope that is displayed in the correct position.

Creating a search scope

To create a new search scope, click New Scope to display the New Search Scope page. Enter the required data in the fields and select from the available options:
Scope Name
A mandatory field where you enter a name for the new search scope. The name must be unique within the current portal or virtual portal.
Description
An optional field where you can describe the scope.
Custom Icon URL:
Enter the URL location where the portal can locate the scope icon that you want to be displayed with the search options. If the icon file exists in the default icon directory wps/images/icons, you need to type only the icon file name. If the icon file is in a different directory path, type the absolute file path with the file name. Click Check icon path to ensure that the icon is available at the URL you specified.
Status:
Set the status of the search scope as you require. To make the scope available to users, set the status to Active.
Visible to anonymous users:
Select Yes to make the search scope available to users who use your portal without logging in. Select No to make the scope available to authenticated users only.
Query text (optional):
Enter a query text that is invisibly appended to all searches in this scope. Searches return results that match both the user search and the query text that you enter in this field. Both sets of results are weighted with the same relevance in the result list. The query text that you enter must conform to the syntax rules of entering a query in the Search Center. For more information about these query syntax rules, see the Search Center portlet help.
Select Locations
Select the location as required. Only documents from these search locations or content sources are searched when users select this scope for their search.
Note: The location tree also shows content sources that are deleted if they still contain documents in the collection. After a deleted content source has no documents, the cleanup daemon removes it from the location tree.

To set names and descriptions in other locales for the search scope, you must create and save the scope first. Then, locate the scope on the scopes list and edit the scope by clicking the Edit icon. The option for setting names and descriptions in other locales is available only on the Edit Search Scope page.

Note: If you modify a content source that belongs to a search scope, update the scope manually to make sure that the scope still covers that content source. Especially if you changed the name of the content source, edit the scope and make sure that it is still listed there. If not, add it again.