To speed up the indexing process, you can use a SearchService
command that extracts file content in a process that is separate from
indexing.
Before you begin
To use SearchService administrative commands,
you must use the IBM® WebSphere® Application Server
wsadmin client. See Starting
the wsadmin client for details.
About this task
The SearchService.startBackgroundFileContentExtraction command extracts
file content outside of the indexing process. This command iterates over the persisted files
seedlists and, for each file, extracts the file content according to the specified configuration
settings. This process is multithreaded, and is the same file content extraction process that occurs
when you run the startBackgroundIndex command.
Procedure
To extract file content outside of the indexing process,
complete the following steps.
- Start the wsadmin client from one of the following directories on the system on which you
installed the Deployment Manager:
Linux: app_server_root/profiles/dm_profile_root/bin
Windows: app_server_root\profiles\dm_profile_root\bin
where
app_server_root is the WebSphere
Application Server installation directory and
dm_profile_root is the Deployment
Manager profile directory, typically dmgr01.
You must start the client from this directory or
subsequent commands that you enter do not execute correctly.
- After the wsadmin command environment
has initialized, enter the following command to initialize the Search
environment and start the Search script interpreter:
execfile("searchAdmin.py")
If prompted to specify a service to connect to, type 1 to pick
the first node in the list. Most commands can run on any node. If
the command writes or reads information to or from a file using a
local file path, you must pick the node where the file is stored.
When the command runs successfully, the following message is displayed:
Search Administration initialized
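- Use the following command:
SearchService.startBackgroundFileContentExtraction(persistence dir, components, extracted text dir, thread limit)
Extracts file content for all files that are referenced in the persisted seedlists in a process
that is independent of the indexing task.
This command takes the following parameters: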
- persistence dir
- A string that specifies the location of the persisted files seedlists.
- components
- A string that specifies the application or applications for which
you want to extract file content. The following values are valid:
- files - extracts file content from the Files app.
- wikis - extracts file content from the Wikis app.
- activities - extracts file content from the Activities app.
- forums - extracts file content from the Forums app.
- ecm_files - extracts file content from community library files that are stored in Enterprise
Content Management systems.
- extracted text dir
- A string that specifies the target location for the extracted
text. This directory uses the same directory structure and naming scheme
as the extracted text directory on the deployment: connections
shared data/ExtractedText. For example, ExtractedText/121/31/36cdb7a0-92b2-4cf9-91f3-c4e7e527a5e1.
- thread limit
- The maximum number of seedlist threads.
For example:
SearchService.startBackgroundFileContentExtraction("/bg_index/seedlists", "files", "/bg_index/extractedText", 10)
You typically run this command after you run a startBackgroundCrawl command, so that it
acts on up-to-date seedlists. If no persisted seedlists are available, the behavior is the same
as when you run the startBackgroundCrawl command; that is, the seedlists are crawled and persisted
first.
- Verify that the target extracted text directory is populated
with the extracted file content.
Open some of the extracted
text files in a text editor. You can expect to see the typical format,
for example, some header information followed by the extracted content.
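The verification step can also be scripted instead of opening files one at a time. The following is a minimal sketch, assuming a standalone Python 3 interpreter (it is not run inside wsadmin) and assuming that the extracted text files can be read as UTF-8. The function name summarize_extracted_text and its parameters are hypothetical and are not part of the Search administrative commands. It counts the files under the target directory, lists any that are empty, and returns a short preview of the first few files.

```python
import os


def summarize_extracted_text(root, sample_count=3, preview_chars=200):
    """Count files under root, list empty ones, and preview a few of them.

    Hypothetical helper for illustration; not part of SearchService.
    """
    total = 0
    empty = []
    previews = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            total += 1
            if os.path.getsize(path) == 0:
                empty.append(path)
            elif len(previews) < sample_count:
                # Assumption: UTF-8 text; undecodable bytes are replaced.
                with open(path, "r", encoding="utf-8", errors="replace") as handle:
                    previews.append((path, handle.read(preview_chars)))
    return total, empty, previews


if __name__ == "__main__":
    # Example target directory taken from the command example in this topic.
    total, empty, previews = summarize_extracted_text("/bg_index/extractedText")
    print("files found: %d, empty files: %d" % (total, len(empty)))
    for path, text in previews:
        print("---- " + path)
        print(text)
```

A count of zero, or a large number of empty files, suggests that the extraction did not complete as expected and that the command parameters are worth rechecking.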
What to do next
- Copy the extracted file content to the directory specified by the WebSphere Application Server environment variable EXTRACTED_FILE_STORE.
Storing the extracted file content in this directory means that when the Search application next
detects a file update during indexing, and the update is a metadata change only, Search can avoid
converting the file again unnecessarily. For more information about the EXTRACTED_FILE_STORE
variable, see WebSphere Application Server
environment variables.
- Complete the steps that are outlined in Creating a background index to create a
background index by using the extracted file content.
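The copy into the EXTRACTED_FILE_STORE directory can be done with any file copy tool. As one illustration, the following is a minimal sketch, assuming a standalone Python 3 interpreter and assuming that you have already looked up the directory that EXTRACTED_FILE_STORE resolves to. The script does not read the WebSphere variable itself, so both paths are passed in as arguments. The function name copy_extracted_text is hypothetical. The sketch preserves the relative directory structure and overwrites files that already exist in the target; whether overwriting is appropriate for your deployment is an assumption to confirm.

```python
import os
import shutil
import sys


def copy_extracted_text(source_dir, store_dir):
    """Copy every file under source_dir into store_dir, keeping relative paths.

    Hypothetical helper for illustration; existing target files are overwritten.
    Returns the number of files copied.
    """
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        relative = os.path.relpath(dirpath, source_dir)
        target_dir = os.path.normpath(os.path.join(store_dir, relative))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            shutil.copy2(os.path.join(dirpath, name),
                         os.path.join(target_dir, name))
            copied += 1
    return copied


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python copy_extracted.py <extracted text dir> <EXTRACTED_FILE_STORE dir>
    print("files copied: %d" % copy_extracted_text(sys.argv[1], sys.argv[2]))
```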