Investigating file content search issues

An overview of file content search issues.

Services covered by file content indexing

The following IBM® Connections services are covered for file content indexing:
  • Files in the Files service
  • Wikis attachments
  • Activities attachments
  • Forums attachments
  • CCM files

Indexing schedule

Index scheduling is as follows:
  • Metadata from all files is indexed as part of the regular 10/15 minute indexing schedule.
  • File content extraction is handled by a separate process on its own schedule, the 20-minute file content retrieval scheduled task. Therefore, it might be up to 20 minutes after upload time before a file can be searched by its content.

Supported file types

search-config.xml defines the file types that are handled for file content indexing:
 
<mimeType name="application/msword" processor="" />  
<mimeType name="application/vnd.ms-excel" processor="" />  
<mimeType name="application/vnd.ms-powerpoint" processor="" />  
<mimeType name="application/vnd.visio" processor="" />
<mimeType name="application/vnd.ms-project" processor="" />  
<mimeType name="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" processor="" />  
<mimeType name="application/vnd.openxmlformats-officedocument.presentationml.presentation" processor="" />  
<mimeType name="application/vnd.openxmlformats-officedocument.wordprocessingml.document" processor="" />  
<mimeType name="application/pdf" processor="" /> 
<mimeType name="application/postscript" processor="" />
<mimeType name="application/xhtml+xml" processor="" /> 
<mimeType name="application/xml" processor="" />
<mimeType name="text/html" processor="" />  
<mimeType name="text/htm" processor="" />  
<mimeType name="text/plain" processor="" />  
<mimeType name="text/richtext" processor="" />  
<mimeType name="text/xml" processor="" />  
<mimeType name="application/rtf" processor="" />  
<mimeType name="application/vnd.oasis.opendocument.text" processor="" />  
<mimeType name="application/vnd.oasis.opendocument.spreadsheet" processor="" />  
<mimeType name="application/vnd.oasis.opendocument.presentation" processor="" />  
<mimeType name="application/vnd.oasis.opendocument.text-master" processor="" />  
<mimeType name="application/vnd.lotus-1-2-3" processor="" />  
<mimeType name="application/vnd.lotus-wordpro" processor="" />  
<mimeType name="application/vnd.lotus-freelance" processor="" />

You can disable indexing of any of these file types by removing that entry from search-config.xml.
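
To check which file types still have an entry, the mimeType elements can be read programmatically. The following is a minimal Python sketch, not an IBM-supplied tool. The function names enabled_mime_types and is_content_indexed are hypothetical, and the sample wraps a few entries in a made-up root element because the surrounding structure of search-config.xml is not shown here. Elements are matched by local name so that the sketch also works if the real file declares an XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a few mimeType entries wrapped in a made-up root
# element. The real search-config.xml has more structure around them.
SAMPLE = """<root>
  <mimeType name="application/pdf" processor="" />
  <mimeType name="text/plain" processor="" />
  <mimeType name="application/msword" processor="" />
</root>
"""


def enabled_mime_types(xml_text):
    """Return the set of MIME type names listed in mimeType elements."""
    root = ET.fromstring(xml_text)
    names = set()
    for element in root.iter():
        # Strip any "{namespace}" prefix so namespaced files also match.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "mimeType" and element.get("name"):
            names.add(element.get("name"))
    return names


def is_content_indexed(xml_text, mime_type):
    """True if the given MIME type still has an entry in the configuration."""
    return mime_type in enabled_mime_types(xml_text)


if __name__ == "__main__":
    for name in sorted(enabled_mime_types(SAMPLE)):
        print(name)
```

Running such a check against a copy of the configuration before and after an edit is one way to confirm that only the intended entry was removed.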

General file content indexing switches

You can disable all file content indexing by removing all file type entries from search-config.xml.

You can also temporarily disable file content indexing by disabling the 20-minute file content retrieval scheduled task.

Configuration issues

Some post-installation steps are required to configure the file extraction tools that are used during indexing. Errors that are logged while the indexing process extracts file content indicate a configuration issue. For more information, see Linux: Troubleshooting when files content is not found after searching.

File size cutoff

The maxAttachmentSize setting in search-config.xml is the file size cutoff: the maximum size of a file whose content can be indexed. The content of any file that exceeds the cutoff size is not indexed. By default, this setting is 52 MB.

Limit on indexed text

The search-config.xml limit on indexed text is a configuration setting, separate from the maxAttachmentSize file size cutoff, that limits the amount of extracted text that is indexed for a file. This limit prevents large files from adversely affecting search relevancy by pushing smaller, more relevant files down in the search results. This limit is configurable; the default is 200 KB of extracted text.
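
The combined effect of the file size cutoff and the limit on indexed text can be modeled roughly as follows. This is an illustrative Python sketch under stated assumptions, not the product's implementation: the function name indexable_text is hypothetical, "52 MB" and "200 KB" are read as binary units, the text limit is assumed to be counted in UTF-8 bytes, and text over the limit is assumed to be truncated. None of these details are stated in this topic.

```python
# Defaults taken from this topic; the unit interpretation is an assumption.
DEFAULT_MAX_FILE_BYTES = 52 * 1024 * 1024  # "52 MB" read as binary megabytes
DEFAULT_MAX_TEXT_BYTES = 200 * 1024        # "200 KB" read as binary kilobytes


def indexable_text(file_size_bytes, extracted_text,
                   max_file_bytes=DEFAULT_MAX_FILE_BYTES,
                   max_text_bytes=DEFAULT_MAX_TEXT_BYTES):
    """Return the text that would be indexed, or None if the file is skipped.

    Models two separate limits: a cutoff on the size of the file itself and
    a cap on how much extracted text is kept.
    """
    # Limit 1: a file larger than the cutoff gets no content indexing.
    if file_size_bytes > max_file_bytes:
        return None

    # Limit 2: keep only the first max_text_bytes of the extracted text.
    encoded = extracted_text.encode("utf-8")
    if len(encoded) <= max_text_bytes:
        return extracted_text
    # Cut at the byte limit and drop any partial trailing character.
    return encoded[:max_text_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(indexable_text(60 * 1024 * 1024, "too large"))   # file is skipped
    print(len(indexable_text(1024, "a" * 300_000)))        # text is capped
```

In this model, a search term that appears only near the end of a very long document would not be found, because it falls outside the retained text.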

Unsupported files

The following files are never indexed:
  • Encrypted files
  • Password protected files
  • Corrupted files of any type

Searching file content for accented characters

Searching file content for accented characters works for all the supported file types, except in the case of .txt files that do not have UTF-8 encoding. For example, if a .txt file has ANSI encoding, then any accented characters it contains are not found by a full text search. To resolve this, save the file using UTF-8 encoding and then upload it again.
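
When many .txt files are affected, the check and re-encoding can be scripted before upload. The following Python sketch assumes that "ANSI" means the Windows-1252 code page (cp1252), which is common on Western European Windows systems but must be confirmed for your files. The function names is_utf8, to_utf8, and convert_file are hypothetical helpers, not part of IBM Connections.

```python
from pathlib import Path


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode bytes as UTF-8, leaving valid UTF-8 input unchanged.

    source_encoding is an assumption about the original file; decoding
    raises UnicodeDecodeError if the bytes are not valid in that encoding.
    """
    if is_utf8(data):
        return data
    return data.decode(source_encoding).encode("utf-8")


def convert_file(source: str, target: str, source_encoding: str = "cp1252") -> None:
    """Write a UTF-8 copy of source to target."""
    converted = to_utf8(Path(source).read_bytes(), source_encoding)
    Path(target).write_bytes(converted)


if __name__ == "__main__":
    ansi_bytes = "café".encode("cp1252")
    print(is_utf8(ansi_bytes))           # False
    print(is_utf8(to_utf8(ansi_bytes)))  # True
```

Plain ASCII files are already valid UTF-8, so the sketch leaves them unchanged; only files that contain accented characters in another encoding are rewritten.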