Modifying file attachment indexing

Administrators can configure indexing processes for file attachments at the database and file levels.

When the full-text subsystem processes a database, it needs to answer one question at the database level and two at the document attachment level:
  • Should attachments be indexed for this database?
  • Should the particular attachment under examination be indexed?
  • How will text be retrieved from this particular attachment?

Database-level controls

The following INI values can be set to control attachment indexing for every database, server-wide:

  • FT_INDEX_ATTACHMENTS=1

    Index attachments for every indexed database, even if that option was not chosen by the database manager. Additionally, filtering will never be performed on the attachments, only brute force text-stripping.

  • FT_INDEX_ATTACHMENTS=2

    Never index attachments for any indexed database, even if the database manager chose that option.

  • FT_INDEX_ATTACHMENTS=3

    Index attachments for every indexed database, even if that option was not chosen. The difference from FT_INDEX_ATTACHMENTS=1 is that filtering will be performed on attachments when applicable, and brute force text-stripping will be used based on the brute force list of file extensions.

File-level controls

There are two coarse-grained devices that can be used to control whether a particular attachment is a candidate for indexing or not: the ignore list (enabled by default) and the white list (must be explicitly enabled). Both lists can be extended beyond their defaults and the white list can be entirely substituted if desired.

If an attachment file's extension matches an item in the ignore list, then it will typically not be indexed.

When the white list is enabled, an attachment whose file extension matches an item in the white list will always be indexed. An attachment whose extension matches no item will not be indexed.

If the extensions in the ignore list and white list collide, then the white list takes precedence.

Each list has the following default file extensions:
  • Ignore list

    *.ap, *.au, *.bkf, *.bqy, *.cab, *.cca, *.dbd, *.dll, *.exe, *.gif, *.gz, *.img, *.jar, *.jpg, *.lwp, *.m4p, *.m4v, *.MIF, *.mov, *.mp3, *.mp4, *.mpg, *.msi, *.nsf, *.ntf, *.p7m, *.p7s, *.pag, *.pdb, *.pic, *.png, *.pst, *.rar, *.shw, *.sys, *.tar, *.tif, *.wav, *.wmf, *.wpl, *.wq1, *.z, *.zip

  • White list

    *.123, *.ami, *.as, *.aw, *.dca, *.doc*, *.dwg, *.emf, *.emz, *.fff, *.fft, *.flg, *.fm, *.htm*, *.hwp, *.jar, *.jtd, *.jtt, *.mime, *.oas, *.odp, *.ods, *.odt, *.pdf*, *.ppt*, *.qpw, *.r13, *.r14, *.rtf, *.sam, *.swp, *.vsd*, *.wk4, *.wks, *.wp*, *.wri, *.xlr, *.xls*, *.xml, *.xy*, *.zip

To modify the ignore list, the white list, and other indexing processes, see the following sections:

Extending the ignore list

The ignore list can be expanded to exclude specific types of document attachments in addition to the default types. To do so, set the FT_INDEX_IGNORE_ATTACHMENT_TYPES notes.ini setting to a list of file type extensions, each with a wildcard character (*), separated by commas and containing no space characters. For example:
FT_INDEX_IGNORE_ATTACHMENT_TYPES=*.asf,*.avi,*.bin,*.bmp,*.dat,*.iso,*.mpeg,*.ogg,*.qz,*.rm,*.so,*.swf,*.wmv 
This example results in the following full set of excluded attachments: *.ap, *.asf, *.au, *.avi, *.bin, *.bkf, *.bmp, *.bqy, *.cab, *.cca, *.dat, *.dbd, *.dll, *.exe, *.gif, *.gz, *.img, *.iso, *.jar, *.jpg, *.lwp, *.m4p, *.m4v, *.MIF, *.mov, *.mp3, *.mp4, *.mpeg, *.mpg, *.msi, *.nsf, *.ntf, *.ogg, *.p7m, *.p7s, *.pag, *.pdb, *.pic, *.png, *.pst, *.qz, *.rar, *.rm, *.shw, *.so, *.swf, *.sys, *.tar, *.tif, *.wav, *.wmf, *.wmv, *.wpl, *.wq1, *.z, *.zip
Note: FT_INDEX_IGNORE_ATTACHMENT_TYPES has a 256-character limit. If the list of file types to exclude exceeds this limit, you can continue it in the additional settings FT_INDEX_IGNORE_ATTACHMENT_TYPES2 and FT_INDEX_IGNORE_ATTACHMENT_TYPES3.
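
As a rough sketch of how an over-long list might be continued, the fragment below splits the earlier example across two settings. The split point is purely illustrative (both lines are far below the 256-character limit), and it assumes that the continuation settings accept the same comma-separated syntax as the primary setting, which the note implies but does not state.

```ini
FT_INDEX_IGNORE_ATTACHMENT_TYPES=*.asf,*.avi,*.bin,*.bmp,*.dat,*.iso,*.mpeg
FT_INDEX_IGNORE_ATTACHMENT_TYPES2=*.ogg,*.qz,*.rm,*.so,*.swf,*.wmv
```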

Enabling the white list

The white list has two modes, each enabled by its own notes.ini setting:
  • The FT_USE_ATTACHMENT_WHITE_LIST=1 setting enables the default white list, which contains the default file extensions listed earlier in this document. You can append to this default list as described in Extending the white list.
  • The FT_USE_MY_ATTACHMENT_WHITE_LIST=1 setting discards the default list and exclusively references FT_INDEX_FILTER_ATTACHMENT_TYPES, as documented in Extending the white list.

Extending the white list

The white list can be expanded in a similar fashion to the ignore list. To do so, set the FT_INDEX_FILTER_ATTACHMENT_TYPES notes.ini setting to a list of file type extensions, each with a wildcard character (*), separated by commas and containing no space characters.

Additionally, FT_INDEX_FILTER_ATTACHMENT_TYPES_MAX_MB is a companion setting that enforces an upper limit on the size of files included in the white list. It accepts an integer value representing mebibytes (MiB).
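
For illustration, a notes.ini fragment that enables the default white list, appends two extensions to it, and caps the indexed file size might look like the following. The extensions *.csv and *.md and the 25 MiB cap are hypothetical values chosen for the example, not recommendations.

```ini
FT_USE_ATTACHMENT_WHITE_LIST=1
FT_INDEX_FILTER_ATTACHMENT_TYPES=*.csv,*.md
FT_INDEX_FILTER_ATTACHMENT_TYPES_MAX_MB=25
```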

Overriding the white list

Set FT_USE_MY_ATTACHMENT_WHITE_LIST=1 along with FT_INDEX_FILTER_ATTACHMENT_TYPES to exclusively use a custom list of files to be indexed.

Note: If FT_USE_MY_ATTACHMENT_WHITE_LIST=1 is set but FT_INDEX_FILTER_ATTACHMENT_TYPES is not, no file attachments are indexed for any database on the server.
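
A minimal sketch of a fully custom white list follows. The three entries are an arbitrary selection made for this example; they reuse the trailing-wildcard form that appears in the default white list.

```ini
FT_USE_MY_ATTACHMENT_WHITE_LIST=1
FT_INDEX_FILTER_ATTACHMENT_TYPES=*.pdf,*.doc*,*.xls*
```

With these two settings, only attachments matching the listed patterns would be candidates for indexing.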

Extending the white list for a particular database

Whichever white list is in effect on the system can be additionally extended for a specific database via the setting FT_INDEX_FILTER_ATTACHMENT_TYPES_<database replica id>. The white list in effect can be either the default list or a list that has been extended or replaced via FT_INDEX_FILTER_ATTACHMENT_TYPES.

Also, any attachment file types appearing in this list can be size-capped by specifying the setting FT_INDEX_FILTER_ATTACHMENT_TYPES_MAX_MB_<database replica id> if so desired.
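
The per-database settings might be combined as sketched below. The <database replica id> placeholder is left exactly as written above and must be replaced with the actual replica ID of the target database; the *.csv extension and the 10 MiB cap are hypothetical values.

```ini
FT_INDEX_FILTER_ATTACHMENT_TYPES_<database replica id>=*.csv
FT_INDEX_FILTER_ATTACHMENT_TYPES_MAX_MB_<database replica id>=10
```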

Controlling text retrieval

Once the full-text subsystem has determined that an attachment will be indexed, the next decision is how to extract text from that file attachment. Two methods exist: an intelligent parser (Tika) and ASCII text-stripping.

By default, files are sent to the intelligent parser unless their extensions are explicitly listed in the ASCII text-stripping list. While the intelligent parser typically returns more relevant text tokens to the indexer, it is slower than raw ASCII text-stripping. Text-stripping, however, can return many more superfluous tokens (such as text formatting elements) to the indexer, which may decrease search accuracy.

Note: In the case where the majority of file attachments contain predominantly non-ASCII characters, it's advisable to force all file attachments through the intelligent parser.

The following is the default list of file extensions for ASCII text-stripping:

*.ans,*.ascii,*.log,*.out,*.sms,*.text,*.txt,*.uni,*.utxt

Extending the ASCII text-stripping list

Similar to the ignore list and white list, the text-stripping list can be extended by adding entries via the FT_INDEX_BRUTE_FORCE_ATTACHMENT_TYPES notes.ini setting. Again, list file type extensions, each with a wildcard character (*), separated by commas and containing no space characters.
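
As an example, the following fragment would add two extensions to the default text-stripping list. The extensions *.csv and *.cfg are hypothetical choices, picked because such files are usually plain text.

```ini
FT_INDEX_BRUTE_FORCE_ATTACHMENT_TYPES=*.csv,*.cfg
```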

Overriding the ASCII text-stripping

Starting in Domino 14, you can set FT_USE_MY_ATTACHMENT_BRUTE_LIST=1 along with FT_INDEX_BRUTE_FORCE_ATTACHMENT_TYPES to exclusively use a custom list of files to be text-stripped.
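
A sketch of such an override, assuming only *.txt and *.log files should be text-stripped (an arbitrary selection for this example), could read:

```ini
FT_USE_MY_ATTACHMENT_BRUTE_LIST=1
FT_INDEX_BRUTE_FORCE_ATTACHMENT_TYPES=*.txt,*.log
```

Under these settings, the other default text-stripping extensions, such as *.ans or *.out, would no longer be on the text-stripping list.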

Disabling the ASCII text-stripping

Set FT_DISABLE_BRUTE_FORCE=1 to prevent sending attachments through ASCII text-stripping.

Disabling attachment file name indexing

By default, both the intelligent parser and the ASCII text-stripper record the name of the file from which text is retrieved. If you prefer that users not search for attachment file names, then use the DISABLE_ATTACHMENT_SEARCH_BY_FILENAMES=1 setting.