Standard analyzer

The Standard analyzer removes stopwords and indexes words, numbers, and some special characters. It is the default analyzer.

The Standard analyzer processes text characters in the following ways:

  • Does not index stopwords.
  • Converts alphabetical characters to lowercase.
  • Ignores colons, #, %, $, parentheses, hyphens, and slashes.
  • Indexes underscores, @, and & symbols when they are part of words or numbers.
  • Indexes numbers and words separately if the numbers appear at the beginning of a word.
  • Indexes numbers as part of the word if they are within or at the end of the word.
  • Indexes apostrophes if they are in the middle of a word, but removes them if they are at the beginning or end of a word.
  • Ignores an apostrophe followed by the letter s at the end of a word.
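
The rules above can be approximated in a few lines of Python. This is only an illustrative sketch, not the analyzer's actual implementation: the function name standard_tokens, the STOPWORDS set, and the regular expressions are all hypothetical, and the stopword list is a placeholder because the real list is not given in this topic. The sketch also assumes that a period between alphanumeric characters stays inside a token, which is inferred from the e-mail address example in this topic and is not one of the stated rules.

```python
import re

# Hypothetical stopword list for illustration only; the analyzer's real
# stopword list is not specified in this topic.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}

# A candidate token is a run of letters and digits, optionally joined by a
# single _ @ & or apostrophe between alphanumeric runs. Keeping "." as a
# joiner is an assumption drawn from the e-mail address example.
# Everything else (colons, #, %, $, parentheses, hyphens, slashes, spaces)
# acts as a separator, and apostrophes at word edges are never matched.
CANDIDATE = re.compile(r"[a-z0-9]+(?:[_@&'.][a-z0-9]+)*")

# Digits at the start of a word, immediately followed by a letter.
LEADING_DIGITS = re.compile(r"([0-9]+)([a-z].*)")


def standard_tokens(text):
    """Return an approximation of the tokens the rules above describe."""
    tokens = []
    for candidate in CANDIDATE.findall(text.lower()):
        # Ignore an apostrophe followed by s at the end of a word.
        if candidate.endswith("'s"):
            candidate = candidate[:-2]
        # Split leading digits from the rest of the word.
        match = LEADING_DIGITS.fullmatch(candidate)
        pieces = match.groups() if match else (candidate,)
        # Drop stopwords.
        tokens.extend(p for p in pieces if p not in STOPWORDS)
    return tokens
```

For instance, under these assumptions standard_tokens("c:/informix") returns the tokens c and informix, because the colon and the slash act only as separators.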

Examples

In these examples, the input string is shown on the first line and the resulting tokens are shown on the second line, each surrounded by square brackets.

In the following example, stopwords are removed and the words are converted to lowercase:

The Quick Brown Fox Jumped Over The Lazy Dog
[quick] [brown] [fox] [jumped] [over] [lazy] [dog]

In the following example, the apostrophe at the beginning of a word and the apostrophe followed by an s are ignored, but the apostrophe in the middle of a word is indexed:

Prequ'ile Mark's 'cause 
[prequ'ile] [mark] [cause]

In the following example, the colon and slash are ignored:

c:/informix 
[c] [informix]

In the following example, the ampersand is indexed as part of the company name:

XY&Z Corporation 
[xy&z] [corporation]

In the following example, the e-mail address is indexed as is:

xyz@example.com
[xyz@example.com]

In the following example, numbers at the beginning of the words are separated into different tokens, while numbers at the end of words are included in a single token:

1abc 12abc abc1 abc12
[1] [abc] [12] [abc] [abc1] [abc12]