Soundex analyzer

The Soundex analyzer uses the Soundex algorithm to convert words into four-character codes based on the English pronunciation of their consonants.

Vowel sounds are not included unless the vowel is the first letter of the word. Additional sounds beyond the first four phonetic sounds are ignored. If a word has fewer than four phonetic sounds, zeros are used to complete the four-character code. The Soundex analyzer is similar to the eSoundex analyzer except that it always uses four characters in its codes, regardless of the length of the word.

The Soundex analyzer is useful if you want to search text based on how the beginnings of words sound. Because the text is converted to codes, you cannot perform proximity and range searches or specify a thesaurus.
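The encoding rules above can be sketched in a few lines of Python. This is a minimal illustration of the classic American Soundex rules, not the analyzer's actual implementation. The function name soundex and the table name _CODES are hypothetical, and the handling of h and w (skipped without separating equal codes) is an assumption taken from the classic algorithm. Codes are returned in lowercase to match the tokens shown in the examples.

```python
# Letter-to-digit table of the classic (American) Soundex algorithm.
# Vowels, y, h, and w have no digit and are absent from the table.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return a four-character Soundex code for an alphabetic word."""
    word = word.lower()
    if not word.isalpha():
        raise ValueError("expected a non-empty alphabetic word")
    code = word[0]                    # the first letter is kept as-is
    prev = _CODES.get(word[0], "")    # digit of the previous sound, if any
    for ch in word[1:]:
        if ch in "hw":
            continue                  # assumption: h and w do not separate equal codes
        digit = _CODES.get(ch, "")    # "" for vowels and y
        if digit and digit != prev:
            code += digit             # new consonant sound
        prev = digit                  # a vowel resets the previous sound
        if len(code) == 4:
            break                     # sounds beyond the fourth character are ignored
    return code.ljust(4, "0")         # pad short codes with zeros
```

Under these assumptions, soundex("jumped") returns "j513" and soundex("dog") returns "d200".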

The Soundex analyzer processes text characters in the following ways:

  • Stopwords are not indexed.
  • Numbers and special characters are ignored.
  • The colon (:) character is treated as whitespace, so the characters on either side of it are considered separate words.
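The three rules in this list can be sketched as a preprocessing step that runs before words are converted to codes. This is an illustration only: the function name tokenize and the one-word STOPWORDS set are hypothetical, and checking stopwords after other characters are removed is an assumption, not documented behavior.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the"}


def tokenize(text):
    """Split text into lowercase words ready for Soundex encoding."""
    words = []
    # The colon is treated as whitespace, so it separates words.
    for chunk in text.replace(":", " ").split():
        # Numbers and special characters are ignored (dropped from the word).
        letters = re.sub(r"[^a-z]", "", chunk.lower())
        # Stopwords and chunks with no letters left are not indexed.
        if letters and letters not in STOPWORDS:
            words.append(letters)
    return words
```

With this sketch, tokenize("c:/informix") returns ["c", "informix"], and tokenize("xyz@example.com") returns the single word ["xyzexamplecom"].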

Examples

In these examples, the input string is shown on the first line and the resulting tokens are shown on the second line, each surrounded by square brackets. All codes consist of four characters.

In the following example, both occurrences of the word "the" are omitted because they are stopwords. The remaining words are converted to Soundex codes, each of which begins with the first letter of the word:

The Quick Brown Fox Jumped Over The Lazy Dog
[q200] [b650] [f200] [j513] [o160] [l200] [d200]

In the following example, the colon is treated as whitespace and the slash (/) is ignored:

c:/informix 
[c000] [i516]

In the following example, the ampersand is ignored:

XY&Z Corporation 
[x200] [c616]

In the following example, the e-mail address is considered one word:

xyz@example.com
[x225]

In the following example, numbers are ignored:

1abc 12abc abc1 abc12
[a120] [a120] [a120] [a120]

In the following examples, three words that share the same stem have the same code, because only the first four phonetic sounds of each word are encoded:

accept
[a213]
acceptable
[a213]
acceptance
[a213]