CJK analyzer

The CJK analyzer processes Chinese, Japanese, and Korean characters into tokens that are indexed.

The CJK analyzer processes text characters in the following ways:

  • Transforms the character sets to UCS-4. Half-width and full-width forms are converted to their equivalent characters. For example, fullwidth_digit_zero and digit_zero are treated as the same character.
  • Indexes Chinese, Japanese, and Korean characters in overlapping pairs.
  • Indexes Latin alphabetic characters, numeric characters, and the special characters _, +, and #.
  • Does not index stopwords.
  • Does not process supplementary code points if the analyzer name is cjk.
  • Processes supplementary code points as surrogate pairs if the analyzer name is cjk.ws.
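
The width conversion and the surrogate-pair behavior described above can be illustrated with a short Python sketch. It is not the analyzer's own code: Unicode NFKC normalization is used here only as a stand-in for the analyzer's width mapping (NFKC also folds other compatibility characters), and the helper names fold_width and utf16_units are hypothetical.

```python
import unicodedata


def fold_width(text):
    # Assumption: NFKC approximates the analyzer's half-width/full-width
    # conversion. It maps full-width digits and letters to their ASCII
    # forms and half-width katakana to full-width katakana.
    return unicodedata.normalize("NFKC", text)


def utf16_units(ch):
    # Number of UTF-16 code units needed for one character. A
    # supplementary code point (above U+FFFF) needs two units,
    # which together form a surrogate pair.
    return len(ch.encode("utf-16-le")) // 2


print(fold_width("\uff10") == "0")   # FULLWIDTH DIGIT ZERO vs DIGIT ZERO
print(utf16_units("\u65e5"))         # a BMP ideograph
print(utf16_units("\U00020000"))     # a supplementary ideograph
```

A character that reports two code units is one that the cjk.ws analyzer would handle as a surrogate pair, while the cjk analyzer would not process it.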

Examples

In the following example, the first line shows the input string, in which C1, C2, C3 and C4 represent Chinese, Japanese, or Korean characters. The second line shows the resulting tokens, each surrounded by square brackets:

sailC1C2C3C4boat
[sail] [C1C2] [C2C3] [C3C4] [boat]
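
The tokenization shown above can be approximated with the following Python sketch. It is an illustration under stated assumptions, not the analyzer's implementation: the function names are hypothetical, the CJK test covers only a few common Unicode blocks, a CJK run of a single character is emitted as one token, and stopword removal and width conversion are left out.

```python
def is_cjk(ch):
    # Assumption: a simplified CJK test covering a few common blocks.
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF       # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF    # Hiragana and Katakana
            or 0xAC00 <= cp <= 0xD7A3)   # Hangul syllables


def is_word_char(ch):
    # Latin letters, digits, and the special characters _, +, and #.
    return ch.isascii() and (ch.isalnum() or ch in "_+#")


def cjk_tokens(text):
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if is_cjk(text[i]):
            # Collect the whole CJK run, then emit overlapping pairs.
            j = i
            while j < n and is_cjk(text[j]):
                j += 1
            run = text[i:j]
            if len(run) == 1:
                tokens.append(run)  # assumption: lone character kept as-is
            else:
                tokens.extend(run[k:k + 2] for k in range(len(run) - 1))
            i = j
        elif is_word_char(text[i]):
            # Collect a run of Latin, numeric, or special characters.
            j = i
            while j < n and is_word_char(text[j]):
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            i += 1  # any other character acts as a separator
    return tokens


print(cjk_tokens("sail\u65e5\u672c\u8a9e\u7248boat"))
```

With four CJK characters between sail and boat, the sketch yields the two Latin tokens plus three overlapping pairs, matching the pattern in the example.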