The etx_CreateCharSet() routine

The etx_CreateCharSet() routine creates a user-defined character set.

Syntax

etx_CreateCharSet (charset_name, file_name)

Element | Purpose | Data type
charset_name | Name of your user-defined character set. If you enter a name longer than 18 characters, the name is silently truncated to 18 characters. | CHAR (18)
file_name | Absolute path name of the operating system file from which the text search engine loads the character set. The file can be on either the server or the client machine. The client machine is searched first. | LVARCHAR

Return type

None.

Usage

When you create an etx index on a column, you can specify the character set used to index the text data. The character set indicates which letters are to be indexed; any characters in the text data that are not listed in the character set are converted to blanks. Use the CHAR_SET index parameter to specify the name of the character set. You must create a user-defined character set before you use it to create an etx index.

The module provides three built-in character sets: ASCII, ISO, and OVERLAP_ISO. Each of these built-in character sets includes only alphanumeric characters and maps lowercase letters to uppercase. This is sufficient for most text searches. For a complete description of the three built-in character sets, see Character sets.

There are times, however, when you might want to index nonalphanumeric characters or distinguish between lowercase and uppercase letters. In these cases, you must define your own character set.

To define your own character set, first create an operating system file that specifies the characters you want to index. The next section describes in detail the structure of this operating system file.

Then create the character set by executing the etx_CreateCharSet() routine. The routine takes two parameters: the name you give the user-defined character set and the full path name of the operating system file that contains the characters to be indexed. The new user-defined character set is stored in the default sbspace.
Restriction: You cannot use the keywords ASCII, ISO, or OVERLAP_ISO (in any combination of uppercase and lowercase letters) as names for your user-defined character set, since these keywords are reserved for the built-in character sets.
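
The naming rules above (silent truncation to 18 characters and the three reserved keywords) can be checked before you call the routine. The following Python sketch is only an illustration: the helper name stored_charset_name is hypothetical and is not part of the module, and it assumes the name contains only single-byte characters.

```python
# Hypothetical pre-flight check for a character set name.
# It mirrors the documented rules; it is not part of the text search module.

RESERVED_NAMES = {"ASCII", "ISO", "OVERLAP_ISO"}  # built-in character sets
MAX_NAME_LENGTH = 18                              # charset_name is CHAR (18)


def stored_charset_name(name: str) -> str:
    """Return the name as it would be kept after truncation to 18 characters.

    Raises ValueError if the name is one of the reserved keywords,
    compared without regard to uppercase or lowercase.
    """
    if name.upper() in RESERVED_NAMES:
        raise ValueError(f"{name!r} is reserved for a built-in character set")
    return name[:MAX_NAME_LENGTH]


if __name__ == "__main__":
    print(stored_charset_name("my_charset"))
    print(stored_charset_name("a_character_set_name_that_is_too_long"))
```

Comparing the returned value with the name you passed in is a simple way to notice a truncation that would otherwise go unreported.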

To use the user-defined character set, specify its name in the CHAR_SET index parameter of the CREATE INDEX statement.

Structure of the operating system character set file

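The operating system file consists of 16 lines of 16 hexadecimal numbers each (256 values in total), plus optional lines that contain comments. Each position corresponds to one character code: reading left to right and top to bottom, the positions represent the codes 0x00 through 0xFF, so the first eight lines cover the ASCII characters. If you want the character in the position to be indexed, enter the hexadecimal value of the character. If you do not want the character to be indexed, enter 00.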

The ISO 8859-1 table in Character sets lists the ISO 8859-1 character set that can be used as a reference when creating the operating system file.

Comments begin with a slash, a hyphen, or a pound sign, and they can appear anywhere in the file.
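
To check a file against this layout before loading it, a small validator can help. The following Python sketch is an illustration under stated assumptions, not the parser that the text search engine uses: the function name parse_charset_file is hypothetical, blank lines are assumed to be ignorable, and a leading backslash is treated as a comment marker only because the sample file in this section uses one.

```python
# Hypothetical validator for a character set file: 16 data lines of
# 16 hexadecimal values each, plus optional comment lines.

# Assumption: "\" is accepted as a comment marker because the sample
# file in this section starts one comment line with a backslash.
COMMENT_PREFIXES = ("#", "/", "-", "\\")


def parse_charset_file(text: str) -> list[int]:
    """Return the 256 table values, or raise ValueError on a malformed file."""
    table = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Assumption: blank lines are skipped like comment lines.
        if not stripped or stripped.startswith(COMMENT_PREFIXES):
            continue
        fields = stripped.split()
        if len(fields) != 16:
            raise ValueError(
                f"line {line_no}: expected 16 values, found {len(fields)}"
            )
        for field in fields:
            value = int(field, 16)  # raises ValueError if not hexadecimal
            if not 0x00 <= value <= 0xFF:
                raise ValueError(f"line {line_no}: {field} is out of range")
            table.append(value)
    if len(table) != 256:
        raise ValueError(f"expected 16 data lines, found {len(table) // 16}")
    return table
```

The returned list is indexed by character code, so table[0x2D] holds the entry for the hyphen.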

For example, suppose you want to create a user-defined character set that indexes hyphens (hexadecimal value 0x2D), underscores (hexadecimal value 0x5F), backslashes (hexadecimal value 0x5C), and forward slashes (hexadecimal value 0x2F), in addition to the alphanumeric characters 0 through 9, a through z, and A through Z. The character set also maps the lowercase letters a through z to uppercase. The operating system file would look like the following example:
# Character set that indexes hyphens, underscores,
/ forward slashes, backslashes, and alphanumeric
\ characters. All lowercase letters are mapped to uppercase.
- Note the different ways of specifying that a
# line is a comment.
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 2D 00 2F
30 31 32 33 34 35 36 37 38 39 00 00 00 00 00 00
00 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 00 5C 00 00 5F
00 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

This is similar to the built-in ASCII character set, except that hyphens, underscores, backslashes, and forward slashes are also indexed instead of being converted to blanks. These four characters are indexed because the position in the matrix for each character contains its own hexadecimal representation: 0x2D, 0x5F, 0x5C, and 0x2F, respectively.
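
Typing 256 values by hand is error-prone, so you might prefer to generate the file. The following Python sketch is one possible way to produce a matrix with the same values as the example above. The function names and the single pound-sign comment line are choices made for this illustration, not requirements of the module.

```python
# Hypothetical generator for the example character set file.
import string


def build_example_table() -> list[int]:
    """Build the 256-entry table: 00 means the character is not indexed."""
    table = [0x00] * 256
    # Digits, uppercase letters, and the four extra characters index as themselves.
    for ch in string.digits + string.ascii_uppercase + "-_\\/":
        table[ord(ch)] = ord(ch)
    # Lowercase letters hold the code of the matching uppercase letter.
    for ch in string.ascii_lowercase:
        table[ord(ch)] = ord(ch.upper())
    return table


def format_charset_file(table: list[int], comment: str) -> str:
    """Render one comment line followed by 16 lines of 16 hexadecimal values."""
    lines = [f"# {comment}"]
    for row in range(16):
        values = table[row * 16:(row + 1) * 16]
        lines.append(" ".join(f"{value:02X}" for value in values))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_charset_file(build_example_table(), "Generated example"), end="")
```

Writing the returned string to a file at the absolute path that you later pass as file_name completes the first step of defining the character set.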

All lowercase letters are mapped to uppercase by specifying the uppercase hexadecimal value in the lowercase letter position.

For example, uppercase letter A has a hexadecimal value of 0x41. The position in the matrix of uppercase A contains the hexadecimal value 0x41; thus, uppercase A is indexed as uppercase A.

However, the position in the matrix of lowercase a also contains the hexadecimal value 0x41 (which represents uppercase A) instead of the actual hexadecimal representation of lowercase a, 0x61. Thus, lowercase a is mapped to uppercase A, or in other words, lowercase a is indexed as if it were the same as uppercase A. The same is true for all the letters a through z and A through Z.
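
The effect of the matrix on text can be modeled in a few lines. The following Python sketch only illustrates the mapping rule described in this section (a 00 entry turns the character into a blank, any other entry replaces the character with the stored value). It is not the search engine's implementation, and the function names are hypothetical.

```python
# Hypothetical model of how a character set table maps text for indexing.
import string


def build_example_table() -> list[int]:
    """Table equivalent to the example file: 00 means not indexed."""
    table = [0x00] * 256
    for ch in string.digits + string.ascii_uppercase + "-_\\/":
        table[ord(ch)] = ord(ch)
    for ch in string.ascii_lowercase:
        table[ord(ch)] = ord(ch.upper())  # lowercase position holds uppercase code
    return table


def apply_charset(text: str, table: list[int]) -> str:
    """Replace each character with its table entry, or a blank if the entry is 00."""
    result = []
    for ch in text:
        code = ord(ch)
        # Assumption for this sketch: codes above 0xFF are treated as not indexed.
        mapped = table[code] if code < 256 else 0x00
        result.append(chr(mapped) if mapped else " ")
    return "".join(result)


if __name__ == "__main__":
    print(apply_charset("Read-me_v2.txt", build_example_table()))
    # prints READ-ME_V2 TXT
```

In this model the period becomes a blank because its position holds 00, while the hyphen and underscore survive and the lowercase letters come out as uppercase.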

For more information about the ISO 8859-1 table, refer to ids_excal_144.html#ids_excal_144.

Example

The following example creates a user-defined character set named my_charset:
EXECUTE PROCEDURE etx_CreateCharSet 
    ('my_charset', '/local0/excal/my_char_set_file');

The search engine loads the contents of my_charset from the operating system file /local0/excal/my_char_set_file and stores the new character set in the default sbspace.