Document scoring for fuzzy searches

When you perform a fuzzy search, some of the rows that the text search engine returns might satisfy search criteria better than others. To determine the degree of similarity between your clue and each of the rows returned, you can instruct the search engine to assign a value, called a document score, to each of the rows.

The module uses the following two rules when it assigns document scores:
  • It always scores exact matches slightly higher than pattern matches.
  • It scores all pattern matches equally, even if some matches appear to approximate the clue better than others.

Document scores vary from 0 to 100, with 0 indicating no match and 100 indicating an exact match. Values 0 - 100 indicate approximate matches; the higher the value, the closer the match. A null value indicates that the row was not ranked. This could happen if the row was returned because of a non-etx_contains() operator in a WHERE clause that contains an OR predicate.

To access document score information, use a statement local variable (SLV) as the optional third parameter to the etx_contains() operator. The data type of the SLV is etx_ReturnType, a row type derived by HCL OneDB™ that consists of two fields. The scoring information is contained in the score field.

The score field contains a numeric value that indicates the relevance of a returned document to the search criteria, compared to that of other indexed records. The higher the document score value, the more closely the document matches the criteria.

You can use the score field to order the returned rows according to how closely documents match the search criteria. For example, if rc is an SLV, you can use it to obtain document score information, as shown in the following example:
SELECT rc.score, id, description FROM videos
    WHERE etx_contains(description,
    Row('multimedia' , 
        'PATTERN_TRANS & PATTERN_SUBS'), rc # etx_ReturnType)
    ORDER BY 1;
The following figure displays the resulting hitlist when the query in the example is run on the videos table. The column score in the hitlist contains scoring information.
Figure 1: Sample text search query that uses SLVs

Shows the use of the PATTERN_TRANS and PATTERN_SUBS tuning parameters to search for the keyword "multimedia" and the use of an SLV to obtain scoring information. The hitlist from the videos table is the rows in ranked order with "multimedia, showing a score of 100.00 and "mulitmedia" and "multymedia" each showing a score of 99.00.

The SLV has a scope that is limited to the statement in which you use it. It is a way for the Excalibur text search engine to send back information about the search it just performed to the etx_contains() operator that called it.

Although the data type of the SLV is always etx_ReturnType, you must still explicitly specify its type when you use it in the etx_contains() operator, as shown in the example. The example also shows how you can use the ORDER BY clause to instruct the database server to rank the rows it returns by the score field of the SLV.

For more information about the etx_ReturnType data type, its two fields, and how to use the information contained in the fields, see The etx_ReturnType data type. For more general information about SLVs, see the HCL OneDB Guide to SQL: Syntax.