User-defined statistics

User-defined statistics provide a way to improve performance when you compare opaque data type values. User-defined statistics compile information about the values in an opaque data type column that the optimizer can use when it creates a query plan when it needs to execute routines that compare opaque data type values.

Statistics typically consist of the following types of information about the specified column; however, you can collect more information if it is appropriate for your opaque data type:

Minimum value
Maximum value
Distribution of values

When your statistics-gathering function calculates the distribution of column values, it can assign each value to a bin. Each bin contains a range of values. For example, suppose the column values range from 1 - 10. You could have five bins: the first bin would hold values from 1 - 2, the second bin would hold values from over 2 - 4, and so on. The database server generates statistics by calling your statistics-gathering function when you run the UPDATE STATISTICS statement in medium or high mode.

Important: You must understand your data and how users will query it to create meaningful statistics.

The minimum, maximum, and distribution of values can be used to compute the selectivity of a value. The optimizer can then use the selectivity of values when it determines query cost estimates. For example, suppose you want to join two tables. Normally, a join compares all values in one table to all values in the other table. However, if the optimizer knows that one of the tables has low selectivity, it can efficiently order the joins.

Selectivity is an estimate of the percentage of rows that will be returned by a filter in a query. Selectivity values range from 0.0 to 1.0, where 0.0 indicates a selective filter that passes few rows and 1.0 indicates a filter that passes almost all rows. The optimizer uses selectivity information to reorder expressions in the query predicate so that filters that are expensive to call given the values of their arguments are evaluated after filters that are inexpensive to call. Thus the optimizer reduces the number of comparisons and improves performance. To determine the selectivity of a routine, the database server calls the associated selectivity routine.

For example, suppose you have an opaque data type that represents a circle and you have created a distribution for the circle type based on the radius. Assume that the values of the radius range from 5 - 15. If you query for all circles with a radius of less than 4, the selectivity of the LessThan() function that handles the circle data type is 0 because no values qualify. Consequently, the optimizer would not execute the LessThan() function. Alternatively, if you query for all circles with a radius of greater than 4, the selectivity of the GreaterThan() function that handles the circle data type is 1.0 because all values qualify. Consequently, the optimizer would execute the GreaterThan() function after all other operations in the query predicate.

You can define selectivity routines for user-defined functions with the following characteristics:

Functions that compare two opaque data types
Functions that return a Boolean value
Functions that act as filters (called in the WHERE clause of a SELECT statement)

For example, you can define selectivity functions for the Equal(), LessThan(), and GreaterThan() functions that are overloaded for an opaque data type. You can also define a selectivity function for a function like Contains() that compares two opaque data types.

To implement user-defined statistics, you must supply the following routines:

Statistics support functions that collect statistics for opaque data types.
User-defined selectivity routines that use statistics to estimate the selectivity of a routine that compares opaque data type values.

After you define the routines in BladeSmith, you must add code to them to provide the required functionality.

To determine whether your opaque data type needs user-defined statistics, consider the following questions:

Do you know enough about the data and how users will access it to write routines that compile meaningful statistics?
Do the routines that compare your opaque data type consume large amounts of memory or disk space?