Specifying the SAMPLING SIZE

In MEDIUM mode, you can optionally use the SAMPLING SIZE keywords to specify the minimum number of rows to sample for calculating column distribution statistics. If the Resolution clause omits the RESOLUTION keyword and specifies no confidence level and no percent value, then the number of rows that HCL OneDB™ samples will be the larger of the following two values:
  • The min value that you specify immediately after the SAMPLING SIZE keywords
  • The sampling size that is required for the default percent of rows in each bin (2.5%) and for the minimum confidence level (0.80).
If a sampling size is specified in a Resolution clause that includes explicit values for both the average percent of sampled rows per bin and for the confidence level, then the number of sampled rows will be the larger of these two values:
  • The min value that you specify immediately after the SAMPLING SIZE keywords
  • The sampling size that is required for the specified percent of rows and for the specified confidence level.

If a sampling size is specified in a Resolution clause that includes an average percentage value but sets no confidence level, then the minimum confidence value of 0.80 is used to calculate the actual sampling size for HCL OneDB to use if the specified size is smaller.

For example, the following statement calculates statistics for two columns of the customer table, without updating index information. At least 200 rows will be sampled, but the actual size of the sample might be larger than 200 if more rows are required to provide the default 0.80 confidence level for a sample distribution that uses approximately 50 equivalence categories, with an average percentage of 2% of the sampled values in each bin.
UPDATE STATISTICS MEDIUM FOR TABLE customer (city, state) 
   SAMPLING SIZE 200 RESOLUTION 2 DISTRIBUTIONS ONLY;

Whether or not you include an explicit SAMPLING SIZE specification in the Resolution clause, HCL OneDB records in the system catalog the actual sampling size (as a percentage of the total number of rows in the table) at the time of MEDIUM mode UPDATE STATISTICS creation.