Data integrity issues

Ideally, a dimension is configured to capture all meaningful values. In practice, however, this may not be possible. Whether you are using Whitelist Only or Whitelist + Observed Values, there are data integrity issues that you should consider.

If you configured a Whitelist Only dimension, the whitelist defines the complete set of values that can be recorded. Any value that is detected in the capture stream but does not appear in the whitelist is recorded as [Others] in the data set. You can still perform data analysis, but detailed analysis of the individual values that are grouped under [Others] is not possible.
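The Whitelist Only behavior described above can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function name record_whitelist_only and the sample browser values are hypothetical.

```python
# Minimal sketch of Whitelist Only recording (hypothetical names, not a product API).
OTHERS = "[Others]"


def record_whitelist_only(detected_value, whitelist):
    """Return the value that is stored for a Whitelist Only dimension."""
    # Values in the whitelist are stored as-is; everything else collapses to [Others].
    return detected_value if detected_value in whitelist else OTHERS


# Sample data, assumed for illustration only.
whitelist = {"Chrome", "Firefox"}
stream = ["Chrome", "Safari", "Firefox", "Opera"]

recorded = [record_whitelist_only(value, whitelist) for value in stream]
print(recorded)  # ['Chrome', '[Others]', 'Firefox', '[Others]']
```

After recording, Safari and Opera cannot be told apart, which is why detailed analysis of [Others] values is not possible.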

For Whitelist + Observed Values, data integrity is more complicated. Suppose that your dimension is configured to capture 1000 values per hour in a one-Canister environment, and you are interested in two values: Value A and Value B.

  • Neither value is stored in the whitelist.

The following table shows the recorded values. Each Detected # column shows the number of values that had already been captured for the dimension in that hour when the value instance was detected. If that number is over 1000, then by default the value is recorded as [Others].

Table 1. Data integrity issues
Hour  Value A Detected #  Value A Recorded  Value B Detected #  Value B Recorded
1     100                 Value A           200                 Value B
2     100                 Value A           1200                [Others]
3     1100                [Others]          200                 Value B
4     1100                [Others]          1200                [Others]

According to the recorded data, Value A occurred only in Hours 1 and 2, and Value B occurred only in Hours 1 and 3, even though both values were actually detected in all four hours.

However, if both values are added to the whitelist, then they are detected and recorded as Value A and Value B every hour.
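The Whitelist + Observed Values example can be reproduced with a short Python sketch. It assumes a simplified model in which a whitelisted value is always recorded, and a non-whitelisted value is recorded only if no more than 1000 values were captured before it in that hour. The names record_value, simulate, and DETECTED are hypothetical and are not part of the product.

```python
# Minimal sketch of Whitelist + Observed Values recording (assumed model, hypothetical names).
OTHERS = "[Others]"
MAX_VALUES_PER_HOUR = 1000  # capture limit used in the example above

# Detected # figures from Table 1: values already captured when the instance was detected.
DETECTED = {
    1: {"Value A": 100, "Value B": 200},
    2: {"Value A": 100, "Value B": 1200},
    3: {"Value A": 1100, "Value B": 200},
    4: {"Value A": 1100, "Value B": 1200},
}


def record_value(value, captured_before, whitelist=frozenset(), limit=MAX_VALUES_PER_HOUR):
    """Return the value that is stored for one detected instance."""
    if value in whitelist:
        return value  # whitelisted values are always recorded
    if captured_before > limit:
        return OTHERS  # over the hourly limit: recorded as [Others]
    return value  # observed value that fits within the limit


def simulate(whitelist=frozenset()):
    """Return {hour: {value: recorded value}} for the figures in DETECTED."""
    return {
        hour: {value: record_value(value, count, whitelist) for value, count in counts.items()}
        for hour, counts in DETECTED.items()
    }


without_whitelist = simulate()
with_whitelist = simulate(whitelist=frozenset({"Value A", "Value B"}))

for hour in sorted(DETECTED):
    print(hour, without_whitelist[hour], with_whitelist[hour])
```

Without the whitelist, the output matches the Recorded columns in Table 1. With both values in the whitelist, every hour records Value A and Value B.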

Note: Whether you are using Whitelist Only or Whitelist + Observed Values, it is important to review and update your whitelists regularly to maintain data integrity and to limit the volume of captured data.