Data integrity issues
Ideally, a dimension is configured to capture all meaningful values. In practice, however, this may not be possible. Whether you are capturing whitelisted values only or whitelisted values plus observed values, there are data integrity issues to consider.
If you configured a Whitelist Only dimension, the whitelist is the set of all values. All values that are detected in the capture stream and do not appear in the whitelist are identified as [Others] values in the data set. You can still perform data analysis, but detailed analysis on the individual [Others] values is not possible.
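The Whitelist Only behavior can be sketched in a few lines of Python. This is an illustrative model only, not product code: the function name `record_whitelist_only` and the sample values are assumptions made for the example.

```python
OTHERS = "[Others]"

def record_whitelist_only(observed, whitelist):
    """Return the value recorded for each observed value.

    Hypothetical helper: a value is kept only if it is in the whitelist;
    anything else is collapsed into the [Others] bucket.
    """
    return [value if value in whitelist else OTHERS for value in observed]

# Sample data invented for illustration.
whitelist = {"Chrome", "Firefox"}
stream = ["Chrome", "Safari", "Firefox", "Opera"]

recorded = record_whitelist_only(stream, whitelist)
print(recorded)  # ['Chrome', '[Others]', 'Firefox', '[Others]']
```

Once "Safari" and "Opera" are both recorded as [Others], they can no longer be told apart, which is why detailed analysis of individual [Others] values is not possible.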
For Whitelist + Observed Values, data integrity is a bit more complicated. Suppose your dimension is configured to capture 1000 values per hour in a one-Canister environment, and you are interested in two values: Value A and Value B.
- Neither value is stored in the whitelist.
Recorded values are indicated below. The Detected columns indicate the number of values that were captured for the dimension before the value instance was detected; if the number of values is over 1000, then by default the recorded value is [Others].
Hour | Value A Detected # | Value A Recorded | Value B Detected # | Value B Recorded |
---|---|---|---|---|
1 | 100 | Value A | 200 | Value B |
2 | 100 | Value A | 1200 | [Others] |
3 | 1100 | [Others] | 200 | Value B |
4 | 1100 | [Others] | 1200 | [Others] |
According to the recorded data, Value A occurred only in Hours 1 and 2, while Value B occurred only in Hours 1 and 3, even though both values were detected in all four hours.
However, if both values are added to the whitelist, then they are detected and recorded every hour.
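The following Python sketch reproduces the four-hour example under a simplified model of the recording rule: a whitelisted value is always recorded, and any other value is recorded only if no more than 1000 values were captured before it in that hour. The names `recorded_value` and `simulate` are hypothetical and exist only for this illustration; the actual capture logic may differ.

```python
OTHERS = "[Others]"
HOURLY_LIMIT = 1000  # values per hour, taken from the example scenario

def recorded_value(value, captured_before, whitelist=frozenset(), limit=HOURLY_LIMIT):
    """Simplified rule (an assumption, not the product's implementation):
    whitelisted values are always recorded; others are recorded only
    while the hourly limit has not been exceeded."""
    if value in whitelist:
        return value
    return value if captured_before <= limit else OTHERS

# (hour, Value A "Detected #", Value B "Detected #") from the table.
HOURS = [(1, 100, 200), (2, 100, 1200), (3, 1100, 200), (4, 1100, 1200)]

def simulate(whitelist=frozenset()):
    """Return (hour, recorded A, recorded B) for each hour."""
    return [
        (hour,
         recorded_value("Value A", a_count, whitelist),
         recorded_value("Value B", b_count, whitelist))
        for hour, a_count, b_count in HOURS
    ]

without_whitelist = simulate()
with_whitelist = simulate({"Value A", "Value B"})

for row in without_whitelist:
    print("observed only:", row)
for row in with_whitelist:
    print("whitelisted:  ", row)
```

With an empty whitelist the output matches the Recorded columns of the table. With both values whitelisted, every hour records Value A and Value B regardless of the Detected # count.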
Whether you use Whitelist Only or Whitelist + Observed Values, it is important to review and update your whitelists regularly to maintain data integrity and to limit the volume of captured data.