Example: Persistent derived field

You might have a Select process configured to choose IDs based on a constraint on a derived field, connected to a Snapshot process to output the selected records that include that derived field. If you mark the derived field to be persistent, the calculated value is passed from the Select process to the Snapshot process.

Another use of persistent derived fields is with any aggregate-type derived field (for example, AVG or GROUPBY). These aggregated fields are calculated based on the multiple rows of data within the current cell, so the value of these aggregated fields changes as the contents of the cell change. With persistent derived fields, you can choose to keep the original calculated value, then carry that to other processes. If you choose to recalculate the derived field instead, you get a calculated value based on the remaining records in the current cell.

When a process takes multiple inputs, such as a Snapshot process working with the input from two Select processes, all persistent derived fields are available to the downstream process.

If a persistent derived field is not available across all incoming Select processes, and it is included in the output for a Snapshot process, then the Snapshot process displays a NULL value for that persistent derived field in all the output rows from the Select processes that did not have that persistent derived field.

If a persistent derived field is not available across all incoming Select processes, and you use it to define a Segment process, then the Segment process has empty segments for the Select processes that did not have that persistent derived field.

The Segment process remains unconfigured if you try to define a segment with an expression using more than one persistent derived field not available across all the Select processes.

The following guidelines apply to persistent derived fields (PDFs):

PDFs attach to an inbound cell (vector)
PDFs are calculated before query execution
Multiple PDFs are available in the following processes:
- Snapshot: If a PDF is not defined for a cell, its value = NULL. If a single ID is greater than one cell, one row is output for each cell.
- Segment: PDFs are not available for segmentation by field when multiple input cells are selected. PDFs must exist in all selected input cells for use in a segment by query.
PDFs keep only a single value (selected at random) per ID value, regardless of the number of times an ID value occurs in the data. Thus, when the output includes no table fields (and includes an ID), there will be only one record per ID value.
However, when you use a derived field based on a table field, the output includes a table field indirectly. Thus, there will be a record for each instance of an ID value. (In other words, if the ID value occurs seven times in the data, there will be seven records output.)

Persistent derived fields only store a single value for each audience ID, which is randomly selected from the available values. This means that when working with unnormalized data, you must use a GROUPBY macro function to achieve the desired behavior.

For example, say you want to find from the purchase transaction table, the highest dollar amount in a single transaction a customer has made and save this as a persistent derived field for downstream processing. You could write a derived field (and then persist it as a persistent derived field) as follows:

Highest_purchase_amount = groupby(CID, maxof, Purch_Amt)

Against unnormalized purchase transaction data such as the following, this would compute as follows:


CID	DATE	PURCH_AMT	HIGHEST_PURCHASE_AMOUNT
A	1/1/2007	$200	$300
A	3/15/2007	$100	$300
A	4/30/2007	$300	$300

When the derived field is persisted, it choose (randomly) any value (which are all $300) and persists the value $300 for customer A.

A second less obvious example might be to select a predictive model score from a scoring table for a specific model X. Here the derived field might look like this:

ModelX_score = groupby(CID, maxof, if(Model = 'X', 1, 0), Score)

And the data might look like:


CID	MODEL	SCORE	MODELX_SCORE
A	A	57	80
A	B	72	80
A	X	80	80

Persisting the derived field, ModelX_Score, gives the desired result of the score value of 80. It is incorrect to create a derived field:

Bad_ModelX_score = if(Model = 'X', Score, NULL)

This would result in the following:


CID	MODEL	SCORE	BAD_MODELX_SCORE
A	A	57	NULL
A	B	72	NULL
A	X	80	80

Then when you persist the derived field Bad_ModelX_score, the persisted value could be NULL or 80. If you are working with unnormalized data and the derived field values are not all the same, persisting that derived field could result in any of the values being returned. For example, defining Derived_field_Score = SCORE and persisting it could result in the value 57, 72, or 80 for customer A. To ensure desired behavior, you must use the GROUPBY macro over the customer ID and guarantee the derived field value is the same for all data for that customer.