Regular expressions in Opportunity Detect

In Opportunity Detect, you use regular expressions in two situations.

  • When you create a Real time file connector in the Server Groups page, you use a regular expression to match the pattern used in file names.
  • When you create a Boolean expression in a component and you select the Like operator, you can use a regular expression to set the criteria for comparison.

Opportunity Detect uses the Streams standard toolkit for matching regular expressions. Opportunity Detect supports the POSIX extended regular expressions standard.

The regular expression must conform to the Streams Processing Language requirements, described here: https://www-01.ibm.com/support/knowledgecenter/SSCRJU_3.2.0/com.ibm.swg.im.infosphere.streams.spl-language-specification.doc/doc/primitivetypes.html

Take care that the pattern you specify matches exactly what you intend. Always test your patterns to verify that they match the required strings and reject everything else. You can design patterns by trial and error, starting with a simple pattern and changing it bit by bit until it produces the required result. Pay particular attention to escaping backslashes.
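One way to follow this trial-and-error approach is to check a candidate pattern against strings that should and should not match before you use it in Opportunity Detect. The following sketch uses Python's re module only as a convenient stand-in. It is not the Streams engine, and its syntax differs from POSIX extended regular expressions in places, so treat a passing check as a first sanity test only. The helper name check_pattern and the sample strings are hypothetical.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return the sample strings that behave differently from what was intended."""
    compiled = re.compile(pattern)
    failures = []
    for text in should_match:
        if compiled.fullmatch(text) is None:
            failures.append(text)
    for text in should_not_match:
        if compiled.fullmatch(text) is not None:
            failures.append(text)
    return failures

# An unescaped period matches any character, so this pattern is too permissive.
loose = check_pattern(r"report.txt", ["report.txt"], ["reportXtxt"])

# Escaping the period with a backslash restricts it to a literal period.
strict = check_pattern(r"report\.txt", ["report.txt"], ["reportXtxt"])

print(loose)   # the strings that did not behave as intended
print(strict)  # an empty list means every sample behaved as intended
```

An empty result means that every sample behaved as intended. Note that a product field that stores the pattern as a string literal might require the backslash itself to be escaped again.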

Special characters

Here is a summary of special character usage in POSIX regular expressions.

  • Period (.) : Matches any single character.
  • Anchors (^, $) : The caret (^) anchor matches at the start of the string, and the dollar sign ($) anchor matches at the end of the string.
  • Asterisk (*) : A quantifier that matches the preceding character or group zero or more times.
  • Plus (+) : A quantifier that matches the preceding character or group one or more times.
  • Question mark (?) : A quantifier that makes the preceding character or group optional (zero or one occurrence).
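The special characters listed above can be tried out with a few one-line checks. This is a minimal sketch that assumes Python's re module behaves like the POSIX engine for these basic constructs. The helper name matches and the sample strings are hypothetical.

```python
import re

def matches(pattern, text):
    """True when the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

period = matches(r"a.c", "axc")                     # period stands for any one character
anchored_start = matches(r"^trans", "transaction")  # ^ ties the match to the start
anchored_end = matches(r"trans$", "transaction")    # $ ties the match to the end, so no match here
star = matches(r"^ab*c$", "ac")                     # * allows zero occurrences of b
plus = matches(r"^ab+c$", "ac")                     # + requires at least one b, so no match here
optional = matches(r"^ab?c$", "abc")                # ? allows zero or one b

print(period, anchored_start, anchored_end, star, plus, optional)
```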

Bracket expressions

A bracket expression represents a class of characters, any one of which can match a single character. For example, [a-c] is a bracket expression that matches any one of the characters a, b, or c. The regular expression [a-c]+ matches aaa, abc, ca, and any other sequence of one or more characters from the set a, b, and c.

There are other forms of bracket expressions. For example, [a-c] can also be written as [abc]. A bracket expression can also contain collating elements, which have the form [.col.]. A collating element is a character or group of characters that acts as a single character in a bracket expression. For example, if [.ae.] is a collating element, it can be used within the bracket expression [[.ae.]bc], which matches any one of "ae", b, or c. In other words, it forces ae to be treated as a single character.
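The range form and the list form of a bracket expression can be compared directly. This sketch uses Python's re module as a stand-in for the POSIX engine. Collating elements such as [.ae.] are left out because Python's re module does not support them. The sample strings are hypothetical.

```python
import re

samples = ["aaa", "abc", "ca", "abd", ""]

# Both forms accept the same strings: one or more characters from a, b, c.
range_form = [s for s in samples if re.fullmatch(r"[a-c]+", s)]
list_form = [s for s in samples if re.fullmatch(r"[abc]+", s)]

# When searching inside a longer string, the match covers only the run of a, b, c.
found = re.search(r"[a-c]+", "xyzabd").group()

print(range_form)
print(list_form)
print(found)
```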

Table 1. Character classes
POSIX         Description                                                         ASCII
[[:alnum:]]   Alphanumeric characters                                             [a-zA-Z0-9]
[[:alpha:]]   Alphabetic characters                                               [a-zA-Z]
[[:blank:]]   Space and tab                                                       [ \t]
[[:cntrl:]]   Control characters                                                  [\x00-\x1F\x7F]
[[:digit:]]   Digits                                                              [0-9]
[[:graph:]]   Visible characters (anything except spaces and control characters)  [\x21-\x7E]
[[:lower:]]   Lowercase characters                                                [a-z]
[[:print:]]   Visible characters and spaces (anything except control characters)  [\x20-\x7E]
[[:punct:]]   Punctuation and symbols                                             [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
[[:space:]]   All whitespace characters, including line breaks                    [ \t\r\n\v\f]
[[:upper:]]   Uppercase letters                                                   [A-Z]
[[:xdigit:]]  Hexadecimal digits                                                  [A-Fa-f0-9]
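Some tools that you might use to pre-test a pattern do not understand POSIX class names. Python's re module is one of them. As an illustration of the ASCII column in Table 1, the following sketch rewrites a few class names into their ASCII ranges before testing. The names POSIX_CLASSES and expand_posix_classes are hypothetical, only a subset of the table is covered, and the equivalence holds for ASCII input only.

```python
import re

# ASCII equivalents from Table 1, written as the inside of a bracket expression.
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "A-Fa-f0-9",
}

def expand_posix_classes(pattern):
    """Replace each [:name:] with its ASCII range; unknown names raise KeyError."""
    def substitute(match):
        return POSIX_CLASSES[match.group(1)]
    return re.sub(r"\[:([a-z]+):\]", substitute, pattern)

expanded = expand_posix_classes(r"[[:upper:]][[:lower:][:digit:]]+")
accepts = re.fullmatch(expanded, "Detect2") is not None
rejects = re.fullmatch(expanded, "detect") is None

print(expanded)
print(accepts, rejects)
```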

Quantification

The question mark makes the preceding token in the regular expression optional. For example, colou?r matches both colour and color.

The star (*) tells the engine to attempt to match the preceding token zero or more times. The plus sign (+) tells the engine to attempt to match the preceding token one or more times.

An additional quantifier allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is zero or a positive integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite.

For example:
  • {0,1} is the same as ?
  • {0,} is the same as *
  • {1,} is the same as +
Omitting both the comma and max tells the engine to repeat the token exactly min times.
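The equivalences above can be confirmed by checking that each pair of patterns accepts the same strings. This is a minimal sketch that uses Python's re module as a stand-in for the POSIX engine. The sample strings and the helper name accepted are hypothetical.

```python
import re

EQUIVALENT = [("ab{0,1}c", "ab?c"), ("ab{0,}c", "ab*c"), ("ab{1,}c", "ab+c")]
samples = ["ac", "abc", "abbc", "abbbc"]

def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

# True for a pair when the brace form and the symbol form accept the same samples.
results = {braces: accepted(braces) == accepted(symbol) for braces, symbol in EQUIVALENT}

# With no comma and no max, the token must repeat exactly min times.
exactly_two = accepted("ab{2}c")

print(results)
print(exactly_two)
```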

You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.
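The effect of the word boundaries is easiest to see on a string that contains numbers just inside and just outside each range. The sketch below uses Python's re module, where \b is a word boundary. Whether another engine treats \b in the same way is an assumption to verify. The sample text is hypothetical.

```python
import re

four_digits = re.compile(r"\b[1-9][0-9]{3}\b")
three_to_five = re.compile(r"\b[1-9][0-9]{2,4}\b")

text = "ids: 999 1000 9999 10000 0123 99999 100000"

# Without the boundaries, parts of longer numbers such as 10000 would also match.
found_four = four_digits.findall(text)
found_range = three_to_five.findall(text)

print(found_four)
print(found_range)
```

The number 0123 is rejected by both patterns because the first digit must be in the range 1 to 9.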

Grouping

An expression enclosed in parentheses (round brackets) is treated as a single unit, in the same way as a regular expression that matches a single character. That is, quantifiers and other rules apply to the group in the parentheses as a whole.
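To illustrate, compare a quantifier applied to a group with the same quantifier applied to a single character. This sketch uses Python's re module as a stand-in for the POSIX engine, and the sample strings are hypothetical.

```python
import re

samples = ["ab", "abab", "abb", "ababab"]

# With parentheses, + repeats the whole group ab.
grouped = [s for s in samples if re.fullmatch(r"(ab)+", s)]

# Without parentheses, + repeats only the preceding character b.
ungrouped = [s for s in samples if re.fullmatch(r"ab+", s)]

print(grouped)
print(ungrouped)
```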

Alternation

Two regular expressions separated by the special character vertical-line ( '|' ) match a string that is matched by either.

For example, the regular expression "a((bc)|d)" matches the string "abc" and the string "ad".

Alternatives that are separated by the vertical bar and enclosed in parentheses are treated as a single unit, so a quantifier that follows the parentheses applies to the whole group.
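The alternation example can be checked against a few strings, including some that should be rejected. This is a minimal sketch that uses Python's re module as a stand-in for the POSIX engine. The sample strings other than abc and ad are hypothetical.

```python
import re

pattern = re.compile(r"a((bc)|d)")
samples = ["abc", "ad", "ab", "abcd", "a"]

# fullmatch requires the whole string to be a followed by either bc or d.
matched = [s for s in samples if pattern.fullmatch(s)]

print(matched)
```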

Example for file name matching

You might create the following regular expression to match the timestamp-suffixed file names that are used with the Real time file connector.

Detect\.a\.trans\.[0-9]{8,14}

This expression matches file names that start with the common prefix Detect.a.trans. and end with a run of 8 to 14 timestamp digits. The range is used because file names can have 8 digits for the basic date (4 for the year, 2 for the month, 2 for the day) and up to 6 extra digits for a more granular timestamp (hhmmss).

For example, the expression matches the following file names:

Detect.a.trans.20100901
Detect.a.trans.20100908
Detect.a.trans.20100922
Detect.a.trans.20101001
Detect.a.trans.20101008
Detect.a.trans.20101022
Detect.a.trans.20101201
Detect.a.trans.20101208
Detect.a.trans.20101222
Detect.a.trans.20101223040506
Detect.a.trans.20101223033240
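
Before you configure the connector, you can run the expression against a handful of file names that should be accepted and rejected. The sketch below uses Python's re module as a stand-in for the Streams engine. It assumes that the pattern must cover the whole file name, which you should verify for the connector. The rejected names are hypothetical.

```python
import re

pattern = re.compile(r"Detect\.a\.trans\.[0-9]{8,14}")

names = [
    "Detect.a.trans.20100901",             # date only
    "Detect.a.trans.20101223040506",       # date and time
    "Detect.a.trans.2010",                 # too few digits
    "Detect-a-trans-20100901",             # separators are not periods
    "Detect.a.trans.201012230405061234",   # more than 14 digits
]

# Whole-name matching: the pattern must account for every character.
accepted = [n for n in names if pattern.fullmatch(n)]

# Substring matching would still find the pattern inside the over-long name.
found_in_long_name = pattern.search(names[4]) is not None

print(accepted)
print(found_in_long_name)
```

If the engine matches substrings rather than whole names, adding the ^ and $ anchors to the pattern keeps the upper limit of 14 digits effective.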

Useful links for POSIX regular expressions