Purpose #
Patterns are used for matching text when the exact text to identify is not known in advance. If you know the exact text you are looking for, you can use a simple string comparison such as is equal to. When the text you are interested in comes in several variations (e.g. with optional parts), a simple text comparison is insufficient. This is where patterns help you identify content of interest.
Patterns are often used to identify content in the text flow of your document. You can also use them to match the value of any property or to match style names.
When you use a pattern in a rule condition’s data content test, the pattern is tested on the data content of all elements to which the rule applies. So, for paragraph-based rules, paragraph content is what the pattern is tested on. If you use patterns within a contains element test, then the data content is the content of the contained element.
Simple String Pattern #
Alternatives in Patterns #
A pattern can be used to match one of several specified alternatives. The pipe symbol | is used to separate the alternatives. For example, cat|dog will match either of the inputs cat or dog. Note that the pattern will not match catdog; it will only match one of the alternatives.
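The behaviour of alternatives can be tried out with a minimal Python sketch. This is only an approximation: Python's `re` module is a different regular expression engine, and `re.fullmatch` is used here on the assumption that a pattern has to match the whole content. The `matches` helper is a hypothetical name introduced for illustration.

```python
import re

# Assumption: re.fullmatch stands in for a pattern that must match the whole content.
ALTERNATIVES = re.compile(r"cat|dog")


def matches(text: str) -> bool:
    """Return True when the whole text is exactly one of the alternatives."""
    return ALTERNATIVES.fullmatch(text) is not None


for text in ("cat", "dog", "catdog"):
    print(text, matches(text))
```

Under these assumptions, cat and dog are accepted and catdog is rejected.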
Quantifiers #
Quantifiers are another mechanism for broadening the range of content identified by a pattern. Quantifiers allow you to identify repetition of content. The simplest case allows 0 or 1 repetitions, which makes the content optional. Here is a complete list of quantifiers.
Quantifier | Description | Example |
---|---|---|
? | the content is optional and may appear 0 or 1 times | A? matches only the empty string or the string A |
* | the content may appear 0 or more times | A* matches the empty string, or the strings A, AA, AAA etc. |
+ | the content may appear 1 or more times | A+ matches the strings A, AA, AAA etc. |
{n,m} | the content must appear at least n times and at most m times | A{2,4} matches only the strings AA, AAA, AAAA |
{n} | the content must appear exactly n times | A{4} matches only the string AAAA |
{n,} | the content must appear at least n times | A{2,} matches the strings AA, AAA, AAAA etc. |
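The examples in the table can be checked with a short Python sketch. It assumes Python's `re` module behaves the same way for these quantifiers and that the whole content must match, which is why `re.fullmatch` is used. The `accepted` helper and the candidate list are hypothetical.

```python
import re


def accepted(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


# Strings of zero to five A characters.
candidates = ["", "A", "AA", "AAA", "AAAA", "AAAAA"]

for pattern in ("A?", "A*", "A+", "A{2,4}", "A{4}", "A{2,}"):
    print(pattern, accepted(pattern, candidates))
```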
Quantifier | Description |
---|---|
?? | non-greedy version of ? |
*? | non-greedy version of * |
+? | non-greedy version of + |
Example: greedy vs non-greedy patterns
Pattern | Input | Explanation |
---|---|---|
(.*)(\d+) | abc123 | successful match, but the first capture contains abc12 and the second 3 |
(.*?)(\d+) | abc123 | successful match, but the first capture contains abc and the second 123 |
(1?)(\d+) | 123 | successful match, but the first capture contains 1 and the second 23 |
(1??)(\d+) | 123 | successful match, but the first capture contains nothing and the second 123 |
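The four rows above can be reproduced with a small Python sketch. Python's `re` module is assumed to share the greedy and non-greedy semantics described here; the `captures` helper is a hypothetical name, and an empty string plays the role of a capture that "contains nothing".

```python
import re


def captures(pattern: str, text: str):
    """Return the captured groups when the pattern matches the whole text."""
    match = re.fullmatch(pattern, text)
    return match.groups() if match else None


print(captures(r"(.*)(\d+)", "abc123"))   # greedy .* leaves a single digit
print(captures(r"(.*?)(\d+)", "abc123"))  # non-greedy .*? leaves all digits
print(captures(r"(1?)(\d+)", "123"))      # greedy ? takes the leading 1
print(captures(r"(1??)(\d+)", "123"))     # non-greedy ?? takes nothing
```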
Example: unanchored patterns
Pattern | Explanation |
---|---|
.*Ice Nine.* | matches content that contains the string “Ice Nine” |
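The example above suggests that a pattern is compared against the whole content, so surrounding text has to be allowed for explicitly. The following Python sketch illustrates the difference under that assumption, with `re.fullmatch` standing in for the whole-content comparison. The sample sentence and the `matches` helper are made up for illustration.

```python
import re


def matches(pattern: str, content: str) -> bool:
    """Return True when the pattern matches the content from start to end."""
    return re.fullmatch(pattern, content) is not None


content = "The sample labelled Ice Nine was stored separately."

print(matches(r"Ice Nine", content))      # the bare string does not cover the whole content
print(matches(r".*Ice Nine.*", content))  # leading and trailing .* absorb the rest
```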
Character Classes #
A character class identifies a set of characters that a pattern should match. There are a few possibilities.
Description | Example |
---|---|
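explicit list of the characters to match | [aeiou] will match content consisting of a single lower case vowel |
range of characters to match | [0-9] will match any (base 10) digit |
complement of character class | [^aeiou] will match any single character that is not a lower case vowel (consonants, but also digits, punctuation and upper case letters) |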
combination of two ranges | [a-zA-Z] will match any upper or lower case letter |
combination of a range and explicit characters | [_:a-z] will match underscore, colon or any lower case letter |
character class subtraction | [\S-[:-]] will match any non-white space character except for colons and dashes |
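A few of these classes can be explored with a Python sketch. Two assumptions apply: `re.fullmatch` stands in for whole-content matching, and because Python's `re` module does not support character class subtraction, the last class is rewritten as the equivalent complement `[^\s:\-]`. The `matches` helper is hypothetical.

```python
import re


def matches(pattern: str, content: str) -> bool:
    """Return True when the pattern matches the whole content."""
    return re.fullmatch(pattern, content) is not None


print(matches(r"[aeiou]", "e"))     # a single lower case vowel
print(matches(r"[aeiou]", "ea"))    # two characters, so no match
print(matches(r"[^aeiou]", "b"))    # a consonant
print(matches(r"[^aeiou]", "7"))    # a digit is also "not a vowel"
print(matches(r"[_:a-z]", ":"))     # explicit character next to a range
print(matches(r"[_:a-z]", "A"))     # upper case is outside the class

# Stand-in for [\S-[:-]]: any non-white-space character except colon and dash.
print(matches(r"[^\s:\-]", "x"))
print(matches(r"[^\s:\-]", ":"))
```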
There are some builtin character classes. You can use these inside a character class definition (i.e. inside the square brackets) or outside.
Builtin Class | Description |
---|---|
\n | new line character (#xA) |
\r | carriage return character (#xD) |
\t | tab character (#x9) |
. | anything except a newline or carriage return (i.e. [^\n\r]) |
\s | space, tab, newline or carriage return (i.e. [#x20\t\n\r]) |
\S | non-space character (i.e. [^\s]) |
\i | a letter, underscore or colon |
\I | not a letter, underscore or colon (i.e. [^\i]) |
\d | same as [0-9] |
\D | same as [^\d] |
\w | common characters found in words, excludes punctuation and other separators |
\W | same as [^\w] |
Example: character class
Pattern | Explanation |
---|---|
ABC\tDEF | matches content that starts with ABC, is followed by tab, and then ends with DEF |
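Some of the built-in classes can be tried in Python as a rough approximation. The sketch assumes `re.fullmatch` for whole-content matching and uses the `re.ASCII` flag so that `\d`, `\s` and `\w` stay close to the narrow definitions in the table. The `\i` and `\I` classes have no counterpart in Python's `re` module and are left out. The `matches` helper is hypothetical.

```python
import re


def matches(pattern: str, content: str) -> bool:
    """Return True when the pattern matches the whole content (ASCII classes)."""
    return re.fullmatch(pattern, content, flags=re.ASCII) is not None


print(matches(r"ABC\tDEF", "ABC\tDEF"))   # a tab separates the two halves
print(matches(r"ABC\tDEF", "ABC DEF"))    # a space is not a tab
print(matches(r"ABC\sDEF", "ABC DEF"))    # \s accepts a space as well as a tab
print(matches(r"\d+\s\w+", "42 apples"))  # digits, one space, word characters
print(matches(r"\D+", "apples"))          # no digits anywhere
```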
Grouping #
It is sometimes necessary to group contiguous parts of your pattern. For example, if a quantifier is intended to apply to more than one part of your pattern, you need to group these parts. This is done using parentheses.
Example: grouping
Pattern | Explanation |
---|---|
([A-Z][a-z]*)+ | Matches camel-cased strings (e.g. ThisIsAVeryLongIdentifier). Note that the * applies just to the character class for lower case letters, but the + applies to the combination of the two grouped character classes. |
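The effect of the group can be seen by comparing the pattern with and without the parentheses. This is a Python sketch under the assumption that `re.fullmatch` approximates whole-content matching; the `matches` helper is hypothetical.

```python
import re


def matches(pattern: str, content: str) -> bool:
    """Return True when the pattern matches the whole content."""
    return re.fullmatch(pattern, content) is not None


identifier = "ThisIsAVeryLongIdentifier"

# With the group, + repeats the "capital followed by lower case letters" unit.
print(matches(r"([A-Z][a-z]*)+", identifier))

# Without the group, only a single capitalised word is accepted.
print(matches(r"[A-Z][a-z]*", identifier))
print(matches(r"[A-Z][a-z]*", "This"))

# Content that starts with a lower case letter is rejected.
print(matches(r"([A-Z][a-z]*)+", "thisIsNotAccepted"))
```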
Metacharacters #
As we have seen, some characters have special meaning when constructing patterns. If you want to refer to these characters literally, you need to escape them with a backslash. So, if you’d like to match a question mark, you must type \?. Here is the full list of metacharacters:
. \ ? * + { } ( ) [ ] |
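When literal text comes from somewhere else, the escaping can be done mechanically. The sketch below is a hypothetical helper, not part of the product: `escape_literal` puts a backslash in front of each metacharacter listed above, plus the pipe symbol, which is included because it separates alternatives. Python's `re.fullmatch` is used only to confirm that the escaped text matches itself.

```python
import re

# Metacharacters from the list above, plus the pipe used for alternatives.
METACHARACTERS = set(".\\?*+{}()[]|")


def escape_literal(text: str) -> str:
    """Prefix every metacharacter in text with a backslash."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)


for literal in ("Why?", "(a+b)", "1.5|2"):
    pattern = escape_literal(literal)
    # The escaped pattern should match exactly the literal text it came from.
    print(literal, pattern, re.fullmatch(pattern, literal) is not None)
```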
Anonymous Captures #
Patterns are useful for identifying relevant content. They can also be used to pick out some of the interesting parts for later use. This ability to tease apart content is very powerful when it comes to creating your rules. In Migrate, these content references can be used in annotation arguments.
Anonymous captures are created by surrounding parts of your patterns in parentheses. You can then reference the matched portions with the backslash notation: \1 for the first capture, \2 for the second, and so on. The captures are counted left-to-right within a pattern, and top-down if you have more than one pattern in your rule. It is the placement of these parentheses in your patterns that permits the matched content to be referenced in this way. This is why they are called captures — you are capturing content.
Example: anonymous captures
Pattern | Use | Explanation |
---|---|---|
(.*)\.tiff | set-attribute(src=\1.jpg) | Change the extension of .tiff images to .jpg. |
Version:\s+(\d+)\.(\d+) | p.map.product-version(version=\1;release=\2) | Extract the major and minor numbers from a version string in order to populate map metadata. |
\((\d{3})\)\s*\d{3}-\d{4} | prolog.meta.othermeta((area code)(\1)) | Extract the area code of a phone number and place it in prolog metadata. Note that in this example the pattern had to match literal parentheses. These have been escaped by a backslash. When escaped, the parentheses lose their special meaning for grouping and indicating captures. The unescaped parentheses inside them create the capture that \1 refers to. |
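The way captures feed into \1 and \2 references can be approximated in Python, whose `re` module uses the same backslash notation in replacement templates. This is a sketch only: the `rewrite` helper and the sample inputs are hypothetical, the templates show just the argument text rather than real annotations, and the phone pattern wraps the three digits in unescaped parentheses inside the escaped ones so that \1 has a capture to refer to.

```python
import re


def rewrite(pattern: str, template: str, content: str):
    """Fill the template with captures when the pattern matches the whole content."""
    match = re.fullmatch(pattern, content)
    return match.expand(template) if match else None


# Change the extension of an image file name.
print(rewrite(r"(.*)\.tiff", r"\1.jpg", "diagram.tiff"))

# Pick the major and minor numbers out of a version string.
print(rewrite(r"Version:\s+(\d+)\.(\d+)", r"version=\1;release=\2", "Version: 4.2"))

# Escaped parentheses match literally; the inner unescaped pair captures.
print(rewrite(r"\((\d{3})\)\s*\d{3}-\d{4}", r"\1", "(613) 555-0123"))
```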
Named Captures #
If you like, you can name your captures in your patterns. Doing so means that you won’t have to worry about counting your groups. This is also more robust because the count can be thrown off if you add a capture to the rule at some later time. A meaningful name also indicates the purpose of the capture more clearly to others working with the rule set.
You name a capture by placing its name in curly braces immediately after the opening parenthesis of the group. You reference the capture with a backslash followed by the capture name in braces.
Example: named captures
Pattern | Use |
---|---|
Version:\s+({major}\d+)\.({minor}\d+) | p.map.product-version(version=\{major};release=\{minor}) |
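Python's `re` module writes named groups differently, as `(?P<name>...)` in the pattern and `\g<name>` in the template, so the notation above cannot be used there directly. The following sketch translates between the two notations in order to demonstrate the idea. The helpers `to_python_pattern` and `to_python_template` are hypothetical and deliberately naive: they assume names consist of word characters only and do not check whether the opening parenthesis is itself escaped.

```python
import re


def to_python_pattern(pattern: str) -> str:
    """Rewrite ({name}...) groups as Python's (?P<name>...) groups."""
    return re.sub(r"\(\{(\w+)\}", r"(?P<\1>", pattern)


def to_python_template(template: str) -> str:
    """Rewrite \\{name} references as Python's \\g<name> references."""
    return re.sub(r"\\\{(\w+)\}", r"\\g<\1>", template)


pattern = to_python_pattern(r"Version:\s+({major}\d+)\.({minor}\d+)")
template = to_python_template(r"version=\{major};release=\{minor}")

match = re.fullmatch(pattern, "Version: 4.2")
print(match.groupdict())
print(match.expand(template))
```

Because the captures are looked up by name, adding another group to the pattern later does not change what major and minor refer to.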
Character Class Escapes #
See the W3C XML Schema Datatypes specification for more details on character class escapes.