This paper outlines some principles and techniques for writing more effective and efficient patterns in OmniMark. If you’re not yet, then by the end of this paper, you’ll be an OmniMark pattern guru.
Writing better patterns addresses two goals: making the patterns do more for you, and getting them to run fast. The former is always more important than the latter — there’s no point having a program running fast if it isn’t doing what you want it to — but they are not incompatible, and very often more effective patterns will run faster than less effective ones, because they are more to the point.
There are three main principles discussed in this paper:
- Fail Fast — write patterns that don’t waste time,
- Succeed Slow — write patterns that do their job efficiently, and
- Divide and Conquer — build patterns to cover all the cases.
Fail Fast, Succeed Slow
All OmniMark programmers, at one time or another, ask themselves if their “find” rules could or should be running faster. Most of the time it doesn’t matter. If you’re not waiting around for your OmniMark program to run, there isn’t a problem. But if your thumbs are getting tired of twiddling, it might be time to take a look at your find rules.
Perhaps the most important thing that should be kept in mind about using find rules is that they spend most of their time failing to produce results. At most, only one find rule will end up capturing a piece of text, so all the rules ahead of it are going to fail. This leads to the first principle of writing find rules: “fail fast” — or, spend as little time as possible on find rules that are failing.
Most find rules already fail quite fast; the first character in the find rule isn’t the one being looked at, most of the time, and OmniMark takes advantage of this in the way it chooses the find rules to look at. One thing that you should avoid, if you can, is a find rule that starts with any, and which usually fails later in the pattern — i.e. an any match that isn’t a “catch-all” at the bottom of the program.
The second principle of writing find rules is “succeed slow” — once you are in a find rule, or in a repeat scan match, pick up as much with it as you can. It is often the case in word processor formats, for example, that commands come in bunches. So pick up bunches, not just single commands. This cuts down the number of find rules that are performed.
With these two principles in mind, let’s visit an old friend:
find any
This little rule ends up sitting at the bottom of many an OmniMark program. It can sit by itself as above, in which case it means “please throw away anything you haven’t yet recognized.” Or it can do something simple, as in the following, where it copies the otherwise unrecognized character to the current output:
find any => one-character output one-character
In both cases, this little rule usually provides an excellent opportunity to speed up the OmniMark program. First of all, on the “fail fast, succeed slow” principles, it makes sense to pick up as long a run of characters as possible that will not be recognized by any other rule.
What these characters are depends very much on other find rules at work. For example, if other rules recognized text starting with any of the seven-bit ASCII graphic characters, then a rule such as:
find [\ " " to "~"]+
will “fail fast,” if the character being looked at is a seven-bit ASCII graphic, as well as “succeed slow,” because it will spend the time it takes to find all the following characters. And the nice thing about such a rule is that it can be put anywhere in the program — it’s not getting in the way of other find rules.
Just be careful to pick up any leftover characters that are sometimes, but not always picked up by other find rules. Leaving the good old “find any” rule in the program, after the “gobbler,” does this, or it can be combined as follows:
find [\ " " to "~"]+ | any
Note that the any doesn’t have a “+” sign. If it did, it would consume the rest of your input, without giving the other find rules a chance to examine it.
Finally, remember that OmniMark copies characters that are unrecognized by any find rule to the current output, and does it in a very efficient way. So if you simply want all otherwise unrecognized characters copied, don’t do the copying yourself — not doing anything is the best way to fail.
Alternatively, if you want to discard all characters not otherwise captured by find rules, you should make your default output be #suppress, and explicitly output to #main-output or wherever the output is going. Doing so makes unmatched characters go to #suppress — efficiently discarding them.
Divide and Conquer
A common need in pattern matching is to pick up everything up to, but not including, a known closing delimiter. For example, a match part that picks up the text in a quoted literal can do so as follows, with a quote character ending the picked-up text:
match "%"" [\ "%""]* => text "%""
For other than single-character closing delimiters, a simple “any except” character set doesn’t do the job. The following shows why:
match "--" [\ "-"]* => text "--"
The problem is that a single “-“, not part of a “–“, will terminate the matching done by the “any except” but won’t be matched by the final “–“. That’s where the lookahead used in the upto macro comes into play — it makes sure that the whole of the terminating delimiter is present.
Here’s a common solution to this problem, using the handy lookahead:
match "--" ((lookahead not "--") any)* "--"
What’s going on here is that the pattern repeatedly “looks ahead” to make sure that the terminating condition has not yet been met, and if it hasn’t, consumes another character. When the termination condition is found, the “*” loop exits and finds the following match, the final “–“.
The lookahead formulation does a good job of picking up text. But for those wanting to “fail fast, succeed slow”, it’s really unsatisfactory, because it examines every character in the text twice — once in the lookahead to ensure it’s not part of the closing delimiter, and a second time in the any gobbler. A better approach is one that “divides and conquers” — examining only once any character that isn’t a “-” using the “any except” form, and only doing a lookahead when a “-” is encountered:
match "--" ([\ "-"]+ | "-" lookahead not "-")* => text "--"
The divide and conquer approach to writing patterns comes in handy even when lookahead isn’t required. For example, the following match part picks up a C-like string, gobbling everything but quotes and “\” quickly, and handling “\” separately (so that a quote can be put in the text using “\””):
match "%"" ([\ "%"\"]+ | "\" any)* => text "%""
When the patterns start to become large and more complex, divide and conquer is the real winner. Here’s how to write a divide and conquer pattern in general:
- For each character that is only sometimes matched, based on the character or sequence of characters that may precede it (such as characters preceded by “\” in C-like strings), construct an alternative for all possibilities starting with the first character of the preceding characters. For example:
"\" any ; for escaped characters in C-like strings "%" any ; for escaped characters in OmniMark strings
- For each character that is only sometimes matched, based on the character or sequence of characters that may follow it (such as the “-” possibly followed by another “-” in XML-like comments), construct an alternative that matches that character with a “lookahead” or “lookahead not” that excludes the non-matching cases. For example:
"<" lookahead not [letter | "/!?"] ; for illegal uses of "<" in XML
- Pick out all the characters that must always be matched, but are not matched by one of the previously constructed alternatives, and match them using a character set matcher with a “+” on it. For example:
[" " to "~" \ "%"\"]+ ; for what's allowed "as is" in C strings [\ "%"%%"]+ ; for what's allowed "as is" in OmniMark strings
- Take the partial patterns from the first three steps, connect them with “|” (or), putting the most likely alternatives first (for speed only). For example:
([" " to "~" \ "%"\"]+ | "\" any) ; for C string text
- Append an “*” (repeat zero or more times) or a “+” (repeat one or more times) to the connected partial patterns, depending on whether or not the text as a whole can consist of zero characters. The result will look something like:
([" " to "~" \ "%"\"]+ | "\" any)*
- Recursively apply divide and conquer to any of the constructed alternatives that itself matches delimited text.
As an example of divide and conquer, here’s a find rule that matches an XML start tag:
find "<" (letter [letter | digit | ".-_"]*) => element-name ([\ "%"'<>/"]+ | "%"" [\ "%""]* "%"" | "'" [\ "'"]* "'")* => attributes ("/" => empty-element)? ">"?
The core of the pattern (that matches the attributes) is a three-way alternative that picks up everything except a quote or apostrophe, and then tries each of the two types of quoted text. (Quoted text generally needs to be specially recognized because it can contain things that would be recognized as delimiters outside of quotes.)
** and ++
If you’ve used the “**” and “++” pattern operators introduced in OmniMark version 6, you’ll maybe be wondering where they fit into all of this. What “**” and “++” do is take the principles discussed above and apply them in a number of common and useful cases.
“**” and “++” take a (preceding) character set and (following) pattern, and match everything within that character set up to and including the pattern. (They differ only in that “++” fails if it does not encounter at least one character prior to the pattern.) For example, a convenient way of writing the XML comment matcher shown earlier in this paper is:
match "--" any** "--"
This is certainly shorter than the previous formulations. More importantly, it is easier to read and it runs efficiently.
OmniMark uses the divide and conquer principle on the pattern following “**” or “++” and builds a loop that only stops when it needs to look further, in the same way that the divide and conquer rewrite did. It saves the programmer the trouble, and is able to do things that can be hard for the programmer.
Although they’re very useful, and should get a lot of use, “**” and “++” don’t deal with all pattern matching problems. It’s good for the programmer to understand the principles described in this paper when they have to be applied explicitly.