Breaking lines on your own with OmniMark

By Jacques Légaré, Senior Software Developer and Mario Blažević, Senior Software Developer

1. Motivation

OmniMark has had line-breaking functionality built-in ever since the XTRAN days. This functionality can be used to provide rudimentary text formatting capabilities. The language-level support for line-breaking is described quite thoroughly in the language documentation.

OmniMark’s language-level line-breaking support is very simple to use, and it handles well the use case where all the output of a program needs to be similarly formatted. Performance suffers, however, when line-breaking needs to be activated and deactivated at a fine-grained level. The reason for this is simple: when line-breaking is disabled (say, using the h modifier), OmniMark cannot predict when it might be reactivated. As a result, it still needs to compute possible line-breaking points, just in case. As efficient as OmniMark might be, this can cause a significant reduction in performance, sometimes by as much as 15%.

As of version 8.1.0, the OmniMark compiler can detect some cases when line-breaking functionality is not being used, and thereby optimize the resulting compiled program to bypass line-breaking computations. However, the compiler cannot make this determination in general: this is an undecidable problem. For instance, consider the following somewhat contrived example:

replacement-break " " "%n"

process
   do sgml-parse document scan #main-input
      output "%c"
   done

element #implied
   output "%c"

element "b"
   local stream s

   set s with (break-width 72) to "%c"
   output s

Note that line-breaking is only activated in the element rule for b, and so line-breaking will only be activated if the input file contains an element b. The OmniMark compiler cannot be expected to predict what the input files might contain when the program is executed!

Another issue with OmniMark’s built-in line-breaking is that it does not play well with referents. Specifically, consider the following program:

replacement-break " " "%n"

process
   local stream s

   open s as buffer with break-width 32 to 32
   using output as s
   do xml-parse scan #main-input
      output "%c"
   done
   close s

element #implied
   output "%c" || "." ||* 64

This program puts a hard limit of 32 characters on the maximum length of lines output to s. When this program is executed, a run-time error is triggered in the body of the element rule, where we attempt to output 64 periods. On the other hand, consider the following similar program:

replacement-break " " "%n"

process
   local stream s

   open s as buffer with (referents-allowed & break-width 32 to 32)
   using output as s
   do xml-parse scan #main-input
      output "%c"
   done
   close s

   set referent "a" to "." ||* 64
   output s

element #implied
   output "%c" || referent "a"

This program accomplishes virtually the same task, but instead uses a referent to output the string of periods. In this case, no run-time error is triggered: the line-breaking constraints have been silently violated.

Because of these issues, it is better to use OmniMark’s built-in line-breaking only when necessary, and to implement line-breaking with other language constructs in all other cases.

The remainder of this article discusses how to simulate line-breaking on PCDATA using string sink functions.

2. string sink functions

A string sink function is a function that can be used as the destination for strings. In a very real sense, a string sink function is the complement of a string source function, which is used as the source of strings. While a string source function outputs its strings to #current-output, a string sink function reads its strings from #current-input.

A string sink function is defined much like any other function in OmniMark, the only difference being that the return type is string sink: for example,

define string sink function
   dev-null
as
   void #current-input

This is a poor man’s #suppress, soaking up anything written to it.
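As a minimal sketch of how such a function might be used, the following program routes one piece of output to dev-null and lets another through. The process rule and the literal strings are illustrative assumptions, not part of the original article.

```omnimark
define string sink function
   dev-null
as
   void #current-input

process
   ; Anything written inside this scope is sent to dev-null and discarded.
   using output as dev-null
   do
      output "This text is soaked up by dev-null.%n"
   done

   ; Outside the scope, output goes to the usual destination again.
   output "This text is emitted normally.%n"
```

Only the second string should appear in the program's output, since the first is consumed by the function body.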

A string sink function can have any of the properties normally used to define functions in OmniMark: e.g., it can be overloaded, dynamic, etc. The argument list of a string sink function is unrestricted. However, in the body of a string sink function, #current-output is unattached.

The form of OmniMark’s pattern matching and markup parsing capabilities makes string sink functions particularly convenient for writing filters, taking their #current-input, processing it in some fashion, and writing the result out to some destination. However, since #current-output is unattached inside the function, we need to pass the destination as an argument. For this, we use a value string sink argument. For example, a string sink function that indents its input by a given amount might be written

define string sink function
   indent (value integer     i,
           value string sink s)
as
   using output as s
   do
      output " " ||* i
      repeat scan #current-input
      match "%n"
         output "%n" || " " ||* i
      match any-text* => t
         output t
      again
   done

The function indent could then be used like any other string sink:

; …
using output as indent (5, #current-output)
do sgml-parse document scan #current-input
   output "%c"
done

(The ability to pass #current-output as a value string sink argument is new in OmniMark 8.1.0.)

You can find out more about string sink functions in the language documentation.

3. Line-breaking in OmniMark

We can use a pair of string sink functions to simulate to some extent OmniMark’s built-in line-breaking functionality. The benefit of this approach is that it impacts the program’s performance only where it is used.

3.1. Simulating insertion-break

To simulate the effect of insertion-break on PCDATA we need to scan the input and grab as many characters as we can up to a specified width. If we encounter a newline in the process, we stop scanning. Otherwise, we output the characters we found, and append a line-breaking sequence provided by the user.

define string sink function
   insertion-break       value string      insertion
                   width value integer     target-width
                    into value string sink destination
as

We can start by sanitizing our arguments:

   assert insertion matches any ** "%n"
      message "The insertion string %"" || insertion
           || "%" does not contain a newline character %"%%n%"."

This assertion is not strictly necessary. However, OmniMark insists that the line-breaking sequence contain a line-end character, and so we do the same.

We can grab a sufficient number of characters from #current-input by using OmniMark’s counted occurrence pattern operator:

   using output as destination
   repeat scan #current-input
   match any-text{1 to target-width} => l (lookahead "%n" => n)?
      output l

The use of lookahead at the end of the pattern allows us to verify if a %n is upcoming: we should only output the line-breaking sequence if the characters we grabbed are not followed by a %n.

      output insertion
         unless n is specified
   match "%n"
      output "%n"
   again

We can then use this to break the text output from the markup parser: for example,

process
   using output as insertion-break "%n" width 20 into #current-output
   do sgml-parse document scan #main-input
      output "%c"
   done
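To make the behaviour concrete, here is a sketch that assembles the fragments of this section into one function and feeds it a literal string rather than parsed markup. The width of 4 and the input string are assumptions chosen only to keep a hand trace short.

```omnimark
; insertion-break, assembled from the fragments shown in this section.
define string sink function
   insertion-break       value string      insertion
                   width value integer     target-width
                    into value string sink destination
as
   assert insertion matches any ** "%n"
      message "The insertion string %"" || insertion
           || "%" does not contain a newline character %"%%n%"."

   using output as destination
   repeat scan #current-input
   match any-text{1 to target-width} => l (lookahead "%n" => n)?
      ; Emit up to target-width characters, then break the line
      ; unless the input already has a newline at this point.
      output l
      output insertion
         unless n is specified
   match "%n"
      output "%n"
   again

process
   ; Illustrative driver: a literal string stands in for parsed content.
   using output as insertion-break "%n" width 4 into #main-output
   do
      output "abcdefghij%nkl"
   done
```

Tracing the pattern by hand, the input should be split into the chunks abcd, efgh, ij and kl, each on its own line. The chunk ij is followed by a newline in the input, so no extra break is inserted after it. The final chunk kl is not, so the function as written appends the insertion string after it as well.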

3.2. Simulating replacement-break

Simulating insertion-break on PCDATA is straightforward, because it can insert a line-breaking sequence whenever it sees fit. On the other hand, replacement-break is slightly more complex, since it must scan its input for a breakable point. For clarity, the characters between two breakable points will be referred to as words; if the breakable points are defined by the space character, they are effectively words.

define string sink function
   replacement-break       value string      replacement
                     width value integer     target-width
                        to value integer     max-width    optional
                        at value string      original     optional initial { " " }
                      into value string sink destination
as

The argument original is used to specify the character that delimits words; the argument is optional, as a space character seems like a reasonable default. target-width specifies the desired width of the line. max-width, if specified, gives the absolute maximum acceptable line width; if a line cannot be broken within this margin, an error is thrown. Finally, the argument replacement gives the line-breaking sequence.

As before, we start by ensuring our arguments have reasonable values:

   assert length of original = 1
      message "Expecting a single character string,"
           || " but received %"" || original || "%"."
   assert replacement matches any ** "%n"
      message "The replacement string %"" || replacement
           || "%" does not contain a newline character %"%%n%"."

The second assertion is repeated from above, for the same reasons as earlier: OmniMark insists that the replacement string contain a newline, and so we will do the same. The first assertion insists that breakable points be defined by a single character; again, this is a carry-over from OmniMark’s implementation.

For replacement-break, the pattern is very different from that of insertion-break. With insertion-break, we could consume everything with a single pattern, using a counted occurrence. This does not suffice with replacement-break: rather, we have to consume words until we reach target-width.

   using output as destination
   do
      local stream line initial { "" }

The stream line will be used to accumulate text from one iteration to another.

      repeat scan #current-input
      match ((original => replaceable)? any-text
               ** lookahead (original | "%n" | value-end)) => t

The pattern in the match clause picks up individual words. If the line length is still below target-width, we can simply append the word to the current line and continue with the next iteration:

         do when length of line + length of t < target-width
            set line with append to t

If this is not the case, we can output the text we have accumulated thus far, so long as it does not surpass max-width:

         else when max-width isnt specified
                 | length of line < max-width
            output line
            output replacement
               when replaceable is specified
            set line to t drop original?

If all else fails, we could not find an acceptable breakable point in the line: OmniMark throws an error in this case, so we will do the same.

         else
            not-reached message "Exceeded maximum line width"
                             || " of %d(max-width) characters.%n"
                             || "The line is %"" || line || "%".%n"
         done

Our string sink function needs a few more lines to be complete. For one, our previous pattern does not consume any %n that it might encounter. In this case, we should flush the accumulated text, and append a %n:

      match "%n"
         output line || "%n"
         set line to ""
      again

Lastly, when the repeat scan loop finishes, there may be some text left over in line, which needs to be emitted:

      output line
   done

Just as was the case previously in Section 3.1, “Simulating insertion-break”, we can use our function to break text output from the markup parser: for example,

process
   using output as replacement-break "%n" width 10 to 15 into #main-output
   do sgml-parse document scan #main-input
      output "%c"
   done
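As with insertion-break, the fragments of this section can be assembled into one function and exercised with a literal string. The sketch below does that. The input sentence is an assumption chosen for illustration, and the function body is simply the pieces shown above put back together.

```omnimark
; replacement-break, assembled from the fragments shown in this section.
define string sink function
   replacement-break       value string      replacement
                     width value integer     target-width
                        to value integer     max-width    optional
                        at value string      original     optional initial { " " }
                      into value string sink destination
as
   assert length of original = 1
      message "Expecting a single character string,"
           || " but received %"" || original || "%"."
   assert replacement matches any ** "%n"
      message "The replacement string %"" || replacement
           || "%" does not contain a newline character %"%%n%"."

   using output as destination
   do
      ; Text accumulated for the line currently being built.
      local stream line initial { "" }

      repeat scan #current-input
      match ((original => replaceable)? any-text
               ** lookahead (original | "%n" | value-end)) => t
         do when length of line + length of t < target-width
            ; The word still fits: keep accumulating.
            set line with append to t
         else when max-width isnt specified
                 | length of line < max-width
            ; The word does not fit: flush the line and start a new one.
            output line
            output replacement
               when replaceable is specified
            set line to t drop original?
         else
            not-reached message "Exceeded maximum line width"
                             || " of %d(max-width) characters.%n"
                             || "The line is %"" || line || "%".%n"
         done
      match "%n"
         output line || "%n"
         set line to ""
      again

      ; Emit whatever is left over once the input is exhausted.
      output line
   done

process
   ; Illustrative driver: a literal string stands in for parsed content.
   using output as replacement-break "%n" width 10 to 15 into #main-output
   do
      output "the quick brown fox"
   done
```

Tracing by hand: "the" and " quick" accumulate to nine characters, which is still below the target width of 10. Adding " brown" would exceed the target, so "the quick" is flushed, the delimiting space is replaced by the line-breaking sequence, and "brown" starts the next line. The output should therefore be "the quick" on one line and "brown fox" on the next.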

4. Going further

We demonstrated in Section 1, “Motivation” that referents and line-breaking did not play well together: in fact, a referent could be used to silently violate the constraints stated by a break-width declaration. In the case of our string sink simulations, referents are a non-issue: a referent cannot be written to an internal string sink function, which effectively closes the loophole.

OmniMark’s built-in line-breaking functionality can be manipulated using the special sequences %[ and %]: by embedding one of these in a string that is output to a stream, we can activate or deactivate (respectively) line-breaking. The easiest way of achieving this effect with our string sink functions would be to add a read-only switch argument called, say, enabled, viz

define string sink function
   insertion-break         value     string      insertion
                     width value     integer     target-width
                   enabled read-only switch      enabled      optional
                      into value     string sink destination
as

and similarly for replacement-break. We could then use the value of this shelf item to dictate whether the functions should actively break their input lines, or pass them through unmodified.

Breaking lines using string sink functions in this fashion is really only the beginning. For instance, we could envision a few simple modifications to replacement-break that would allow it to fill paragraphs instead of breaking lines: it would attempt to fill out a block of text so that all the lines are of similar lengths.

The code for this article is available for download.