By Jacques Légaré, Senior Software Developer and Mario Blažević, Senior Software Developer
1. Motivation
OmniMark has had line-breaking functionality built in ever since the XTRAN days. This functionality can be used to provide rudimentary text formatting capabilities. The language-level support for line-breaking is described quite thoroughly in the language documentation.
OmniMark’s language-level line-breaking support is very simple to use, and it aptly supports the use case where all the output of a program needs to be similarly formatted. Performance is less stellar, however, when line-breaking needs to be activated and deactivated at a fine-grained level. The reason for this is simple: when line-breaking is disabled (say, using the h modifier), OmniMark cannot predict when it might be reactivated. As a result, it still needs to compute possible line-breaking points, just in case. As efficient as OmniMark might be, this can cause a significant reduction in performance, sometimes by as much as 15%.
As of version 8.1.0, the OmniMark compiler can detect some cases when line-breaking functionality is not being used, and thereby optimize the resulting compiled program to bypass line-breaking computations. However, the compiler cannot make this determination in general: this is an undecidable problem. For instance, consider the following somewhat contrived example:
    process
       do sgml-parse document scan #main-input
          output "%c"
       done

    replacement-break " " "%n"

    element #implied
       output "%c"

    element "b"
       local stream s

       set s with (break-width 72) to "%c"
       output s
Note that line-breaking is only activated in the element rule for b, and so line-breaking will only be activated if the input file contains an element b. The OmniMark compiler cannot be expected to predict what the input files might contain when the program is executed!
Another issue with OmniMark’s built-in line-breaking is that it does not play well with referents. Specifically, consider the following program:
    replacement-break " " "%n"

    process
       local stream s

       open s as buffer with break-width 32 to 32
       using output as s
       do xml-parse scan #main-input
          output "%c"
       done
       close s

    element #implied
       output "%c" || "." ||* 64
This program puts a hard limit of 32 characters on the maximum length of lines output to s. When this program is executed, a run-time error is triggered in the body of the element rule, where we attempt to output 64 periods. On the other hand, consider the following similar program:
    replacement-break " " "%n"

    process
       local stream s

       open s as buffer with (referents-allowed & break-width 32 to 32)
       using output as s
       do xml-parse scan #main-input
          output "%c"
       done
       close s

       set referent "a" to "." ||* 64
       output s

    element #implied
       output "%c" || referent "a"
This program accomplishes virtually the same task, but instead uses a referent to output the string of periods. In this case, no run-time error is triggered: the line-breaking constraints have been silently violated.
Because of these issues, it is better to use OmniMark’s built-in line-breaking only when necessary, and to implement line-breaking with other language constructs in all other cases.
The remainder of this article discusses how to simulate line-breaking on PCDATA using string sink functions.
2. string sink functions
A string sink function is a function that can be used as the destination for strings. In a very real sense, a string sink function is the complement of a string source function, which is used as the source of strings. While a string source function outputs its strings to #current-output, a string sink function reads its strings from #current-input.
A string sink function is defined much like any other function in OmniMark, the only difference being that the return type is string sink: for example,
    define string sink function dev-null as
       void #current-input
This is a poor man’s #suppress, soaking up anything written to it.
A string sink function can have any of the properties normally used to define functions in OmniMark: e.g., it can be overloaded, dynamic, etc. The argument list of a string sink function is unrestricted. However, in the body of a string sink function, #current-output is unattached.
The form of OmniMark’s pattern matching and markup parsing capabilities makes string sink functions particularly convenient for writing filters, taking their #current-input, processing it in some fashion, and writing the result out to some destination. However, since #current-output is unattached inside the function, we need to pass the destination as an argument. For this, we use a value string sink argument. For example, a string sink function that indents its input by a given amount might be written as follows:
    define string sink function indent (value integer i, value string sink s) as
       using output as s
       do
          output " " ||* i
          repeat scan #current-input
          match "%n"
             output "%n" || " " ||* i
          match any-text* => t
             output t
          again
       done
The function indent could then be used like any other string sink:
    ; …
    using output as indent (5, #current-output)
    do sgml-parse document scan #current-input
       output "%c"
    done
(The ability to pass #current-output as a value string sink argument is new in OmniMark 8.1.0.)
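For readers who want to trace the data flow outside OmniMark, the same filter can be modelled in Python. This is only an illustrative sketch, not OmniMark code: the class name IndentSink and its write method are hypothetical stand-ins for the string sink function and its #current-input, and an io.StringIO buffer plays the role of the destination passed as an argument.

```python
import io


class IndentSink:
    """File-like sink that indents everything written to it by `width` spaces."""

    def __init__(self, width, destination):
        self.pad = " " * width
        self.destination = destination
        # Like the OmniMark function, indent the first line up front.
        self.destination.write(self.pad)

    def write(self, chunk):
        # Every newline is followed by the indentation for the next line.
        self.destination.write(chunk.replace("\n", "\n" + self.pad))


out = io.StringIO()
sink = IndentSink(5, out)
sink.write("first line\nsecond ")
sink.write("line\n")
print(repr(out.getvalue()))
```

As in the OmniMark version, the destination is handed to the filter explicitly, and a trailing newline in the input leaves a run of indentation at the end of the output.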
You can find out more about string sink functions in the language documentation.
3. Line-breaking in OmniMark
We can use a pair of string sink functions to simulate, to some extent, OmniMark’s built-in line-breaking functionality. The benefit of this approach is that it impacts the program’s performance only where it is used.
3.1. Simulating insertion-break
To simulate the effect of insertion-break on PCDATA we need to scan the input and grab as many characters as we can up to a specified width. If we encounter a newline in the process, we stop scanning. Otherwise, we output the characters we found, and append a line-breaking sequence provided by the user.
    define string sink function insertion-break value string insertion
          width value integer target-width
          into value string sink destination
    as
We can start by sanitizing our arguments:
       assert insertion matches any ** "%n"
          message "The insertion string %"" || insertion
               || "%" does not contain a newline character %"%%n%"."
This assertion is not strictly necessary. However, OmniMark insists that the line-breaking sequence contain a line-end character, and so we do the same.
We can grab a sufficient number of characters from #current-input by using OmniMark’s counted occurrence pattern operator:
       using output as destination
       repeat scan #current-input
       match any-text{1 to target-width} => l (lookahead "%n" => n)?
          output l
The use of lookahead at the end of the pattern allows us to verify if a %n is upcoming: we should only output the line-breaking sequence if the characters we grabbed are not followed by a %n.
          output insertion
             unless n is specified
       match "%n"
          output "%n"
       again
We can then use this to break the text output from the markup parser: for example,
    process
       using output as insertion-break "%n" width 20 into #current-output
       do sgml-parse document scan #main-input
          output "%c"
       done
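The scanning logic described in this section can also be traced by hand with a small Python model. This is a sketch under stated assumptions rather than a port: the function name insertion_break and its parameters are hypothetical, chosen to mirror the OmniMark arguments, and the model works on a complete string instead of a stream.

```python
def insertion_break(text, insertion, target_width):
    """Model of the insertion-break sink: returns the broken text as a string."""
    if "\n" not in insertion:
        raise ValueError("the insertion string must contain a newline")

    out = []
    pos = 0
    while pos < len(text):
        if text[pos] == "\n":
            # An existing newline is copied through unchanged.
            out.append("\n")
            pos += 1
            continue
        # Grab up to target_width characters, stopping early at a newline.
        end = pos
        while end < len(text) and end - pos < target_width and text[end] != "\n":
            end += 1
        out.append(text[pos:end])
        # Insert the break only when no newline follows the chunk. Mirroring
        # the pattern as written, this also applies at the end of the input.
        if not (end < len(text) and text[end] == "\n"):
            out.append(insertion)
        pos = end
    return "".join(out)


print(repr(insertion_break("abcdefghij\nxy\n", "\n", 4)))
```

With a width of 4, the first line is cut into abcd, efgh and ij. The last chunk is already followed by a newline, so no break is inserted after it, and the short second line passes through untouched.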
3.2. Simulating replacement-break
Simulating insertion-break on PCDATA is straightforward, because it can insert a line-breaking sequence whenever it sees fit. On the other hand, replacement-break is slightly more complex, since it must scan its input for a breakable point. For clarity, the characters between two breakable points will be referred to as words; if the breakable points are defined by the space character, they are effectively words.
    define string sink function replacement-break value string replacement
          width value integer target-width
          to value integer max-width optional
          at value string original optional initial { " " }
          into value string sink destination
    as
The argument original is used to specify the character that delimits words; the argument is optional, as a space character seems like a reasonable default. target-width specifies the desired width of the line. max-width, if specified, gives the absolute maximum acceptable line width; if a line cannot be broken within this margin, an error is thrown. Finally, the argument replacement gives the line-breaking sequence.
As before, we start by ensuring our arguments have reasonable values:
       assert length of original = 1
          message "Expecting a single character string,"
               || " but received %"" || original || "%"."
       assert replacement matches any ** "%n"
          message "The replacement string %"" || replacement
               || "%" does not contain a newline character %"%%n%"."
The second assertion is repeated from above, for the same reasons as earlier: OmniMark insists that the replacement string contain a newline, and so we will do the same. The first assertion insists that breakable points be defined by a single character; again, this is a carry-over from OmniMark’s implementation.
For replacement-break, the pattern is very different from that of insertion-break: there, we could consume everything with a single pattern, using a counted occurrence. This does not suffice with replacement-break: rather, we have to consume words until we reach target-width.
       using output as destination
       do
          local stream line initial { "" }
The stream line will be used to accumulate text from one iteration to another.
          repeat scan #current-input
          match ((original => replaceable)? any-text ** lookahead (original | "%n" | value-end)) => t
The pattern in the match clause picks up individual words. If the line length is still below target-width, we can simply append the word to the current line and continue with the next iteration:
             do when length of line + length of t < target-width
                set line with append to t
If this is not the case, we can output the text we have accumulated thus far, so long as it does not surpass max-width:
             else when max-width isnt specified | length of line < max-width
                output line
                output replacement
                   when replaceable is specified
                set line to t drop original?
If all else fails, we could not find an acceptable breakable point in the line: OmniMark throws an error in this case, so we will do the same.
             else
                not-reached message "Exceeded maximum line width"
                                 || " of %d(max-width) characters.%n"
                                 || "The line is %"" || line || "%".%n"
             done
Our string sink function needs a few more lines to be complete. For one, our previous pattern does not consume any %n that it might encounter. In this case, we should flush the accumulated text, and append a %n:
          match "%n"
             output line || "%n"
             set line to ""
          again
Lastly, when the repeat scan loop finishes, there may be some text left over in line, which needs to be emitted:
          output line
       done
Just as was the case previously in Section 3.1, “Simulating insertion-break”, we can use our function to break text output from the markup parser: for example,
    process
       using output as replacement-break "%n" width 10 to 15 into #main-output
       do sgml-parse document scan #main-input
          output "%c"
       done
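To see how the word-by-word accumulation plays out on real text, the steps of this section can be modelled in Python. Again this is a sketch, not OmniMark: the function name replacement_break and its parameters are hypothetical and simply mirror the arguments of the string sink function, the input is a complete string rather than a stream, and a regular expression stands in for the OmniMark pattern that picks up one word at a time.

```python
import re


def replacement_break(text, replacement, target_width, max_width=None, original=" "):
    """Model of the replacement-break sink: returns the broken text as a string."""
    if len(original) != 1:
        raise ValueError("expecting a single-character delimiter")
    if "\n" not in replacement:
        raise ValueError("the replacement string must contain a newline")

    d = re.escape(original)
    # A token is a newline, or a word together with its optional leading delimiter.
    token = re.compile(r"\n|%s[^%s\n]*|[^%s\n]+" % (d, d, d))

    out, line = [], ""
    for match in token.finditer(text):
        t = match.group()
        if t == "\n":
            # Flush the accumulated text and keep the newline.
            out.append(line + "\n")
            line = ""
        elif len(line) + len(t) < target_width:
            # Still below the target width: keep accumulating.
            line += t
        elif max_width is None or len(line) < max_width:
            # Emit the line; the delimiter, if any, is replaced by the break.
            out.append(line)
            if t.startswith(original):
                out.append(replacement)
                t = t[1:]
            line = t
        else:
            raise ValueError("exceeded maximum line width of %d characters" % max_width)
    out.append(line)  # emit whatever is left over
    return "".join(out)


print(replacement_break("the quick brown fox jumps over the lazy dog", "\n", 10, 15))
```

With a target width of 10, the words "jumps over" would make a line of exactly 10 characters, which is not strictly less than the target, so the model starts a new line before "over". This follows from the strict comparison used in the first test of the loop.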
4. Going further
We demonstrated in Section 1, “Motivation” that referents and line-breaking did not play well together: in fact, a referent could be used to silently violate the constraints stated by a break-width declaration. In the case of our string sink simulations, referents are a non-issue: a referent cannot be written to an internal string sink function, which effectively closes the loophole.
OmniMark’s built-in line-breaking functionality can be manipulated using the special sequences %[ and %]: by embedding one of these in a string that is output to a stream, we can activate or deactivate (respectively) line-breaking. The easiest way of achieving this effect with our string sink functions would be to add a read-only switch argument called, say, enabled, viz
    define string sink function insertion-break value string insertion
          width value integer target-width
          enabled read-only switch enabled optional
          into value string sink destination
    as
and similarly for replacement-break. We could then use the value of this shelf item to dictate whether the functions should actively break their input lines, or pass them through unmodified.
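The idea of a switch that is consulted while the sink is running can be illustrated with a deliberately simplified Python model. Everything here is an assumption made for illustration: the class name SwitchableBreaker is hypothetical, a shared dictionary stands in for the read-only switch shelf item, and the breaking itself is reduced to cutting each written chunk into fixed-width pieces without regard for existing newlines.

```python
import io


class SwitchableBreaker:
    """Breaks written text into fixed-width lines only while the switch is on."""

    def __init__(self, width, destination, switch):
        self.width = width
        self.destination = destination
        self.switch = switch  # shared with the caller, who may flip it at any time

    def write(self, chunk):
        if not self.switch["enabled"]:
            # Line-breaking is off: pass the text through unmodified.
            self.destination.write(chunk)
            return
        for start in range(0, len(chunk), self.width):
            self.destination.write(chunk[start:start + self.width] + "\n")


switch = {"enabled": True}
out = io.StringIO()
sink = SwitchableBreaker(3, out, switch)
sink.write("abcdefgh")       # broken into 3-character lines
switch["enabled"] = False
sink.write("ijklmnop")       # passed through untouched
print(repr(out.getvalue()))
```

Because the flag is shared rather than copied, the caller can flip it between writes, which is the behaviour the read-only argument is meant to provide.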
Breaking lines using string sink functions in this fashion is really only the beginning. For instance, we could envision a few simple modifications to replacement-break that would allow it to fill paragraphs instead of breaking lines: it would attempt to fill out a block of text so that all the lines are of similar lengths.
The code for this article is available for download.