Time is the enemy in content conversion projects

All content conversion projects are time-sensitive to one degree or another.

If you are a content aggregator, it is essential that your customers receive up-to-date information. If you are a publisher of technical content migrating to a new authoring system, or integrating a steady stream of OEM content into your own documentation set, converted content must be ready in time to meet publication deadlines, and must not add time-consuming tasks to the work of content creators.

Yet performing content conversion in a timely manner is beset with difficulties. You not only have to worry about the total conversion time of the overall project, but also about the development time for automating conversion processes, and the execution time of the processes themselves.

The 80/20 rule

The 80/20 rule applies to content conversion. You can do 80% of the task for 20% of the time and cost by automating the basic format conversion and ignoring all the special cases and exceptions. Often you can buy a tool or service to do the 80% out of the box, or with minimal configuration and scripting. But the remaining 20% of the conversion still has to be done, and if that is left to manual cleanup, the total conversion time will be unacceptably long.

No content conversion process is ever going to be perfect (including manual cleanup), but if you can improve the automated conversion process to perform at 99% or better, you can reach an acceptable level of data quality for most projects. To accomplish this, however, you are going to need to do a lot more development work on your conversion process to handle your unique content and business rules. Off-the-shelf tools are not going to suffice for the last 20%.
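
To put rough numbers on the difference between 80% and 99% automation, here is a small Python illustration. The corpus size of 10,000 items is a made-up figure chosen only for the arithmetic, and the function name is hypothetical.

```python
def manual_cleanup_items(total_items: int, automated_percent: int) -> int:
    """Items left for manual cleanup when automated_percent of them convert cleanly."""
    return total_items * (100 - automated_percent) // 100


TOTAL_ITEMS = 10_000  # hypothetical corpus size, for illustration only

AT_80 = manual_cleanup_items(TOTAL_ITEMS, 80)  # out-of-the-box conversion
AT_99 = manual_cleanup_items(TOTAL_ITEMS, 99)  # tuned conversion process

print(AT_80, AT_99)
```

Under these assumed figures, moving from 80% to 99% cuts the manual cleanup workload by a factor of twenty, which is why the extra development work on the last 20% pays off.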

All in one shot

You can write conversion scripts to do the entire conversion in one pass through the data. However, this is going to create a very large and complex application that will be hard to develop and debug, and hard to maintain and adapt when new content sources or new business rules are introduced. The additional development time is going to mean that your total conversion time will once again be unacceptably long.

The step-by-step approach

To simplify development and maintenance of the conversion, you can break the conversion process up into multiple steps. You can’t simply take the output of the easy 80% conversion and automate the last 20% on top of it, because the easy 80% loses too much of the context that you will need to automate the last 20%. Instead, you have to design the whole process as a sequence of smaller steps. This also allows you to use a different tool for each step, such as Perl, XSLT, or Java, whichever is best for a particular process.

This approach makes your conversion process more maintainable, but it is also much slower to execute and uses more resources because of the need to serialize and parse the content between each step, and the memory required to buffer the content between processes. And while the architecture is more flexible, there is a lot of extra code to write and debug in order to correctly serialize the output of each step, and then to parse it again for the next step. Overall, the total conversion time still suffers.
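
The cost described above can be seen in a minimal Python sketch of a two-step conversion. The element names (doc, para, p) and the two steps are invented for illustration. Each step receives serialized XML text, parses it into a tree, makes one small change, and serializes the whole tree again for the next step.

```python
import xml.etree.ElementTree as ET


def rename_para(xml_text: str) -> str:
    """Step 1: rename every <para> element to <p>."""
    root = ET.fromstring(xml_text)            # parse the whole document
    for element in list(root.iter("para")):
        element.tag = "p"
    return ET.tostring(root, encoding="unicode")  # serialize it all again


def drop_empty_paragraphs(xml_text: str) -> str:
    """Step 2: remove <p> elements that have no text and no children."""
    root = ET.fromstring(xml_text)            # parse a second time
    for parent in list(root.iter()):
        for child in list(parent):
            is_empty = not (child.text or "").strip() and len(child) == 0
            if child.tag == "p" and is_empty:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")  # serialize a second time


def run_pipeline(xml_text: str, steps) -> str:
    """Feed the serialized output of each step into the next one."""
    for step in steps:
        xml_text = step(xml_text)
    return xml_text


SAMPLE = "<doc><para>Hello</para><para> </para></doc>"
RESULT = run_pipeline(SAMPLE, [rename_para, drop_empty_paragraphs])
print(RESULT)
```

Each step is easy to read and test on its own, but every step added to the list adds another full parse and another full serialization of the document, and the complete tree sits in memory at each stage.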

OmniMark 10 conversion pipelines

With conventional tools, there is little more you can do to optimize the development process and/or the overall conversion time. With OmniMark 10, however, there is another option. OmniMark 10 allows you to create conversion pipelines which can be broken down into small steps without the need to serialize and parse the data between each conversion step.

How does it work?

Like some other tools, OmniMark 10 uses an event-based parsing approach. Unlike other tools, however, OmniMark 10 allows you to combine multiple parsing sources in a common parse event stream, and to generate parse events at each stage in the pipeline. Because each filter in the pipeline can catch incoming parse events and insert new parse events into the parse event stream, there is no need to serialize data between filters, which means the pipeline runs faster and uses fewer resources.
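
The general idea of filters that exchange parse events instead of serialized text can be sketched with Python generators. This is an analogy only: it is not OmniMark code and does not use or represent the OmniMark API. The event tuples, filter names, and element names are all assumptions made for the illustration.

```python
from typing import Iterable, Iterator, Tuple

# An event is (kind, value), where kind is "start", "end", or "text".
Event = Tuple[str, str]


def source() -> Iterator[Event]:
    """Stand-in for a parser: emits events for <doc><para>Hello</para></doc>."""
    yield ("start", "doc")
    yield ("start", "para")
    yield ("text", "Hello")
    yield ("end", "para")
    yield ("end", "doc")


def rename(events: Iterable[Event], old: str, new: str) -> Iterator[Event]:
    """Filter: catch start/end events for `old` and re-emit them as `new`."""
    for kind, value in events:
        if kind in ("start", "end") and value == old:
            yield (kind, new)
        else:
            yield (kind, value)


def wrap_body(events: Iterable[Event], root: str) -> Iterator[Event]:
    """Filter: insert new <body> start/end events just inside `root`.

    Assumes `root` is not nested inside itself.
    """
    for kind, value in events:
        if kind == "end" and value == root:
            yield ("end", "body")
        yield (kind, value)
        if kind == "start" and value == root:
            yield ("start", "body")


def serialize(events: Iterable[Event]) -> str:
    """Final stage only: turn events into text (no escaping, for brevity)."""
    parts = []
    for kind, value in events:
        if kind == "start":
            parts.append(f"<{value}>")
        elif kind == "end":
            parts.append(f"</{value}>")
        else:
            parts.append(value)
    return "".join(parts)


# Events flow through both filters one at a time; text is produced once, at the end.
OUTPUT = serialize(wrap_body(rename(source(), "para", "p"), "doc"))
print(OUTPUT)
```

In this sketch no filter ever sees or produces serialized markup, and no intermediate document is buffered between filters. Serialization happens once, after the last filter, which is the property the paragraph above attributes to an event-stream pipeline.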

Solving the time crunch

Because there is no need to serialize and parse between steps, you can break the process down much more finely, which keeps each filter as simple as possible and allows you to build a library of reusable filters. This helps you to maintain and update your conversion pipeline with minimal effort and disruption. Because OmniMark 10 is a full-featured content processing platform, there is no need to use different programming languages for different parts of the process. All the capabilities you need for content processing are present in OmniMark 10.

Taken together, these features provide the solution to the content conversion time crunch: rapid development and rapid execution add up to rapid completion of the content conversion.