Maul Publisher goes Unicode

by Peter Koller, © June 2006

With the modern requirement to provide advanced editing features, Unicode becomes essential. Maul Publisher V3.06 is the first version to use the Unicode library found in recent versions of OS/2 and eComStation. But what does that mean for the end user?

What is Maul Publisher

Maul Publisher is an industrial-strength desktop publisher capable of creating virtually all of the printing seen on everyday household items. You can use it to easily lay out newspapers, cards, books, labels, stamps, posters, charts, forms, and even designs like building plans or furniture arrangements.

In essence, the application combines text and images on a page. Once the page has been created, you can print it, create a PDF document with it, or turn it into an image or metafile. Maul has some uniquely powerful tools to deal with both pictures and text, and is specially designed to provide the best quality possible for a given printer.

Because the application tunes its output to the printer, you must have a printer installed. The resolution available to a printer is four to eight times finer than that available on a screen. Because of this, and because rounding is minimised, the printed output from Maul is usually stunningly clear.

What is Unicode

Unicode is designed to support a character set larger than the 256 codepoints of a single-byte codepage. This has several distinct advantages:

The character set can support many languages
More codepoints are available for symbols and special graphics
The application has less processing to do with languages and codepages
The Unicode API provides advanced character testing features

With version 3.06, Maul Publisher uses the OS/2 Unicode API, and this has a significant impact on how the application decides where to place text and what text to place.

Maul and Unicode

The Unicode API provides a much more useful set of character tests. This probably has very little impact on western languages, where words are separated by spaces, but it provides major improvements for languages where words are not separated by spaces, such as Japanese. By testing for the _punctstart and _punctend attributes, Maul can now correctly format pictogram strings in quotation marks.
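As a rough illustration of this kind of character test, the sketch below uses Python's standard unicodedata module rather than the OS/2 Unicode API. It assumes that the Unicode general categories Ps and Pi (opening punctuation) and Pe and Pf (closing punctuation) are a reasonable stand-in for the _punctstart and _punctend tests. The function names are made up for this example and are not taken from Maul.

```python
import unicodedata

def is_punct_start(ch):
    """True for opening punctuation, e.g. '(' or the Japanese quote U+300C.

    Assumption: general categories Ps/Pi approximate a _punctstart test.
    """
    return unicodedata.category(ch) in ("Ps", "Pi")

def is_punct_end(ch):
    """True for closing punctuation, e.g. ')' or the Japanese quote U+300D.

    Assumption: general categories Pe/Pf approximate a _punctend test.
    """
    return unicodedata.category(ch) in ("Pe", "Pf")

# Japanese corner brackets followed by typographic double quotes
for ch in "\u300c\u300d\u201c\u201d":
    print(f"U+{ord(ch):04X} start={is_punct_start(ch)} end={is_punct_end(ch)}")
```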

The additional characters available in Unicode enable Maul to support smart quotes for the first time. I have called them Intelligent text quotes, because smart quotes is the phrase used by MS Office. And anyway—Maul does it better:

Fig. 1. Intelligent text quotes example

The Unicode character set enables character lookup by name so the application doesn't need to know the codepoint of a particular character. This made it possible to add a bulleted (and numbered) list tool which considerably simplifies the addition of lists to your text articles.

Fig. 2. Bullets and lists example
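The idea of looking a character up by name instead of by codepoint can be tried out with Python's standard unicodedata module. This is only an analogy to what Maul does through the OS/2 Unicode API; the helper function bulleted is hypothetical.

```python
import unicodedata

# Look the bullet up by its Unicode name; no codepoint is hard-coded here.
bullet = unicodedata.lookup("BULLET")

def bulleted(items, bullet_name="BULLET"):
    """Prefix each item with the character that has the given Unicode name."""
    mark = unicodedata.lookup(bullet_name)
    return [f"{mark} {item}" for item in items]

for line in bulleted(["Newspapers", "Cards", "Posters"]):
    print(line)
```

Passing a different name, such as "WHITE BULLET", changes the list marker without the caller ever handling a codepoint.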

Maul and character testing

Because Maul was developed by just one person—me—I can only guess which character properties to use when formatting a text article. This means that I must depend on the end user—you—to help determine what these character tests should be.

The character test is used to divide sentences up into words. The words then determine how much text fits onto a line. Where appropriate, hyphenation breaks words in two when they do not fit on a line.

Generally, for western languages the space character determines where a word ends. However, there are situations where the space character is not available. This can happen where a comma is used to separate two words, such as in hello,there. Maul can break up this string by testing for characters with a break attribute. The space character is a classic example of a character that has a break attribute.

For pictogram languages such as Japanese, every character has a break attribute. This behaviour must be modified when the pictogram is in quotes. This is achieved by using an attach attribute. The attach attribute overrides the break attribute of the previous character. Characters that the Unicode API tests as _punctend are marked with both a break and an attach attribute. All the alphanumeric characters have no attributes set.

It turns out that you can break up a string into words in any language with just the two attributes described above. The example below shows how this works with some Kanji text. I have shown the attributes on the second line. Note that the Kanji string is not meant to mean anything in particular.

Fig. 3. Attributes example

Because of the attach attributes shown above, the separated words include their closing quotes.
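The two-attribute rule described above can be sketched in a few lines. This is a minimal illustration and not Maul's actual formatter: the attribute values are assigned by hand instead of coming from the Unicode API, the function name split_words is made up, and the opening quote is assumed to carry no attributes so that it stays with the character that follows it.

```python
BREAKING = 0x001  # a word may end after this character
ATTACH = 0x002    # cancels the break of the previous character

def split_words(tagged):
    """Split a list of (character, attributes) pairs into words.

    A word ends after a character that has BREAKING set, unless the
    next character has ATTACH set.
    """
    words, current = [], ""
    for i, (ch, attrs) in enumerate(tagged):
        current += ch
        next_attrs = tagged[i + 1][1] if i + 1 < len(tagged) else 0
        if (attrs & BREAKING) and not (next_attrs & ATTACH):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

# Western text: only the comma carries a break attribute.
western = [(c, BREAKING if c == "," else 0) for c in "hello,there"]
print(split_words(western))

# Pictogram text: every character breaks, the closing quote also attaches.
kanji = [("\u65e5", BREAKING), ("\u672c", BREAKING), ("\u300c", 0),
         ("\u8a9e", BREAKING), ("\u300d", BREAKING | ATTACH)]
print(split_words(kanji))
```

In the second call the quoted character keeps both of its quote marks, because the closing quote's attach attribute cancels the break of the character before it.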

If these attributes are wrong, for example if bopomofo characters are not marked as breaking, then the text formatter will tend to fail. As I don't use these languages, I rely on you to tell me if it doesn't work!

Unicode limitations

Because Maul Publisher relies on text escape sequences called LOLs that begin with a zero byte, only UTF-8 Unicode is supported. The UTF-8 Unicode codepage number is 1208. UTF-8 is a format that can be processed by systems that work one byte at a time. It is distinguished by the fact that an encoded character never contains a zero byte (the NUL character itself excepted), so it cannot be confused with the start of an escape sequence. As originally designed, a UTF-8 sequence could be up to 6 bytes long, although the current standard (RFC 3629) limits it to 4 bytes.

OS/2 supports UTF-8 sequences of at most 3 bytes at present, and Maul is designed with this in mind. This covers the full range of characters available in the Unicode-compatible fonts available for OS/2. For characters that need 3 bytes, UTF-8 takes up more space than normal 16-bit Unicode, and to test characters the UTF-8 sequences must first be converted into 16-bit Unicode.
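To make the conversion step concrete, here is a small sketch that decodes a UTF-8 sequence of one to three bytes into its 16-bit Unicode value. It follows the standard UTF-8 bit layout and is not Maul's own code; the function name decode_utf8_bmp is invented for this example, and continuation bytes are not validated.

```python
def decode_utf8_bmp(data):
    """Decode one UTF-8 sequence (1 to 3 bytes) into a 16-bit code value.

    Sketch only: continuation bytes are assumed to be well formed.
    """
    b0 = data[0]
    if b0 < 0x80:                      # 0xxxxxxx
        return b0
    if (b0 & 0xE0) == 0xC0:            # 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (data[1] & 0x3F)
    if (b0 & 0xF0) == 0xE0:            # 1110xxxx 10xxxxxx 10xxxxxx
        return ((b0 & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F)
    raise ValueError("sequences longer than 3 bytes are not handled here")

for ch in ("A", "\u00e9", "\u8a9e"):
    raw = ch.encode("utf-8")           # the byte sequence as stored in a file
    has_zero = 0 in raw                # stays False for these characters
    print(raw.hex(), len(raw), f"{decode_utf8_bmp(raw):04X}", has_zero)
```

The Kanji character in the last line occupies three bytes in UTF-8 but only two in 16-bit Unicode, which is the size difference mentioned above.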

The Insert character dialog provided in Maul Publisher shows the byte sequence as it is found in the file, the 16 bit Unicode equivalent, the character attributes, and where possible the name of the character.

The full list of character attributes as Maul displays them is:

CHARCLASS_BREAKING 0x001
CHARCLASS_ATTACH   0x002
CHARCLASS_SPACE    0x004
CHARCLASS_HYPHEN   0x008
CHARCLASS_QUOTE    0x010
CHARCLASS_RQUOTE   0x020

So the right quote in the example image above (Figure 3) has the code [0033], which is the sum of the BREAKING, ATTACH, QUOTE and RQUOTE values:

Fig. 4. Insert char dialog
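An attribute code such as [0033] can be read back into attribute names with a simple bit test. The sketch below assumes that the code is a hexadecimal combination of the CHARCLASS values listed above; the dictionary and the function name describe are illustrative only.

```python
# Attribute values as listed in the article, without the CHARCLASS_ prefix.
CHARCLASS = {
    "BREAKING": 0x001,
    "ATTACH": 0x002,
    "SPACE": 0x004,
    "HYPHEN": 0x008,
    "QUOTE": 0x010,
    "RQUOTE": 0x020,
}

def describe(code):
    """Return the names of all attributes whose bit is set in code."""
    return [name for name, bit in CHARCLASS.items() if code & bit]

print(describe(0x033))  # ['BREAKING', 'ATTACH', 'QUOTE', 'RQUOTE']
```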