Unicode

1. UTF-8 display in Windows from xsl:message
2. The Unicode Bidirectional Algorithm
3. Transcoding
4. Doctype declaration, etc.
5. Global parameters with UTF-8 characters and ???s

1.

UTF-8 display in Windows from xsl:message

Abel Braaksma



> The standard
> Windows shell is unable to display UTF-8 correctly.

It is not that hard to make the Windows command shell work with UTF-8. The default codepage for the Windows shell is 437, which, I believe, is something ancient and not really useful with Unicode. To enable UTF-8 support in the Windows console, you must do four things:

Trigger Unicode support for pipes, redirection, etc. (the default is ANSI) with the command:

    cmd /u

Set the font of the console to one that has glyphs in the Unicode range:
- go to console system menu (Alt-Space)
- select Properties > Font
- select "Lucida Sans" (MS will automatically select "Lucida Sans Unicode" when it is needed and when it is available on your system)

Change the codepage of the console screen to use Unicode (default is IBM 437) with the command:

    chcp 65001

Call your commands *without* using a batch file (batch files won't work anymore once the codepage has been changed). You can put your command in an environment variable for convenience.

If you now run Saxon, you will see the output as Unicode.

However, this won't solve your problem with xsl:message (sorry), because Saxon seems to emit the messages of xsl:message and fn:trace using Latin-1 encoding or similar (I believe it would be nicer if Saxon output UTF-8, but maybe this is Sun Java's problem, not Saxon's). The result is empty output after the above-mentioned procedure. Even if you use the 2>out.txt method mentioned earlier in this thread, you still need Saxon to output UTF-8 in the first place to get the correct results.
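To make the mismatch concrete, here is a minimal Python sketch (Python is only a stand-in here; it is not part of the original thread, and the helper name `shown_as` is made up for illustration). It decodes the same bytes the way consoles set to different codepages would, assuming the console simply interprets whatever bytes it receives in its current codepage.

```python
# Illustration only: how one character's bytes look under different codepages.
text = "\u00E9"  # e-acute

utf8_bytes = text.encode("utf-8")          # two bytes: C3 A9
latin1_bytes = text.encode("iso-8859-1")   # one byte: E9


def shown_as(data: bytes, codepage: str) -> str:
    """Decode bytes as a console using `codepage` would interpret them."""
    return data.decode(codepage, errors="replace")


# UTF-8 bytes on a codepage 437 console: two unrelated glyphs.
as_437 = shown_as(utf8_bytes, "cp437")

# UTF-8 bytes on a UTF-8 (chcp 65001) console: the intended character.
as_utf8 = shown_as(utf8_bytes, "utf-8")

# Latin-1 bytes on a UTF-8 console: an invalid sequence, so no usable character.
latin1_on_utf8 = shown_as(latin1_bytes, "utf-8")

print(repr(as_437), repr(as_utf8), repr(latin1_on_utf8))
```

The last case is the situation described above: Latin-1 output arriving at a console that expects UTF-8.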

Some of this info is from this blog: msdn.com

2.

The Unicode Bidirectional Algorithm

Tony Graham

The Unicode Bidirectional Algorithm in a nutshell...

Bidirectional types

Unicode characters have a "bidirectional type". There are lots of types, but they're divided into three categories: strong, weak, and neutral.

Characters with a strong bidirectional type really know their directionality. For example, the characters in most alphabets are "strongly" left-to-right, and the characters in the Hebrew and Arabic alphabets (and some others) are "strongly" right-to-left.

Characters with a weak bidirectional type determine their directionality according to their proximity to other characters with strong directionality.

Characters with a neutral bidirectional type determine their directionality from either the surrounding strong text or the embedding level.
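As a rough sketch of the three categories, Python's standard `unicodedata.bidirectional()` returns the bidirectional type of a character. The grouping into sets below follows the strong/weak/neutral split of the Unicode Bidirectional Algorithm as I understand it; the function name `bidi_category` is invented for this example.

```python
import unicodedata

# Grouping of bidirectional types into the three categories (plus explicit codes).
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}
EXPLICIT = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_category(ch: str) -> str:
    """Return 'strong', 'weak', 'neutral', 'explicit', or 'unknown' for one character."""
    bidi_type = unicodedata.bidirectional(ch)
    if bidi_type in STRONG:
        return "strong"
    if bidi_type in WEAK:
        return "weak"
    if bidi_type in NEUTRAL:
        return "neutral"
    if bidi_type in EXPLICIT:
        return "explicit"
    return "unknown"  # e.g. unassigned code points, which report an empty type


for sample in ("a", "\u05D0", "\u0628", "1", ",", " ", "!"):
    print(repr(sample), unicodedata.bidirectional(sample), bidi_category(sample))
```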

Embedding levels

The Unicode Bidirectional Algorithm works in terms of "levels" of right-to-left text embedded with left-to-right text, and vice versa.

Even levels (0, 2, 4, ..., 60) are left-to-right. Odd levels (1, 3, ..., 61) are right-to-left.

Text at an even level is rendered left-to-right. Text at an odd level is rendered right-to-left.

The Unicode Bidirectional Algorithm works on paragraphs, so the first step is dividing text into paragraphs. You determine the "paragraph embedding level" by finding the first character in the paragraph with a strong bidirectional category. If the character is strongly left-to-right, the paragraph embedding level is 0, otherwise (i.e. if the character is strongly right-to-left), the embedding level is 1.
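The "first strong character" rule can be sketched in a few lines of Python. This is a simplified illustration, not a full implementation: it ignores explicit formatting characters, and falling back to level 0 when a paragraph has no strong character at all is an assumption made here. The name `paragraph_level` is made up for the example.

```python
import unicodedata


def paragraph_level(paragraph: str) -> int:
    """Return 0 (left-to-right) or 1 (right-to-left) from the first strong character."""
    for ch in paragraph:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type == "L":
            return 0
        if bidi_type in ("R", "AL"):
            return 1
    # Assumption: with no strong character, default to left-to-right.
    return 0


hebrew = "\u05E9\u05DC\u05D5\u05DD"          # a Hebrew word
arabic = "\u0645\u0631\u062D\u0628\u0627"    # an Arabic word

print(paragraph_level("Hello " + hebrew))    # starts with a strong LTR letter
print(paragraph_level(hebrew + " Hello"))    # starts with a strong RTL letter
print(paragraph_level("(1) " + arabic))      # digits and punctuation are skipped
```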

Embedding goes on from there: contained text with the opposite directionality is at the next embedding level up, and text with the original directionality that is contained by the text with the opposite directionality is one level higher again.

Explicit bidirectional formatting

Unicode includes characters for fudging the embedding level:

- RLE, Right-to-Left Embedding, says treat the following text as right-to-left. I.e., it forces the embedding level to the lowest odd number greater than the current level: level 0 -> level 1; 1 or 2 -> 3, etc.
- LRE, Left-to-Right Embedding, says treat the following text as left-to-right. I.e., it forces the embedding level to the lowest even number greater than the current level: 0 or 1 -> 2; 2 or 3 -> 4, etc.
- RLO, Right-to-Left Override, says treat the following characters as strong right-to-left characters. I.e. it forces an odd embedding level, but it also sets the "override status" to right-to-left so the implementation knows which way to push those neutral types.
- LRO, Left-to-Right Override, says treat the following characters as strong left-to-right characters. I.e. it forces an even embedding level, but it also sets the "override status" to left-to-right so the implementation knows which way to push those neutral types.
- PDF, Pop Directional Format, is the generic "end-tag" for the previous RLE, LRE, RLO, or LRO character.
- RLM, Right-to-Left Mark, is a zero-width (i.e. it doesn't print) character that is used as an invisible spot of strong right-to-left directionality to coerce neighbouring weak and neutral characters into behaving the way you want. This doesn't change the embedding level.

The example in the Unicode Standard shows RLM being used with an exclamation mark (i.e., '!') that is between some left-to-right text and some neutral text, all of which is within some right-to-left text. Without the RLM, the ! is treated as part of the span of left-to-right text. With the RLM between the left-to-right text and the !, the ! is treated as part of the right-to-left text, which changes on which end of the left-to-right text it is rendered.

- LRM, Left-to-Right Mark, is a zero-width (i.e. it doesn't print) character that is used as an invisible spot of strong left-to-right directionality to coerce neighbouring weak and neutral characters into behaving the way you want. This doesn't change the embedding level.

RLM and LRM are good if you know what you're doing, you probably have an editor that lets you represent them, and you're worried about conserving embedding levels. For the rest of us, the other five characters represent the brute-strength and ignorance approach that we're more comfortable with.
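For reference, the seven characters described above have fixed code points, and the level arithmetic for RLE and LRE is easy to write down. The sketch below is only an illustration in Python; the helper names (`next_odd_level`, `next_even_level`, `embed`) are invented here and the maximum-depth limit is ignored.

```python
# Explicit bidirectional formatting characters and their code points.
LRM = "\u200E"  # Left-to-Right Mark
RLM = "\u200F"  # Right-to-Left Mark
LRE = "\u202A"  # Left-to-Right Embedding
RLE = "\u202B"  # Right-to-Left Embedding
PDF = "\u202C"  # Pop Directional Format
LRO = "\u202D"  # Left-to-Right Override
RLO = "\u202E"  # Right-to-Left Override


def next_odd_level(level: int) -> int:
    """Level forced by RLE or RLO: the lowest odd number greater than `level`."""
    return (level + 1) | 1


def next_even_level(level: int) -> int:
    """Level forced by LRE or LRO: the lowest even number greater than `level`."""
    return (level + 2) & ~1


def embed(text: str, rtl: bool) -> str:
    """Wrap text in RLE...PDF or LRE...PDF, like a start-tag and end-tag pair."""
    return (RLE if rtl else LRE) + text + PDF


print([next_odd_level(n) for n in range(4)])
print([next_even_level(n) for n in range(4)])
print(ascii(embed("abc", rtl=True)))
```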

Bidirectional conformance

Systems do not need to support any explicit directional formatting codes.

The "implicit bidirectional algorithm" can be taken as handling bidirectionality based solely on embedding levels and the characters' bidirectionality types and without any overrides.

There isn't an "explicit bidirectional algorithm" as such. The explicit codes distort the embedding levels compared to what they would ordinarily be, but after they've been taken into account, the "implicit" algorithm, based on embedding levels and characters' types, is what finally determines which text is rendered in which direction.

Higher-level protocols

The "permissible ways for systems to apply higher-level protocols to the ordering of bidirectional text" are:

- Override the paragraph embedding level
- Override the number handling to use information provided by a broader context (Let's not go there.)
- Replace, supplement, or override the bidirectional overrides or embedding codes
- Override the bidirectional character types assigned to control codes to match the interpretation of the control codes used within the protocol (Let's not go there either.)
- Remap the number shapes to match those of another set (Ditto.)

HTML, CSS, and XSL do the first and third only.

HTML

HTML has a "dir" attribute for indicating the directionality of text. The allowed values are RTL and LTR. I.e., "dir" overrides the paragraph embedding level and replaces the embedding codes.

HTML also has a <bdo> element that is used for overriding the effects of the bidirectional algorithm on a span of text. I.e., it replaces the override codes.

The HTML Recommendation warns against mixing its controls with explicit bidirectional override characters. Hardly surprising.

CSS2

As Paul noted, CSS has a "direction" property with values "ltr" and "rtl" (and "inherit"). It specifies "the base writing direction of blocks and the direction of embeddings and overrides for the Unicode BIDI algorithm." I.e., it overrides the paragraph embedding level for blocks (i.e., for what Unicode considers paragraphs) and it's also used for replacing the bidirectional overrides and embedding codes.

The "unicode-bidi" property is the other half of how CSS2 replaces the bidirectional overrides and embedding codes. The allowed values are "normal", "embed", "override", and "inherit".

'unicode-bidi: normal' doesn't do anything, which is why 'normal' is the default value.

'unicode-bidi: embed' is equivalent to RLE (when 'direction: rtl') or LRE (when 'direction: ltr') at one end of a span of text and a PDF at the other.

'unicode-bidi: override' is equivalent to RLO (when 'direction: rtl') or LRO (when 'direction: ltr') at one end of a span of text and a PDF at the other.
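The equivalences above amount to a small lookup table. The Python sketch below spells it out; it is only a restatement of the mapping (the function name `css_equivalent` is hypothetical), not something a CSS implementation is required to do literally.

```python
PDF = "\u202C"  # Pop Directional Format, closes any of the four openers

# (unicode-bidi value, direction value) -> opening control character
OPENERS = {
    ("embed", "rtl"): "\u202B",     # RLE
    ("embed", "ltr"): "\u202A",     # LRE
    ("override", "rtl"): "\u202E",  # RLO
    ("override", "ltr"): "\u202D",  # LRO
}


def css_equivalent(text: str, unicode_bidi: str, direction: str) -> str:
    """Return text wrapped in the control characters the CSS values stand for."""
    if unicode_bidi == "normal":
        return text  # 'normal' adds nothing
    return OPENERS[(unicode_bidi, direction)] + text + PDF


print(ascii(css_equivalent("abc", "embed", "rtl")))
print(ascii(css_equivalent("abc", "override", "ltr")))
```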

XSL

XSL has "direction", "unicode-bidi" and "writing-mode" properties, although they don't all apply to all the same formatting objects.

"writing-mode" applies to the formatting objects that set up a "reference-area", i.e., to the big-picture formatting objects that specify the page, the regions within the page, to tables, and to table cells. It affects how you sequence blocks of text, but it also overrides the "paragraph embedding level."

"direction" and "unicode-bidi" apply only to the "bidi-override" formatting object. They behave pretty much like in CSS2, except that the initial value of "direction" is derived from the current "writing-mode" value rather than being explicitly "ltr".

(Determining the initial value of "direction" this way probably means fewer surprises when formatting a purely right-to-left document, but the "direction" description does read like it was written for "direction" to apply to more formatting objects than just "bidi-override".)

Conclusion

1. If using markup to control bidirectionality, you need a way to set the paragraph embedding level (i.e., set whether the paragraph starts out right-to-left or left-to-right) as well as a way to override the implicit bidirectional algorithm (the algorithm that works w.r.t. the characters' bidirectional types).

2. Markup that overrides the implicit bidirectional algorithm should support both overrides (RLO and LRO equivalent) and embeds (RLE and LRE equivalent).

3. Include strong words against mixing markup-based bidirectionality controls and the explicit bidirectionality characters.

4. Consistency with existing standards is a GOOD THING. Compatibility with the Unicode Bidirectional Algorithm is essential.

5. Work out whether every inline can affect bidirectionality (CSS style) or whether there's one special-purpose element (HTML and XSL style, although I don't expect XHTML to stick to that and it doesn't matter for HTML anyway if you're also using CSS).

6. A politically correct default direction is hard to determine. CSS2 uses 'ltr', and XSL lets the XSL processor have a default.

3.

Transcoding

Mike Brown

> Do you know of any applications that take in
> a file in encoding, say iso8859-1 and drop out utf-8 please.

perhaps this one? iro.umontreal.ca

Kevin Bulgrien offers:

iconv is a tool that is available on both Windows and GNU/Linux platforms. I use it extensively with Microsoft UTF-16LE files to convert them back and forth from one encoding to another. Some options for win32 are libiconv at: sourceforge and gnuwin32.sourceforge
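If neither tool is to hand, the same conversion is a few lines in most scripting languages. Here is a minimal Python sketch (not from the original thread; the function name `transcode` and the file names are made up). It reads the whole file into memory, so it suits ordinary-sized documents only.

```python
from pathlib import Path


def transcode(src: str, dst: str,
              from_encoding: str = "iso-8859-1",
              to_encoding: str = "utf-8") -> None:
    """Re-encode the file at `src` into `dst`, leaving line endings untouched."""
    raw = Path(src).read_bytes()
    text = raw.decode(from_encoding)   # bytes -> Unicode characters
    Path(dst).write_bytes(text.encode(to_encoding))  # characters -> new bytes


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    Path("in-latin1.txt").write_bytes(b"caf\xe9\n")
    transcode("in-latin1.txt", "out-utf8.txt")
    print(Path("out-utf8.txt").read_bytes())
```

Note that if the file is XML with an encoding declaration, the declaration is not rewritten by this sketch and would need changing to match the new bytes.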

4.

Doctype declaration, etc.

Mike Brown

The post to which I'm replying had nothing directly to do with XSLT, but I feel compelled to respond, because the information in it is rife with errors, and because I'm obsessed with character encoding.

.. wrote:
> I work with SGML.  When you declare "DOCTYPE" the composing/processing
> engine is going to expect a DTD.

Do not try to divine the XML parsing model or the XSLT processing model just based on the default, apparent behavior of your favorite toolsets and their usually less-than-thorough documentation.

I don't know what the requirements are for SGML parsers, but XML parsers have much leeway as to when they are required to read a DTD, and what parts of the DTD they must read (for example, external parts are optional).

More importantly, the XML parser's user has control over whether the parser tries to validate or not. And the parser (say, Expat), can be set to do things like read external entities but not external DTDs, allowing situations where you can still parse a document that contains an entity reference without a corresponding entity declaration, so long as the standalone declaration agrees.

Furthermore, XML document authors have flexibility in what they can do. For example, <!DOCTYPE blah> is legal even though it does not contain any DTD info at all.


>  Can you declare the necessary encoding
> in the XML declaration (<?xml version="1.0" encoding="ISO-8858-1"?>)

ISO-8859-1. And an encoding declaration is an informative hint to the XML parser to tell it how the *bytes* of the document (think of what you see if you look at the document in a hex editor rather than a text editor) should be converted to Unicode characters as it is read in.

There is only one correct encoding that you can declare: the one actually used for producing the bytes that comprise that particular document. It has to be accurate, or "close enough" in the case of, say, a US-ASCII encoded document being declared as UTF-8. You cannot just make it up.
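A small experiment shows what happens when the declaration and the bytes disagree. This is a sketch using Python's standard `xml.etree.ElementTree` (which sits on Expat); the helper name `parse_text` is invented for the example, and other parsers may report the failing case differently.

```python
import xml.etree.ElementTree as ET


def parse_text(declared: str, body: bytes) -> str:
    """Parse <a>body</a> with the given encoding declaration and return its text."""
    prolog = '<?xml version="1.0" encoding="%s"?>' % declared
    return ET.fromstring(prolog.encode("ascii") + b"<a>" + body + b"</a>").text


latin1_bytes = "\u00E8".encode("iso-8859-1")  # one byte, E8
utf8_bytes = "\u00E8".encode("utf-8")         # two bytes, C3 A8

# Declaration matches the bytes: the parser recovers e-grave both times.
right_latin1 = parse_text("ISO-8859-1", latin1_bytes)
right_utf8 = parse_text("UTF-8", utf8_bytes)

# UTF-8 bytes declared as ISO-8859-1: parses, but yields two wrong characters.
wrong = parse_text("ISO-8859-1", utf8_bytes)

# ISO-8859-1 bytes declared as UTF-8: not a valid UTF-8 sequence.
try:
    parse_text("UTF-8", latin1_bytes)
    outcome = "accepted"
except ET.ParseError as err:
    outcome = "rejected: %s" % err

print(ascii(right_latin1), ascii(right_utf8), ascii(wrong), outcome)
```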


> and then use the Unicode number?

"using the Unicode number" in more correct terminology is "using a (numeric) character reference" like "&#232;" or "&#xE8;"

By definition, a character reference always uses Unicode code points. So "&#232;" or "&#xE8;" are both referring to Unicode character number 232 (decimal), which happens to be the small Latin letter e with grave accent.

When using a character reference, the fact that the document was encoded with whatever encoding was used is irrelevant. &#232; always means Unicode character at code point 232, never "byte 232 in encoding XYZ", unless you are using that nonconformant abomination known as Netscape Navigator (or Communicator) version 4.
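This independence from the document encoding can be checked directly. The sketch below (Python's standard `xml.etree.ElementTree`; the helper name `text_of` is made up) parses the same ASCII-only document under three different encoding declarations and gets the same character each time.

```python
import xml.etree.ElementTree as ET


def text_of(declared_encoding: str) -> str:
    """Parse a document whose only non-ASCII content is two character references."""
    doc = '<?xml version="1.0" encoding="%s"?><a>&#232;&#xE8;</a>' % declared_encoding
    # The document itself contains only ASCII bytes, so all three declarations are accurate.
    return ET.fromstring(doc.encode("ascii")).text


results = {enc: text_of(enc) for enc in ("US-ASCII", "ISO-8859-1", "UTF-8")}
for enc, text in results.items():
    print(enc, ascii(text))
```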


>  I have a table that says that &egrave; has
> a UTC code of #x00E8

To hopefully clear up your confusion with more correct terminology...

The predefined HTML entity named "egrave" has as its replacement text the actual character number E8 (hexadecimal) of the Universal Character Set (UCS): small Latin letter e with grave accent.

You can more or less think of entities as text macros, although every document or binary 'file' is on some level an entity, so it's not a perfect analogy. Please try to distinguish between a named "entity reference" and a numeric "character reference" though. Then you can get creative and say "character entity reference" when you mean things like "&egrave;" so long as the egrave entity's replacement text is a single character.
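To see the named entity and the numeric character reference land on the same character, one can consult the HTML entity table that ships with Python's standard library. This is only an illustration of the HTML definitions; it says nothing about which entities a given XML DTD declares.

```python
from html import unescape
from html.entities import name2codepoint

# The HTML entity "egrave" is defined as code point E8 (hexadecimal).
egrave_code_point = name2codepoint["egrave"]

# A named entity reference and the two forms of numeric character reference.
from_entity = unescape("&egrave;")
from_decimal = unescape("&#232;")
from_hex = unescape("&#xE8;")

print(hex(egrave_code_point), ascii(from_entity), ascii(from_decimal), ascii(from_hex))
```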

The UCS is the normative basis of SGML, HTML, and XML, and is defined by ISO/IEC 10646, the international standard that assigns numbers to the idea of nearly every character used in nearly every written language script on the planet. This standard is often informally referred to as Unicode because it is developed in tandem with and shares its character assignments with The Unicode Standard, a more thorough but perhaps less political publication that does not fall under the ISO's jurisdiction.

UTC (what you said) means Greenwich/Zulu time zone, pretty much...

5.

Global parameters with UTF-8 characters and ???s

Mike Kay, Andrew Welch, David Carlisle




> I am having problems with global parameters which have UTF-8
> characters in them.  They show up as question marks when I
> use their values in the output (e.g. <xsl:value-of
> select="$global-parameter"/>).

Does the same problem affect the same character if it originates from a place other than the global parameter? For example, what happens when you do <xsl:value-of select="'&#x....'"/>?

If the problem occurs in this situation, then the two possible explanations are

(a) your output device isn't configured to display the character (no glyph in the chosen font)

(b) the software used to display the output (e.g. a text editor or a browser) doesn't know that the output is encoded in UTF-8.

If the problem doesn't occur in this situation, then the problem is with the contents of the parameter, which probably means it's something to do with the encoding of the resourceBundle.
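As a hedged illustration of where literal question marks tend to come from (not a diagnosis of the poster's setup): many platforms substitute "?" when asked to encode a character that the target encoding cannot represent, whereas reading bytes with the wrong encoding usually produces other wrong characters instead. The Python helper names below are made up for the example.

```python
def round_trip(text: str, encoding: str) -> str:
    """Encode with substitution, then decode: unencodable characters come back as '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


def misread(text: str, written_as: str, read_as: str) -> str:
    """Write text in one encoding and read the bytes back as another."""
    return text.encode(written_as).decode(read_as, errors="replace")


# The euro sign is not in ISO-8859-1, so it is replaced by '?'.
print(round_trip("caf\u00E9 \u20AC5", "iso-8859-1"))

# Nothing outside ASCII survives an ASCII conversion.
print(round_trip("caf\u00E9", "ascii"))

# UTF-8 bytes read as ISO-8859-1: no '?', but two wrong characters.
print(ascii(misread("caf\u00E9", "utf-8", "iso-8859-1")))
```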

Andrew adds

I've just noticed in your other post that you are using JSPs, in which case also ensure you set the pageEncoding in the page directive:

<%@ page pageEncoding="UTF-8" .....

David Carlisle wraps it up with

Most (but not all) encodings used today encode 0-9, a-z, A-Z using the same code points, so if you ensure that your file only has these (and some punctuation) then most of the time the file will work no matter what encoding it is specified as having. If I'm generating files that other people may put on web servers I usually try to always use us-ascii as the encoding (and if using xslt2, omit the encoding declaration so it will be taken as utf8, which is also correct as ascii files are also valid utf8 files).

If your file uses non-ascii characters then you need to declare the correct encoding. Most likely your files were in utf8 but your web server was declaring them to be iso-8859-1. You can check that by looking in your browser (view/character encoding in firefox, something similar in IE). If the file is displaying incorrectly but manually using the encoding menu to change the encoding makes the file display correctly, then it's almost certainly the fact that the server is specifying the wrong encoding to the browser. (Most web servers do _not_ look at the file to determine what encoding to specify; they just use a site or directory default encoding for that file type.)

Specifying US-ASCII is a good solution for some kinds of files in some work scenarios, but not all.

Advantages:

* when it works, it works, and is very simple to do.

Disadvantages:

* the encoding is rather inefficient: an e-acute is one byte in iso-8859-1, 2 bytes in utf8 but at least 6 bytes (&#xe9;) in us-ascii. So if your file is English with the occasional non-breaking space or currency symbol, it's not too bad to have the occasional character encoded this way, but in some languages your file is 5 or 6 times larger.

* You cannot use the mechanism at all if non-ascii characters are used in places where the &# notation is not available, so if any such characters appear in comments, or in processing instructions, or in element or attribute names, this is not an option at all.
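The size trade-off is easy to measure. The Python sketch below approximates what a serializer does for us-ascii output by using the standard `xmlcharrefreplace` error handler, which writes decimal character references; the helper name `encoded_size` is made up, and the sample strings are only examples.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Byte count of text in the given encoding, using &#...; references for us-ascii."""
    if encoding == "us-ascii":
        return len(text.encode("ascii", errors="xmlcharrefreplace"))
    return len(text.encode(encoding))


e_acute = "\u00E9"
for enc in ("iso-8859-1", "utf-8", "us-ascii"):
    print(enc, encoded_size(e_acute, enc))

# Mostly-ASCII text barely grows; text with no ASCII letters grows a lot.
mostly_ascii = "The caf\u00E9 is open."
greek_word = "\u03B5\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC"
print(encoded_size(mostly_ascii, "utf-8"), encoded_size(mostly_ascii, "us-ascii"))
print(encoded_size(greek_word, "utf-8"), encoded_size(greek_word, "us-ascii"))
```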