1. Translate for hex values
2. Translate usage
3. How to remove all non-alphanumerics?
4. Character mapping


Translate for hex values

Mike Brown and David Carlisle

Is it possible to specify a hexadecimal value (00-ff) as the third argument to the translate function?

Well, you can't refer to bytes, per se, because XPath and XSLT functions operate on the data model prescribed in the XPath spec. This model does not deal with bytes; it deals with nodes, and (following the reference to the XML Infoset spec) at a lower level, with Unicode/UCS characters, regardless of how they are represented when serialized as a byte stream.

Since an XML parser will resolve character references before the stylesheet tree is built, you should be able to put any Unicode characters in the arguments to the translate() function. For example, translate($someString,'abcde','ABCDE') is the same as translate($someString,'&#x61;&#x62;&#x63;&#x64;&#x65;','&#x41;&#x42;&#x43;&#x44;&#x45;').
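As a rough sketch of the point above, the following stylesheet passes characters to translate() as hexadecimal character references. The two characters replaced here (U+00A0 and U+00E9) are assumptions chosen only for illustration; they are not from the original question.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<!-- Illustrative only: the characters chosen here are assumptions. -->
<xsl:template match="/">
  <!-- U+00A0 (no-break space) becomes U+0020 (space);
       U+00E9 (e with acute) becomes a plain e.
       The parser resolves the references before XPath sees the strings. -->
  <xsl:value-of select="translate(., '&#xA0;&#xE9;', '&#x20;e')"/>
</xsl:template>

</xsl:stylesheet>
```

The hexadecimal values name Unicode code points, not bytes, so the same stylesheet behaves identically whatever encoding the input document uses.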

If your goal is to output bytes that represent characters from the Windows CP-1252 (or whatever) character set, you'll need to determine what the proper Unicode/UCS code points are for those characters, and then rely on your serialization mechanism to encode the characters as the bytes you want. It may take some experimentation depending on your particular situation.
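A minimal sketch of that approach, assuming your processor's serializer accepts the encoding name windows-1252 (XSLT 1.0 processors are only required to support UTF-8 and UTF-16). The '#' placeholder character is hypothetical.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Ask the serializer for CP-1252. Support for this encoding name
     is an assumption; check your processor's documentation. -->
<xsl:output method="text" encoding="windows-1252"/>

<xsl:template match="/">
  <!-- Map a hypothetical placeholder '#' to U+20AC (euro sign).
       In CP-1252 the euro sign is encoded as the single byte 0x80. -->
  <xsl:value-of select="translate(., '#', '&#x20AC;')"/>
</xsl:template>

</xsl:stylesheet>
```

Note that the reference used is &#x20AC;, the Unicode code point, and not &#x80;, the byte value you want to see in the output.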

David C adds:

> Something special about null character?

Yes, it's not allowed in XML. But then neither is character 1 (not checking for disallowed characters is a known feature of the IE parser, if I recall).

The XML Char production is:

<prod id="NT-Char"><lhs>Char</lhs> 
<rhs>#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] 
| [#x10000-#x10FFFF]</rhs> 
<com>any Unicode character, excluding the
surrogate blocks, FFFE, and FFFF.</com> </prod>

In other words, nothing in the ASCII control range except tab (#x9), line feed (#xA) and carriage return (#xD).


Translate usage

Steve Tinney

I needed a function that detects whether any of the characters in one string occur in another string. As often with XSL, a few moments' thinking brought the realization that there is a way to do it:

   <xsl:if test=". = translate(., '.,:;','')">

Instead of testing the string to see if it contains any characters from the other string directly, you use translate() to strip out any occurrences of the query-set from (a copy of) the original string, and then compare it to the original to see if they're the same. If they're not, you know the string contains characters from the translated-out set. More complicated to describe than use, probably.
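The comparison can be sketched as a complete stylesheet along these lines. The element name item is hypothetical, and the output wording is only for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<!-- The element name "item" is hypothetical. -->
<xsl:template match="item">
  <xsl:value-of select="."/>
  <xsl:choose>
    <!-- Stripping the query set changed nothing: none were present. -->
    <xsl:when test=". = translate(., '.,:;', '')">
      <xsl:text>: no punctuation&#10;</xsl:text>
    </xsl:when>
    <xsl:otherwise>
      <xsl:text>: contains punctuation&#10;</xsl:text>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<!-- Suppress text that is not inside an item element. -->
<xsl:template match="text()"/>

</xsl:stylesheet>
```

The test is true when the string contains none of the characters, so the "contains" case falls to xsl:otherwise.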

You can extend this principle to test, for example, if a string ends with a punctuation mark:

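<?xml version='1.0'?>
<xsl:stylesheet version="1.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="/">
  <xsl:call-template name="endswithpunct">
    <xsl:with-param name="str" select="'abc.'"/>
  </xsl:call-template>
  <xsl:call-template name="endswithpunct">
    <xsl:with-param name="str" select="'abc'"/>
  </xsl:call-template>
</xsl:template>

<xsl:template name="endswithpunct">
 <xsl:param name="str"/>
 <xsl:value-of select="$str"/><xsl:text> does </xsl:text>
 <!-- Take the last character; if it survives translate(), it is not punctuation. -->
 <xsl:if test="string-length(translate(substring($str, string-length($str)),
         '.,:;', '')) != 0">
   <xsl:text>not </xsl:text>
 </xsl:if>
 <xsl:text>end with punctuation mark.
</xsl:text>
</xsl:template>

</xsl:stylesheet>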



How to remove all non-alphanumerics?

Michael Kay

> My first thought was to use translate(), but the second 
> argument would have to contain every possible character I 
> want to remove, which seems pretty unwieldy. 

There's a trick to this:

translate($x, translate($x, 'abcde', ''), '')

will remove all characters except a,b,c,d, and e from your string $x.
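Applied to the original question, the same trick might look like the sketch below. Treating "alphanumeric" as the ASCII letters and digits only is an assumption; extend the keep list if you need accented or non-Latin letters.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<!-- The characters to keep; ASCII-only is an assumption. -->
<xsl:variable name="keep"
  select="'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'"/>

<xsl:template match="/">
  <xsl:variable name="x" select="string(.)"/>
  <!-- Inner call: everything in $x that is NOT in $keep.
       Outer call: delete exactly those characters from $x. -->
  <xsl:value-of select="translate($x, translate($x, $keep, ''), '')"/>
</xsl:template>

</xsl:stylesheet>
```

For a document whose text is "Hello, World! 42", the inner call yields the unwanted characters (comma, spaces, exclamation mark) and the outer call removes them, leaving "HelloWorld42".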


Character mapping

David Carlisle et al

For a troff-to-XML/Unicode conversion I've implemented a strategy that
produces the desired result, but that does the conversion to Unicode
slowly, and I would be grateful for advice about improving the efficiency.

I handle the conversion of the structural marked up XML first, and I
wind up with all of my XML tagging in place, but the text strings use
troff escape sequences, rather than Unicode. The text is almost all
medieval Cyrillic, and most of the Cyrillic characters are represented
in the troff with sequences of several ascii characters. The strategy I
adopted to convert the troff character encoding to Unicode was to create
a mapping file for the troff-to-Unicode character correspondences.
Here's a snippet (a single mapping correspondence):



Something like this would work. I haven't quite got the regexp right here, but it should show the idea: construct a regexp that matches any escape sequence; then, once you find one, look it up in a key constructed from the mapping table to see what the replacement is.

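<xsl:stylesheet version="2.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output encoding="US-ASCII"/>

<xsl:variable name="map" select="doc('pvl_mappings.xml')"/>
<xsl:key name="map" match="mapping" use="troff"/>

<xsl:template match="*">
 <xsl:copy>
   <xsl:copy-of select="@*"/>
   <xsl:apply-templates/>
 </xsl:copy>
</xsl:template>

<xsl:template match="text()">
 <!-- The regex is an assumed stand-in: it matches only "\(" plus two characters. -->
 <xsl:analyze-string select="." regex="\\\(..">
   <xsl:matching-substring>
     <!-- Use the mapped replacement if there is one, else the match itself. -->
     <xsl:value-of select="(key('map',.,$map)/unicode,.)[1]"/>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
       <xsl:value-of select="."/>
   </xsl:non-matching-substring>
 </xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>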
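The key declaration and the path key('map',.,$map)/unicode imply a mapping file in which each mapping element has a troff child and a unicode child. A hypothetical file of that shape is sketched below; the root element name and both replacement characters are invented for illustration and are not the real correspondences.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical pvl_mappings.xml: the root element name and both
     replacement characters are invented for illustration. -->
<mappings>
  <mapping>
    <troff>\(?s</troff>
    <unicode>&#x441;</unicode>
  </mapping>
  <mapping>
    <troff>\(?c</troff>
    <unicode>&#x446;</unicode>
  </mapping>
</mappings>
```

Because the key is indexed on the troff child, each lookup is a single key() call rather than a scan through the whole table.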


The OP asks

> There are two matches here: \(?s and \(?c . When my <xsl:choose> finds
> the first match (it's the first <xsl:when> within the <xsl:choose>),
> doesn't it just replace all instances of \(?s and then not read the rest
> of the <xsl:when> lines? That is, won't it fail to find the subsequent
> \(?c ?

Mike Kay responds

The xsl:matching-substring instruction is executed once for each match. So it's executed once to process \(?s, and once to process \(?c. In the first case, the first xsl:when fires. In the second case, the second xsl:when fires.
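A minimal sketch of that behaviour, assuming an XSLT 2.0 processor. The replacement strings [S] and [C] are hypothetical placeholders, not the real Unicode targets.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="text()">
  <!-- Matches a backslash, "(", "?" and then s or c. -->
  <xsl:analyze-string select="." regex="\\\(\?[sc]">
    <xsl:matching-substring>
      <!-- Runs once per match; "." is the matched substring. -->
      <xsl:choose>
        <!-- The replacements [S] and [C] are hypothetical placeholders. -->
        <xsl:when test=". = '\(?s'">[S]</xsl:when>
        <xsl:when test=". = '\(?c'">[C]</xsl:when>
      </xsl:choose>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>
```

For a text node containing a\(?sb\(?c, the matching-substring body runs twice: the first xsl:when fires for \(?s and the second for \(?c, giving a[S]b[C].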