noteParser is a piece of Java software that takes very simple markup and produces XML. The parser works on plain text input, with some of the syntax defined in an external text file, and feeds SAX2 events to the Saxon XSLT 1.0 engine, which can then transform the generated XML into any required XML output format.
I've Mike Kay to thank for this one. I'd skimmed the SAX2 book from O'Reilly and understood the basic outline of filters. What I'd clearly missed was the way in which a filter can be used in a pipeline. Mike Kay's XSLT 1.0 book, page 676, has an innocuous statement, along the lines of
Saxon -x GencomParser ......
In fact, the full command line to use this is
$java -cp .:saxon653.jar com.icl.saxon.StyleSheet -o op.xml \
-w1 -x noteParser <inputTextFile> <styleSheet>
Eventually, following the code Mike provides with his book, I came to the conclusion that the -x option (which I use almost daily, but to specify the Xerces XML parser) can also be used to supply a parser for input that is nothing like XML. Elsewhere on this site I have documented my progress using formatted plain text (elsewhere called markdown) and a couple of bits of Python to produce XML from it. Once I'd had the Eureka moment, I figured this was yet another way into XML, so I started to adapt Mike's code to process my enotes format (syntax), and made quite good progress.
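To make the idea concrete, here is a minimal sketch of a SAX2 reader whose input is not XML at all. It is not the noteParser source: the class name PlainTextReader, the wrapper element notes and the per-line element line are all invented for illustration, and it assumes the caller hands over a character stream. A reader driven from the Saxon command line would probably also have to open the file named by input.getSystemId(), which is left out here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch: a SAX2 XMLReader that reads plain text, not XML.
// XMLFilterImpl is used only because it already implements XMLReader.
class PlainTextReader extends XMLFilterImpl {

    // Quietly accept whatever features and properties the caller sets.
    @Override public void setFeature(String name, boolean value) { }
    @Override public boolean getFeature(String name) { return false; }
    @Override public void setProperty(String name, Object value) { }
    @Override public Object getProperty(String name) { return null; }

    @Override
    public void parse(InputSource input) throws IOException, SAXException {
        if (input.getCharacterStream() == null) {
            throw new IOException("this sketch only reads a character stream");
        }
        BufferedReader in = new BufferedReader(input.getCharacterStream());
        Attributes none = new AttributesImpl();
        startDocument();
        startElement("", "notes", "notes", none);   // assumed wrapper element
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;                            // skip blank lines
            }
            startElement("", "line", "line", none);  // assumed element name
            char[] text = line.toCharArray();
            characters(text, 0, text.length);
            endElement("", "line", "line");
        }
        endElement("", "notes", "notes");
        endDocument();
    }

    // Runs the reader over a string and records the events it fires.
    static List<String> events(String text) throws IOException, SAXException {
        final List<String> log = new ArrayList<>();
        PlainTextReader reader = new PlainTextReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override public void startElement(String uri, String local, String qName, Attributes a) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                log.add("text " + new String(ch, start, length));
            }
        });
        reader.parse(new InputSource(new StringReader(text)));
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String event : events("first line\n\nsecond line\n")) {
            System.out.println(event);
        }
    }
}
```

The point of the sketch is that whatever consumes the events only ever sees startElement, characters and endElement calls, so it has no way of telling that the source was never XML.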
The basics are that the parse routine eventually boils down to recognising block level elements. Each is indicated by a token, which runs from the newline through to the first whitespace that is not an end of line (eoln):
token Remainder of line to next eoln is the block level content.
The content then runs from that whitespace through to eoln. Block level elements are of two kinds. One nests, restricted to two levels, as seen in listItem. The other is of the para kind, with no nesting.
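The token and content rule can be sketched in a few lines of Java. This is only an illustration of the rule as described, not the noteParser code: the class BlockLines and its methods are made-up names, and the tokens used in the example (para, listItem) are assumed spellings, since the real ones live in the external syntax file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of block level recognition: the token runs from the
// start of the line to the first whitespace, the rest of the line is content.
final class BlockLines {

    // One recognised block: the leading token and the remainder of the line.
    static final class Block {
        final String token;
        final String content;

        Block(String token, String content) {
            this.token = token;
            this.content = content;
        }
    }

    // Splits a single line; returns null for a blank line.
    static Block split(String line) {
        if (line.trim().isEmpty()) {
            return null;
        }
        int i = 0;
        while (i < line.length() && !Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        // Everything before the whitespace is the token, everything after
        // it (up to the end of the line) is the block level content.
        return new Block(line.substring(0, i), line.substring(i).trim());
    }

    // Splits a whole text into blocks, one per non-blank line.
    static List<Block> parse(String text) {
        List<Block> blocks = new ArrayList<>();
        for (String line : text.split("\n")) {
            Block block = split(line);
            if (block != null) {
                blocks.add(block);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        for (Block b : parse("para Some text here\n\nlistItem First point\n")) {
            System.out.println(b.token + " | " + b.content);
        }
    }
}
```

Nesting is deliberately left out. Handling the two-level listItem case would mean remembering the enclosing block between lines, which this sketch does not attempt.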
Inlines are again of two kinds: those with a start and an end, such as emphasis, and those requiring attributes. So far I've only implemented the former. That is basically it for the parse. The recognition of these events is propagated through to the XSLT engine.
The XSLT engine then processes the events as though they came from an XML document, so I can do an identity transform to get the XML, or I can transform to some other vocabulary, which is a nice touch. The command line is shown below.
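A start and end inline can be turned into SAX events with a simple toggle. The sketch below is an assumption throughout: the real delimiters come from the external syntax file, so the asterisk marker, the element name emphasis and the class InlineScanner are invented here purely to show the shape of the technique.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: one marker character opens and closes an inline.
final class InlineScanner {

    // Walks the text, firing start/end events each time the marker is seen.
    static void emit(String text, char marker, String element, ContentHandler out)
            throws SAXException {
        Attributes none = new AttributesImpl();
        StringBuilder run = new StringBuilder();
        boolean open = false;
        for (char c : text.toCharArray()) {
            if (c != marker) {
                run.append(c);
                continue;
            }
            flush(run, out);
            if (open) {
                out.endElement("", element, element);
            } else {
                out.startElement("", element, element, none);
            }
            open = !open;
        }
        flush(run, out);
        if (open) {
            // Close an unterminated inline so the event stream stays balanced.
            out.endElement("", element, element);
        }
    }

    // Sends any pending text as a characters event.
    private static void flush(StringBuilder run, ContentHandler out) throws SAXException {
        if (run.length() == 0) {
            return;
        }
        char[] ch = run.toString().toCharArray();
        out.characters(ch, 0, ch.length);
        run.setLength(0);
    }

    // Records the events for a line, using '*' as the assumed marker.
    static List<String> trace(String text) throws SAXException {
        final List<String> log = new ArrayList<>();
        emit(text, '*', "emphasis", new DefaultHandler() {
            @Override public void startElement(String uri, String local, String qName, Attributes a) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                log.add("text:" + new String(ch, start, length));
            }
        });
        return log;
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(trace("some *important* words"));
    }
}
```

Inlines that carry attributes would need more than a toggle, because the attribute values have to be collected before startElement is fired. That is presumably why they are the harder of the two kinds.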
$java -cp .:saxon653.jar com.icl.saxon.StyleSheet \
-o op.xml -w1 -x noteParser notetest.txt identity.xsl
This uses the identity transform to produce a file, op.xml, using the tagset embedded in the code. I hope to extract the tagset to a separate class to make the code more general purpose. Included so far is simple block level processing as an extra class. Comments welcome.
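For anyone wanting to try the same idea without the Saxon command line, the following sketch pushes block events straight into an identity transform using the standard JAXP classes that ship with Java. It is not how noteParser itself is wired up: the class IdentityPipeline, the wrapper element notes and the habit of using the token directly as the element name are all assumptions, and the token must be a legal XML name for the output to be well formed.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: plain text lines become SAX events, which an
// identity TransformerHandler serialises as XML.
final class IdentityPipeline {

    static String toXml(String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        // No stylesheet argument means an identity transform.
        TransformerHandler handler = factory.newTransformerHandler();
        // Set output properties before attaching the result.
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "notes", "notes", none);   // assumed wrapper
        for (String line : text.split("\n")) {
            // Token up to the first whitespace, content after it.
            String[] parts = line.trim().split("\\s+", 2);
            if (parts[0].isEmpty()) {
                continue;                                    // blank line
            }
            String content = parts.length > 1 ? parts[1] : "";
            handler.startElement("", parts[0], parts[0], none);
            handler.characters(content.toCharArray(), 0, content.length());
            handler.endElement("", parts[0], parts[0]);
        }
        handler.endElement("", "notes", "notes");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("para Hello world\npara Second paragraph\n"));
    }
}
```

To transform to another vocabulary instead, the same factory offers newTransformerHandler(Source), which takes a stylesheet, for example a javax.xml.transform.stream.StreamSource pointing at an XSLT file.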
Until further notice, this is the Java source, as a zip file. Also included is an input test file and the two files defining the simple block and inline markup.
I need to express my thanks to Wiley, Mike's publisher, for allowing me to modify and publish the code.