Schema questions

1. What are the benefits of schema-aware XSLT processors?


Michael Kay

The benefits fall into two categories: robustness and optimization. Optimization is still a theoretical, speculative benefit, so I'll concentrate on robustness.

You can argue the case in high-level abstract terms, or with low-level coding examples. Let's try a bit of each.

At the abstract level, stylesheets are written with knowledge of the input and output schemas, but at the moment this knowledge lives only in the programmer's head and isn't shared with the compiler. So when the programmer makes a mistake, whether by misreading the schema or because the schema has changed, the compiler cannot detect it. It's good software engineering discipline to describe the inputs and outputs of a component (the preconditions and postconditions), and this applies to XSLT as much as anything else. The more complex the schema becomes (and some industrial schemas are very complex indeed), the harder it is for the programmer to keep everything in their head. In addition, it's very hard to achieve 100% test coverage. Many schemas contain parts that are only rarely used, but if you want a production-quality stylesheet you need confidence that it can handle everything that will be thrown at it.

At a practical level there's no doubt that debugging and testing XSLT stylesheets is currently rather difficult, and most of us don't do it very rigorously. We tend to test a stylesheet on a rather small sample of input documents, and we check the output visually to see if it looks OK, perhaps running a few sample outputs through a schema checker if we're being conscientious. When we get things wrong it can be very hard to spot where the trouble is, especially if it's in code written by someone else a while back when the schema was rather different from the way it is now.

I like to demonstrate this by taking a correct stylesheet and introducing random errors. Without a schema, these errors produce bad output that can be very difficult to spot: in one example I can cite, out-of-range numbers were not being highlighted as they should have been, and no one noticed. Make the same error with a schema-aware processor and you get an explicit error message telling you exactly what's wrong.

If you can define the schema for your input and output documents and make this known to the XSLT processor, this can make a big difference to the development cycle. In practical terms, the biggest benefit I've seen is from integrated validation of the result document: if your stylesheet tries to write invalid output, you get an error message pinpointing exactly where the error in your stylesheet is, rather than 300 identical errors from the schema processor telling you that the output is wrong, which you probably knew anyway. At present with Saxon this validation is mainly done at run-time, but more and more of it will be done at compile time, which means that you even get to know about errors in code that hasn't been executed because you haven't written the test data for it yet.
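
To make the idea of integrated result validation concrete, here is a minimal sketch in XSLT 2.0. It assumes a schema-aware processor, and everything specific in it is invented for illustration: the inv namespace, the invoice.xsd location, and the element and attribute names are hypothetical placeholders, not taken from any real schema.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:inv="http://example.com/ns/invoice">

  <!-- Hypothetical output schema: namespace and location are placeholders. -->
  <xsl:import-schema namespace="http://example.com/ns/invoice"
                     schema-location="invoice.xsd"/>

  <xsl:template match="/">
    <!-- xsl:validation="strict" asks the processor to validate this
         result element against its declaration in the imported schema. -->
    <inv:invoice xsl:validation="strict">
      <xsl:apply-templates select="order/item"/>
    </inv:invoice>
  </xsl:template>

  <xsl:template match="item">
    <inv:line>
      <inv:quantity><xsl:value-of select="@qty"/></inv:quantity>
      <!-- If the assumed schema does not allow inv:colour inside inv:line,
           validation of the result fails here, inside the transformation,
           instead of in a separate schema-checking pass afterwards. -->
      <inv:colour><xsl:value-of select="@colour"/></inv:colour>
    </inv:line>
  </xsl:template>

</xsl:stylesheet>
```

Whether a given mistake is reported at run-time or at compile time depends on the processor and on how much it can infer statically, so treat this as an illustration of the mechanism rather than a guarantee about any particular product.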

I've yet to see such a big impact from using a schema for the input document, but I think the potential is there too. The biggest potential advantage is better reuse of stylesheet code, and better resilience to changes in the input schema, by driving your template rules from schema types rather than lexical patterns. This mainly applies to the kind of complex schemas with hundreds of element types. In such schemas there is usually some kind of type hierarchy, and if it is well-designed then you should be able to get the kind of benefits you see from object-oriented programming, by writing code that's generic or specific as the need arises.
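
As a rough sketch of what driving template rules from schema types can look like in XSLT 2.0, consider the following. The po namespace, the po.xsd location, and the type names AddressType and USAddressType are hypothetical; the sketch assumes that USAddressType is derived from AddressType in that schema and that the input document has been validated, so that its elements carry type annotations.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:po="http://example.com/ns/po">

  <!-- Hypothetical input schema: namespace and location are placeholders. -->
  <xsl:import-schema namespace="http://example.com/ns/po"
                     schema-location="po.xsd"/>

  <!-- Generic rule: matches any element, whatever its name, whose type is
       po:AddressType or a type derived from it. -->
  <xsl:template match="element(*, po:AddressType)">
    <div class="address">
      <xsl:value-of select="po:street, po:city" separator=", "/>
    </div>
  </xsl:template>

  <!-- Specific rule for the derived type. Both patterns have the same
       default priority, so an explicit priority is needed to make the
       more specific rule win. -->
  <xsl:template match="element(*, po:USAddressType)" priority="1">
    <!-- Reuse the generic rule, much like calling a superclass method. -->
    <xsl:next-match/>
    <div class="zip"><xsl:value-of select="po:zip"/></div>
  </xsl:template>

</xsl:stylesheet>
```

Because the rules match on type rather than on element name, a new element added to the schema with one of these types would be handled without touching the stylesheet, which is the kind of resilience to schema change described above.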

I hope that gives you something to think about!