Validation on the rack

Mark Baker's "Validation considered harmful" touched off a fun series of responses.

[C]onsider the scenario of two parties on the Web which want to exchange a certain kind of document. Party A has an expensive support contract with BigDocCo that ensures that they’re always running the latest-and-greatest document processing software. But party B doesn’t, and so typically lags a few months behind. During one of those lags, a new version of the schema is released which relaxes an earlier stanza in the schema which constrained a certain field to the values “1”, “2”, or “3”; “4” is now a valid value. So, party B, with its new software, happily fires off a document to A as it often does, but this document includes the value “4” in that field. What happens? Of course A rejects it; it’s an invalid document, and an alert is raised with the human [administrator], dramatically increasing the cost of document exchange. All because evolvability wasn’t baked in, because a schema was used in its default mode of operation: to restrict rather than permit.

Upon reading this I had two immediate reactions:

  1. Yep. Walter Perry was going on about all this sort of thing a long time ago, and the industry would be in a much saner place without, for example, crazy ideas such as WS-Kaleidoscope and the tight binding of documents to data records (read WXS and XQuery). For an example of how Perry absolutely skewered class-conscious XML using a scenario somewhat similar to Mark's, read this incisive post. To me the perils of bondage-and-discipline validation are akin to those of B&D datatyping. It's all one more example of the poor design that results when you follow twopenny Structured Programming too far and let early binding rule the cosmos.

  2. Yep. This is one of the reasons why once you use Schematron and actually deploy it in real-life scenarios where schema evolution is inevitable, you never feel sanguine about using a grammar-based schema language (not even RELAX NG) again.
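Mark's scenario can be sketched in a few lines of Python (element names are hypothetical; this is an illustration of the contrast, not anyone's actual system). The gatekeeper check rejects the whole v2 document up front over a field it never uses, while the tolerant consumer touches only the field it actually needs.

```python
import xml.etree.ElementTree as ET

# A v2 document: "4" is now a legal priority, but v1 schemas don't know that.
DOC = '<order><priority>4</priority><total>19.99</total></order>'

def gatekeeper_validate(doc):
    """Validation as gatekeeper: reject the document wholesale if any
    field falls outside the v1 enumeration, even fields we never use."""
    root = ET.fromstring(doc)
    if root.findtext('priority') not in ('1', '2', '3'):
        raise ValueError('invalid document')
    return root

def tolerant_process(doc):
    """Check only the value this consumer cares about, at the point of use."""
    root = ET.fromstring(doc)
    return float(root.findtext('total'))

# gatekeeper_validate(DOC) raises ValueError; tolerant_process(DOC) succeeds.
```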

Dare's response took me aback a bit.

The fact that you enforce that the XML documents you receive must follow a certain structure or must conform to certain constraints does not mean that your system cannot be flexible in the face of new versions. First of all, every system does some form of validation because it cannot process arbitrary documents. For example an RSS reader cannot do anything reasonable with an XBRL or ODF document, no matter how liberal it is in what it accepts. Now that we have accepted that there are certain levels of validation that are no-brainers the next question is to ask what happens if there are no constraints on the values of elements and attributes in an input document. Let's say we have a purchase order format which in v1 has an element which can have a value of "U.S. dollars" or "Canadian dollars" then in v2 we now support any valid currency. What happens if a v2 document is sent to a v1 client? Is it a good idea for such a client to muddle along even though it can't handle the specified currency format?

Dare is not incorrect, but I was surprised at his reading of Mark. When I considered it carefully, though, I realized that Mark did leave himself open to that interpretation by not being explicit enough. As he clarified in a comment to Dare:

The problem with virtually all uses of validation that I've seen is that this document would be rejected long before it even got to the bit of software which cared about currency. I'm arguing against the use of validation as a "gatekeeper", not against the practice of checking values to see whether you can process them or not ... I thought it goes without saying that you need to do that! 8-O

I actually think this is a misunderstanding that other readers might easily have, so I think it's good that Dare called him on it, and teased out the needed clarification. I missed it because I know Mark too well to imagine he'd ever go so far off in the weeds.

Of course the father of Schematron would have a response to reckon with in such debate, but I was surprised to find Rick Jelliffe so demure about Schematron. His formula:

schema used for validating incoming data only validates traceable business requirements

Is flash-bam-alakazam spot on, but somewhat understated. Most forms of XML validation do us a disservice by making us nit-pick every detail of what we can live with, rather than letting us make brief declarations of what we cannot live without. Yes, Schematron's phases provide a powerful mechanism for elegant modularization of rules and requirements, but long before you go that deep, Schematron sets you free by making validation an open rather than a closed operation. The gains in expressiveness thereby provided are near astonishing, and this is despite the fact that Schematron is a less terse schema language than DTD, WXS or RELAX NG.
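To illustrate the open/closed distinction, here is a toy sketch in Python (emphatically not a Schematron implementation; the rules and element names are invented). Open validation asserts only what we cannot live without, so unanticipated markup passes through unremarked:

```python
import xml.etree.ElementTree as ET

# Schematron-style assertions: (context path, test, failure message).
# These are hypothetical purchase-order rules, for illustration only.
RULES = [
    ('.//total', lambda e: float(e.text) > 0, 'total must be positive'),
    ('.', lambda e: e.find('customer') is not None, 'customer is required'),
]

def report(doc, rules):
    """Apply each assertion to each matching context node; collect failures.
    Anything the rules don't mention is simply ignored (open validation)."""
    root = ET.fromstring(doc)
    failures = []
    for path, test, message in rules:
        for node in root.findall(path):
            if not test(node):
                failures.append(message)
    return failures

doc = '<order><customer>B</customer><total>5.00</total><extra/></order>'
# The unanticipated <extra/> element raises no alarm: report(doc, RULES) == []
```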

Of course XML put us on the road to unnecessary bondage and discipline on day one when it made it so easy, and even a matter of recommendation, to top each document off with a lordly DTD. Despite the fact that I think Microformats are a rickety foundation for almost anything useful at Web scale, I am hopeful they will act as powerful incentive for moving the industry away from knee-jerk validation.

[Uche Ogbuji]

via Copia

Service Modeling Language

I'd long ago put up a very thick lens for looking at any news from the SOA space. With analysts and hungry vendors flinging the buzzword around in a mindless frenzy, it got to the point where barely one out of twenty bits of information using the term was anything but pure drivel. I do believe there is some substance to SOA, but it's definitely veiled in a thick cloud of the vapors. This week Service Modeling Language caught my eye through said thick lens, and I think it may be one of the more interesting SOA initiatives to emerge.

One problem is that the SML blurbs and the SML spec seem to have little substantive connection. It's touted as follows:

The Service Modeling Language (SML) provides a rich set of constructs for creating models of complex IT services and systems. These models typically include information about configuration, deployment, monitoring, policy, health, capacity planning, target operating range, service level agreements, and so on.

The second sentence reads at first glance as if it's some form of ontology of systems management, basically an actual model, rather than a modeling language. No big deal. I see modeling languages and actual models conflated all the time. Then I catch the "typically", and read the spec, and it becomes evident that SML has much more to it than "models of complex IT services and systems". It's really a general-purpose modeling language. It builds on a subset of WXS and a subset of ISO Schematron, adding a handful of useful data modeling constructs such as sml:unique and sml:acyclic (the latter is subtle, but experienced architects know how important identifying cyclic dependencies is to risk assessment).
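For a sense of what a constraint like sml:acyclic has to check, here is a minimal cycle-detection sketch (my own illustration; SML itself declares the constraint over document references rather than implementing it this way):

```python
# Detect whether a set of references (node -> referenced nodes) contains
# a cycle, via depth-first search with three-color marking.
def has_cycle(edges):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / done
    color = {n: WHITE for n in edges}

    def visit(n):
        color[n] = GRAY
        for m in edges.get(n, []):
            if color.get(m, WHITE) == GRAY:    # back edge: cycle found
                return True
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(edges))

# has_cycle({'a': ['b'], 'b': ['c'], 'c': ['a']}) -> True
# has_cycle({'a': ['b'], 'b': ['c'], 'c': []})    -> False
```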

I'm still not sure I see the entire "story" as it pertains to services/SOA and automation. I guess if I use my imagination I could divine that an architect publishes a model of the IT needs for a service, and some management tool such as Tivoli or Unicenter generates reports on a system to flag issues or to assess compatibility with the service infrastructure needs. (I'm not sure whether this would be a task undertaken during proposal assessment, systems development, maintenance, or all of the above.) I imagine all the talk of automation in SML involves how such reports would help reduce manual assessment in architecture and integration? But that can't be! Surely SML folks have learned the lessons of UDDI. Some assessment tasks simply cannot be automated.

SML shows the fingerprints of some very sharp folks, so I assume I'm missing something. I think that much more useful than the buzzword-laden blurbs for SML would be a document articulating some nice, simple use cases. Also, I think the SML spec should be split up. At present a lot of its bulk is taken up defining WXS and ISO Schematron subsets. It seems a useful profile to have, but it should be separated from the actual specification of SML modeling primitives.

[Uche Ogbuji]

via Copia

Schematron creeping on the come-up (again)

Schematron is the boss XML schema language your boss has never heard of. Unfortunately it's had some slow times of late, but it's surged back with a vengeance thanks to honcho Rick Jelliffe with logistical support from Betty Harvey. There's now a working mailing list and a Wiki. Rick says that Schematron is slated to become an ISO Standard next month.

The text for the Final Draft International Standard for Schematron has now been approved by multi-national voting. It is copyright ISO, but it is basically identical to the draft at

The standard is 30 pages. 21 are normative, including schema listings and a characterization of Schematron semantics in predicate logic. Appendixes show how to use query language bindings other than XSLT1, how to use Schematron as a vocabulary, how to express multi-lingual diagnostics, and a simple description of the design requirements for ISO Schematron.

Congrats to Rick. Here's to the most important schema language of them all (yes, I do mean that). I guess I'll have to check Scimitar, Amara's fast Schematron processor, for compliance to the updated draft standard.

[Uche Ogbuji]

via Copia


Interesting (as always) musings from Rick Jelliffe:

There has been good work on the theoretical classification of schema languages over the last two or three years.

My impression is that as soon as your schema language supports IDREF it is stuffed, from an NP POV: Schematron, XSD, RELAX NG, DTDs, the lot!

Theoretical classification is important for knowing what the characteristics of things are, and what pathological problems implementations may have to deal with. But people who reject one language or another on theoretical grounds alone, without considering their pragmatic value, need to have an alternative otherwise they are troll-like.

The emphasis is mine, marking the bit that caught my attention. So my guess is that in his theory IDREF==NPC (NP-complete). Interesting. I haven't done any formal analysis, but when I ponder it, I can imagine ID reference checking as similar to problems I know to be in P. I can't really think offhand of a similar, known NP-complete problem, but that doesn't mean much. It can be fairly subtle factors of a problem that establish its NP profile.
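For what it's worth, plain ID/IDREF integrity checking certainly looks linear-time: one pass to collect IDs, one pass to check references. A sketch (the attribute names id and ref here are stand-ins for attributes of DTD-declared ID and IDREF type):

```python
import xml.etree.ElementTree as ET

def idref_dangling(doc):
    """Return the IDREF values in doc that point to no declared ID.
    Two linear passes over the element tree: collect, then check."""
    root = ET.fromstring(doc)
    ids = {e.get('id') for e in root.iter() if e.get('id') is not None}
    return [e.get('ref') for e in root.iter()
            if e.get('ref') is not None and e.get('ref') not in ids]

doc = '<doc><a id="x"/><b ref="x"/><c ref="y"/></doc>'
# idref_dangling(doc) == ['y']  -- only the reference to "y" dangles
```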

It is worth noting, as Eric van der Vlist once had to remind me, that ID/IDREF integrity support is not mandated in RELAX NG validators. Another example of the far-sightedness of Mr. Clark, Murata-san and co? From "The Design of RELAX NG":

The RELAX NG TC spent a considerable amount of time considering what support RELAX NG should provide for enforcing identity (uniqueness and cross-reference) constraints. In the end, the conclusion was that identity constraints were better separated out into a separate specification. Accordingly, RELAX NG itself provides no support for identity constraints. RELAX NG DTD Compatibility [12] provides support for traditional XML ID/IDREF attributes. There were a number of reasons for preferring separation.

One reason is the relative difference in maturity. RELAX NG is based on finite tree automata; this is an area of computer science that has been studied for many years and is accordingly mature and well understood. The use of grammars for specifying document structures is based on more than 15 years of practical experience. By contrast, the area of identity constraints (beyond simple ID/IDREF constraints) is much less mature and is still the subject of active research.

Another reason is that it is often desirable to perform grammar processing separately from identity constraint processing. For example, it may be known that a particular document is valid with respect to a grammar but not known that it satisfies identity constraints. The type system of the language that was used to generate a document may well be able to guarantee that it is valid with respect to the grammar; it is unlikely that it will be able to guarantee that it satisfies the identity constraints. A document assembled from a number of components may be guaranteed to be valid with respect to a grammar because of the validity of the components, but this will often not be the case with identity constraints. Even when a document is known to satisfy the identity constraints as well as be valid with respect to the grammar, it may be necessary to perform identity constraint processing in order to allow application programs to follow references.
Another reason is that no single identity constraint language is suitable for all applications. Different applications have identity constraints of vastly different complexity. Some applications have complex constraints that span multiple documents [22]. Other applications need only a modest increment on the XML ID/IDREF mechanism. A solution that is sufficient for those applications with complex requirements is likely to be overkill for those applications with simpler requirements.

Well reasoned, like almost everything in RELAX NG (and Schematron).

[Uche Ogbuji]

via Copia

Small fix to atom.rnc, and what about xml:space?

RobertBachmann stopped by #atom to mention that he'd tried to validate an Atom file against the non-normative RELAX NG schema for the Atom RFC draft (I haven't seen an RNC for the final RFC itself). It failed because he used xml:lang on an atom:name child of atom:author. This contradicts the Atom spec, which says:

Any element defined by this specification MAY have an xml:lang attribute, whose content indicates the natural language for the element and its descendents.

The RNC did not specify this attribute in a couple of cases. The RNC is non-normative, but in this case there is no reason for divergence from the spec. I whipped up an atom.rnc that fixes the bug. Here's the diff from the version I found on-line.

This set off a discussion between Anne van Kesteren and me. I feel that xml:lang only makes sense for some Atom elements, and that perhaps allowing it on all of them could be confusing. What, for example, does it mean to have xml:lang on the atom:uri child of atom:author? I suppose an outlandish (pun intended) interpretation could be references to localized sites, but that's really the province of the likes of XHTML's hreflang attribute. Moreover, I'm a bit puzzled by this bit from the Atom spec, which seems to support my leaning:

The language context is only significant for elements and attributes declared to be "Language-Sensitive" by this specification.

So if it's not significant, why allow it? I think maybe there should have been a split in attribute sets between atomCommonAttributes and an atomCommonLanguageSensitiveAttributes, where the former would omit xml:lang.

Also, I'm used to the convention where xml:lang is used with content models that allow a language-sensitive element to be repeated, providing for multiple language versions in the same document. There are many cases in Atom where this would not be possible. For example, you could not have an English atom:title and a French one within the same atom:entry element. You could get tricky by using a single atom:entry with type="xhtml" and multiple language versions within the xhtml:div, but this feels a bit constricting.

Anne doesn't mind xml:lang everywhere, and pointed out that xml:lang="" is an option for specifying no language context (rather than language context inherited from parent). I think in the end I could go either way on xml:lang everywhere.
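The inheritance semantics, including the xml:lang="" escape hatch Anne mentioned, can be sketched as follows (a hypothetical helper of my own, not part of any Atom library):

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace URI.
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

def effective_langs(doc):
    """Map each element's tag to its effective language context.
    xml:lang inherits from ancestors; xml:lang="" clears the context."""
    root = ET.fromstring(doc)
    result = {}

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        if lang == '':           # xml:lang="" means "no language context"
            lang = None
        result[elem.tag] = lang
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return result

doc = ('<feed xml:lang="en"><title>Hi</title>'
       '<id xml:lang="">urn:x</id></feed>')
# effective_langs(doc) == {'feed': 'en', 'title': 'en', 'id': None}
```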

This discussion also made me think of xml:space. This special attribute gets a mention right in the XML spec, but that doesn't mean it doesn't have to be addressed in XML applications. Even in the case of DTDs, the spec says

In valid documents, this attribute, like any other, must be declared if it is used.

The same goes for RELAX NG, the conventional schema language for Atom. There is no xml:space to be found in either the normative RFC or non-normative schema, but the rules for Atom undefinedAttribute do allow for this attribute (as well as xml:id and just about any other XML or 'global' attribute). I assume that the intention is for applications to treat this attribute using the suggested semantics in the XML 1.0 spec. I do wish Atom had been explicit about this as is, for example, the XSLT 1.0 spec.

[Uche Ogbuji]

via Copia

The uneven state of Schematron

It has been a sketchy time for Schematron fans. Rick Jelliffe, the father of Schematron, has had pressing matters that have prevented him from putting much work into Schematron for a while. Many questions still remain about the technology specification, and alas, there appears to be no place to keep discussions going on these and other matters (the mailing list on SourceForge appears to be defunct, with even the archives giving a 404). Here are some notes on recent Schematron developments I've come across.

I wasn't paying enough attention and I just came across the new Schematron Web site. Launched in February, it supersedes the old Academia Sinica page. Some of the content was copied without much editing from the older site. The overview says "The Schematron can be useful in conjunction with many grammar-based structure-validation languages: DTDs, XML Schemas, RELAX, TREX, etc.", but RELAX and TREX were combined into RELAX NG years ago. Of greater personal interest is the fact that it carries over a bad link to my old Schematron/XSLT article. As I've corrected several times on the mailing list, that article is "Introducing the Schematron". The site also does not list two of my more recent articles. It does, however, include an entire page on ISO Schematron, including some sketchy updates I'm hoping to follow up on.

G. Ken Holman told me he created a modified version of the Schematron 1.5 reference XSLT implementation that allows the context of assertions to be attributes, not just elements. You can find his version linked from this message. I did point out to him that Scimitar (part of Amara) supports attributes as context, and overall attempts to be a fast and complete ISO Schematron implementation.

[Uche Ogbuji]

via Copia

"Tip: Use the right pattern for simple text in RELAX NG"

The RELAX NG XML schema language allows you to say "permit some text here" in a variety of ways. Whether you're writing patterns for elements or attributes, it is important to understand the nuances between the different patterns for character data. In this tip, Uche Ogbuji discusses the basic foundations for text in RELAX NG.

Several times while working on RELAX NG in mentoring roles with clients I've had to explain some of the nuances in the various ways to express simple text patterns. In this article I lay out some of the most common distinctions I've had to make. I should say that much of what I know about RELAX NG nuances I learned from Eric van der Vlist and a lot of that wisdom is gathered in his RELAX NG book (in print or online). I recommend the print book because it has some nice additions not in the online version, and because Eric deserves to eat.

[Uche Ogbuji]

via Copia