Validation on the rack

Mark Baker's "Validation considered harmful" touched off a fun series of responses.

[C]onsider the scenario of two parties on the Web which want to exchange a certain kind of document. Party A has an expensive support contract with BigDocCo that ensures that they’re always running the latest-and-greatest document processing software. But party B doesn’t, and so typically lags a few months behind. During one of those lags, a new version of the schema is released which relaxes an earlier stanza in the schema which constrained a certain field to the values “1”, “2”, or “3”; “4” is now a valid value. So, party B, with its new software, happily fires off a document to A as it often does, but this document includes the value “4” in that field. What happens? Of course A rejects it; it’s an invalid document, and an alert is raised with the human [administrator], dramatically increasing the cost of document exchange. All because evolvability wasn’t baked in, because a schema was used in its default mode of operation; to restrict rather than permit.

Upon reading this I had two immediate reactions:

  1. Yep. Walter Perry was going on about all this sort of thing a long time ago, and the industry would be in a much saner place, without, for example, crazy ideas such as WS-Kaleidoscope and tight binding of documents to data records (read WXS and XQuery). For an example of how Perry absolutely skewered class-conscious XML using a scenario somewhat similar to Mark's, read this incisive post. To me the perils of bondage-and-discipline validation are the same as those of B&D datatyping. It's all one more example of the poor design that results when you follow twopenny Structured Programming too far and let early binding rule the cosmos.

  2. Yep. This is one of the reasons why once you use Schematron and actually deploy it in real-life scenarios where schema evolution is inevitable, you never feel sanguine about using a grammar-based schema language (not even RELAX NG) again.

Dare's response took me aback a bit.

The fact that you enforce that the XML documents you receive must follow a certain structure or must conform to certain constraints does not mean that your system cannot be flexible in the face of new versions. First of all, every system does some form of validation because it cannot process arbitrary documents. For example an RSS reader cannot do anything reasonable with an XBRL or ODF document, no matter how liberal it is in what it accepts. Now that we have accepted that there are certain levels validation that are no-brainers the next question is to ask what happens if there are no constraints on the values of elements and attributes in an input document. Let's say we have a purchase order format which in v1 has a element which can have a value of "U.S. dollars" or "Canadian dollars" then in v2 we now support any valid currency. What happens if a v2 document is sent to a v1 client? Is it a good idea for such a client to muddle along even though it can't handle the specified currency format?

Dare is not incorrect, but I was surprised at his reading of Mark. When I considered it carefully, though, I realized that Mark did leave himself open to that interpretation by not being explicit enough. As he clarified in a comment to Dare:

The problem with virtually all uses of validation that I've seen is that this document would be rejected long before it even got to the bit of software which cared about currency. I'm arguing against the use of validation as a "gatekeeper", not against the practice of checking values to see whether you can process them or not ... I thought it goes without saying that you need to do that! 8-O

I actually think this is a misunderstanding that other readers might easily have, so I think it's good that Dare called him on it and teased out the needed clarification. I missed it because I know Mark too well to imagine he'd ever go so far off into the weeds.
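Mark's distinction between gatekeeping and checking a value at the point where it matters can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the purchase order shape, the `currency`, `id` and `shipTo` element names, and the `gatekeeper` and `process` functions are all hypothetical, and none of them come from Mark's or Dare's posts.

```python
import xml.etree.ElementTree as ET

# The two currencies a v1 receiver knows about, per Dare's example.
V1_CURRENCIES = {"U.S. dollars", "Canadian dollars"}

# Hypothetical v2 purchase order: the currency is outside the v1 list.
ORDER = """<order>
  <id>42</id>
  <shipTo>Boulder, CO</shipTo>
  <currency>Euros</currency>
</order>"""


def gatekeeper(xml_text):
    """Validate up front: one unknown value rejects the whole document."""
    doc = ET.fromstring(xml_text)
    if doc.findtext("currency") not in V1_CURRENCIES:
        raise ValueError("invalid document")
    return doc


def process(xml_text):
    """Let each step check only the values it actually needs."""
    doc = ET.fromstring(xml_text)
    results = {"shipping": "scheduled for " + doc.findtext("shipTo")}
    currency = doc.findtext("currency")
    if currency in V1_CURRENCIES:
        results["billing"] = "invoiced in " + currency
    else:
        # Only the billing step cares about currency, so only it declines.
        results["billing"] = "held: cannot handle " + currency
    return results
```

With the gatekeeper, nothing downstream ever sees the order. With `process`, the shipping step proceeds and only the step that cares about currency refuses, which is the kind of value check Mark says goes without saying.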

Of course the father of Schematron would have a response to reckon with in such a debate, but I was surprised to find Rick Jelliffe so demure about Schematron. His formula:

schema used to validating incoming data only validates traceable business requirements

is flash-bam-alakazam spot on, but somewhat understated. Most forms of XML validation do us a disservice by making us nit-pick every detail of what we can live with, rather than letting us make brief declarations of what we cannot live without. Yes, Schematron's phases provide a powerful mechanism for elegant modularization of the expression of rules and requirements, but long before you go that deep Schematron sets you free by making validation an open rather than a closed operation. The gains in expressiveness thereby provided are near astonishing, and this is despite the fact that Schematron is a less terse schema language than DTD, WXS or RELAX NG.
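The difference between closed and open validation is easy to show in miniature. What follows is plain Python that mimics the two styles, not Schematron itself (which expresses its assertions as XPath tests in an XML schema document). The order document, the `giftWrap` element, and the `closed_validate` and `open_validate` functions are assumptions of mine for the sake of the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: an older receiver has never heard of <giftWrap>.
DOC = """<order>
  <id>42</id>
  <total>19.95</total>
  <giftWrap>yes</giftWrap>
</order>"""

KNOWN_CHILDREN = {"id", "total"}


def closed_validate(root):
    """Closed style: anything not explicitly permitted is an error."""
    return [f"unexpected element <{child.tag}>"
            for child in root if child.tag not in KNOWN_CHILDREN]


# Open style: (test, message) pairs in the spirit of Schematron assertions.
# Each rule states one thing we cannot live without and is silent otherwise.
RULES = [
    (lambda r: r.find("id") is not None, "an order must have an id"),
    (lambda r: r.find("total") is not None, "an order must have a total"),
]


def open_validate(root):
    """Return the message of every rule the document fails."""
    return [message for test, message in RULES if not test(root)]
```

Run against `DOC`, the closed check complains about the unfamiliar `giftWrap` element while the open check reports nothing, because both of its stated requirements are met. Drop the `id` element and the open check objects, since that is a requirement someone actually declared.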

Of course XML put us on the road to unnecessary bondage and discipline on day one when it made it so easy, and even a matter of recommendation, to top each document off with a lordly DTD. Despite the fact that I think Microformats are a rickety foundation for almost anything useful at Web scale, I am hopeful they will act as powerful incentive for moving the industry away from knee-jerk validation.

[Uche Ogbuji]

via Copia
2 responses
Validation is one of those tricky questions that comes into play.  In a typical B2B type scenario, where one is using a vertical industry standard specification, validation of the structure of the document is typically recommended.  Now how and when you do that validation really comes down to how much risk you can put up with.  Personally, I recommend validation against an industry standard schema on both the inbound and outbound points.  Too many tools still do not produce valid Schema when it is generated from a data binding framework, and unfortunately, that is the direction much of the industry has taken for better or worse.



I do think it makes sense to leverage tools like Schematron for enforcing business-specific requirements on an industry standard schema after the structure has been validated according to the industry specification.  I realize that micro formats and extensions are necessary for many scenarios, but this is also an issue that can and should be addressed at the level of the organizations.  In particular, B2B specifications like UBL, HR-XML, etc., take way too long to be developed and released.  Those specifications need to be released sooner and more often than they are.  Otherwise, users are forced to extend them, and this creates multiple one-off specifications.  End users of these specifications should also take their requirements and extensions back to those organizations so that they can be included in the specification.  If they don't, then it just perpetuates the problem.



Dealing with multiple extensions to an industry specification is a maintenance nightmare.  It isn't so bad if you are dealing with a small group of trading partners, but it gets worse the larger the group becomes.



Micro formats have their place, and validation has its place.  How much validation should occur, and what the minimum is, depends greatly on the intended application.  Knee-jerk validation does need to stop, but validation still should be done.

David,



You're spot on.  Thanks for your bit of clarification.  For sure industry standards are important to validate.  I think that represents the first layer of business rules that need to be traced in most practical scenarios.  I do think that there is no reason not to use Schematron as the validation tool for the industry standards themselves, but regardless of what schema mechanism is used, I agree that one should try to verify a contract using normative means whenever possible.



I do want to point out that it's not every industry standard that's too slow to update.  cXML has the opposite problem in that there have been far too many minor version updates, and this is the source of real pain in one of my current day job tasks.  I think that a more open approach to validation would have reduced a lot of this pain.



And finally, I agree strongly that you can never use something as lightweight as a Microformat to do the job of an industrial format.  Then again, I'm not even sure Microformats are always a good tool for lightweight tasks, either :-)