A discussion about the brokenness of W3C XML Schema (WXS) on XML-DEV turned interestingly to the topic of the limitations of XML data bindings. This thread crystallized into a truly bizarre subthread where we had Mike Champion and Paul Downey actually trying to argue that the silly WXS wart xsi:nil might be more important in XML than mixed content (honestly the arrogance of some of the XML gentry just takes my breath away). As usual it was Eric van der Vlist and Elliotte Harold patiently arguing common sense, and at one point Pete Cordell asked them:
How do you think a data binding app should handle mixed content? We lump a complex types mixed content into a string and stop there, which I don't think is ideal (although it is a common approach). Another approach could be to have strings in your language binding classes (in our case C++) interleaved with the data elements that would store the CDATA parts. Would this be better? Is there a need for both?
Of course as author of Amara Bindery, a Python data binding, my response to this is "it's easy to handle mixed content." Moving on in the thread he elaborates:
Being guilty of being a code-head (and a binding one at that - can it get worse!), I'm keen to know how you'd like us to make a better fist of it. One way of binding the example of "<p>This is <strong>very</strong> important</p>" might be to have a class structure that (with any unused elements ignored) looks like:-
    class p {
        string cdata1;  // = "This is "
        class strong strong;
        string cdata2;  // = " important"
    };
    class strong {
        string cdata1;  // = "very"
    };
as opposed to (ignoring the CDATA):
    class p {
        class strong strong;
    };
    class strong {
    };
or (lumping all the mixed text together):
    class p {
        string mixedContent;  // = "<p>This is <strong>very</strong> important</p>"
    };
Or do you just decide that binding isn't the right solution in this case, or a hybrid is required?
It looks to me like a problem of poor expressiveness in a statically, strongly typed language. Of course, static versus dynamic is a hot topic these days, and has been ever since the "scripting language" diss started to wear thin. But the simple fact is that Amara doesn't even blink at this, and needs a lot less superstructure:
    >>> from amara.binderytools import bind_string
    >>> doc = bind_string("<p>This is <strong>very</strong> important</p>")
    >>> doc.p
    <amara.bindery.p object at 0xb7bab0ec>
    >>> doc.p.xml()
    '<p>This is <strong>very</strong> important</p>'
    >>> doc.p.strong
    <amara.bindery.strong object at 0xb7bab14c>
    >>> doc.p.strong.xml()
    '<strong>very</strong>'
    >>> doc.p.xml_children
    [u'This is ', <amara.bindery.strong object at 0xb7bab14c>, u' important']
There's the magic. All the XML data is there; it uses the vocabulary of the XML itself in the object model (as expected for a data binding); and it maintains the full structure of the mixed content in a form that is very easy for the user to process. And if we ever decide we just want the content, unmixed, we can just use the usual XPath technique:
    >>> doc.p.xml_xpath(u"string(.)")
    u'This is very important'
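For readers without Amara at hand, the same interleaving idea can be sketched with nothing but Python's standard library. To be clear, this is not how Amara is implemented: `mixed_children` is a hypothetical helper name of my own, and the sketch assumes Python 3 and `xml.etree.ElementTree`, which keeps the text before the first child in `.text` and the text following each child in that child's `.tail`.

```python
import xml.etree.ElementTree as ET


def mixed_children(elem):
    """Return text runs and child elements interleaved in document order.

    Hypothetical helper for illustration only; it is not part of Amara
    or of ElementTree.
    """
    items = []
    if elem.text:
        items.append(elem.text)       # text before the first child
    for child in elem:
        items.append(child)           # the child element itself
        if child.tail:
            items.append(child.tail)  # text that follows this child
    return items


p = ET.fromstring("<p>This is <strong>very</strong> important</p>")
children = mixed_children(p)

# Show strings as-is and elements by tag name, for readability.
shape = [c if isinstance(c, str) else "<%s>" % c.tag for c in children]
print(shape)                   # ['This is ', '<strong>', ' important']

# Rough equivalent of the XPath string(.) call: concatenate all text.
print("".join(p.itertext()))   # This is very important
```

The list has the same shape as `xml_children` in the session above: a string, an element object, then another string, with nothing lost and nothing lumped together.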
So there. Mixed content easily handled. Imagine my disappointment at the despairing responses of Paul Downey and even Elliotte Harold:
Personally I'd stay away from data binding for use cases like this. Dealing with mixed content is hardly the only problem. You also have to deal with repeated elements, omitted elements, and order. Child elements just don't work well as fields. You can of course fix all this, but then you end up with something about as complicated as DOM.
Data binding is a plausible solution for going from objects and classes to XML documents and schemas; but it's a one-way ride. Going the other direction: from documents and schemas to objects and classes is much more complicated and generally not worth the hassle.
As I hope my Amara example shows, you do not need to end up with anything nearly as complex as DOM, and it's hardly a one-way ride. I think it should be made clear that a lot of the difficulties that seem to stem from Java's own limitations are not general XML processing problems, and thus I do not think they should properly inform a problem such as the emphasis of an XML schema language. In fact, I've [always argued]() that it's the very marrying of XML technology to the limitations of other technologies such as statically-typed OO languages and relational DBMSes that results in horrors such as WXS and XQuery. When designers focus on XML qua XML, as the RELAX NG folks did and the XPath folks did, for example, the results tend to be quite superior.
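To make the point about dynamic languages a little more concrete, here is a minimal sketch of how repeated and omitted elements can be handled without much machinery. It is emphatically not Amara's code or API: the `Bound` class, its `all` method, and the `order`/`item`/`note` vocabulary are all invented for illustration, and the sketch assumes Python 3 with the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


class Bound:
    """Hypothetical wrapper exposing child elements as attributes."""

    def __init__(self, elem):
        self._elem = elem

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        matches = self._elem.findall(name)
        if not matches:
            # Omitted element: behaves like any missing attribute.
            raise AttributeError(name)
        return Bound(matches[0])

    def all(self, name):
        """Every child element with this name, in document order."""
        return [Bound(e) for e in self._elem.findall(name)]

    def __str__(self):
        # String value of the element, as with XPath string(.).
        return "".join(self._elem.itertext())


doc = Bound(ET.fromstring("<order><item>a</item><item>b</item></order>"))

print(str(doc.item))                     # a
print([str(i) for i in doc.all("item")]) # ['a', 'b']
print(hasattr(doc, "note"))              # False
```

A repeated element is just a list, an omitted one is just a missing attribute, and order is whatever the document says it is. None of that requires a schema or a generated class hierarchy.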
Eric did point out Amara in the thread.
An interesting side note—a question about non-XHTML use cases of mixed content (one even needs to ask?!) led once again to mention of the most widely underestimated XML modeling problem of all time: the structure of personal names. Peter Gerstbach provided the reminder this time. I've done my bit in the past.