A discussion about the brokenness of W3C XML Schema
(WXS) on XML-DEV turned,
interestingly, to the limitations of XML data bindings.
This thread crystallized into a truly bizarre subthread where we had
Mike Champion and Paul Downey actually trying to argue that the silly
WXS wart xsi:nil
might be more important in XML than mixed content
(honestly the arrogance of some of the XML
gentry just takes my breath
away). As usual it was Eric van der Vlist and Elliotte Harold patiently
arguing common sense, and at one point Pete Cordell asked
them:
How do you think a data binding app should handle mixed content? We
lump a complex types mixed content into a string and stop there, which I don't
think is ideal (although it is a common approach). Another approach could
be to have strings in your language binding classes (in our case C++)
interleaved with the data elements that would store the CDATA parts. Would
this be better? Is there a need for both?
Of course, as author of Amara
Bindery, a Python data
binding, my response to this is "it's easy to handle mixed content."
Moving on in the thread, Cordell
elaborates:
Being guilty of being a code-head (and a binding one at that - can it
get worse!), I'm keen to know how you'd like us to make a better fist
of it. One way of binding the example of "<p>This is
<strong>very</strong> important</p>" might be to have a class structure
that (with any unused elements ignored) looks like:-
class p
{
    string cdata1; // = "This is "
    class strong strong;
    string cdata2; // = " important"
};
class strong
{
    string cdata1; // = "very"
};
as opposed to (ignoring the CDATA):
class p
{
    class strong strong;
};
class strong
{
};
or (lumping all the mixed text together):
class p
{
    string mixedContent; // = "<p>This is <strong>very</strong> important</p>"
};
Or do you just decide that binding isn't the right solution in this
case, or a hybrid is required?
It looks to me like a problem of poor expressiveness in a statically,
strongly typed language. Of course, static versus dynamic is a hot
topic these days, and has been since the "scripting language" diss
started to wear thin. But the simple fact is that Amara doesn't even
blink at this, and needs a lot less superstructure:
>>> from amara.binderytools import bind_string
>>> doc = bind_string("<p>This is <strong>very</strong> important</p>")
>>> doc.p
<amara.bindery.p object at 0xb7bab0ec>
>>> doc.p.xml()
'<p>This is <strong>very</strong> important</p>'
>>> doc.p.strong
<amara.bindery.strong object at 0xb7bab14c>
>>> doc.p.strong.xml()
'<strong>very</strong>'
>>> doc.p.xml_children
[u'This is ', <amara.bindery.strong object at 0xb7bab14c>, u' important']
There's the magic. All the XML data is there; it uses the vocabulary of
the XML itself in the object model (as expected for a data binding); it
maintains the full structure of the mixed content in a very easy way for
the user to process. And if we ever decide we just want the content,
unmixed, we can use the usual XPath technique:
>>> doc.p.xml_xpath(u"string(.)")
u'This is very important'
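For readers without Amara at hand, the same two ideas (an ordered list of text runs interleaved with child elements, plus the flattened string value) can be roughly approximated with the Python standard library. This is only a sketch using xml.etree.ElementTree, not how Amara is implemented; the helper name mixed_children is my own invention for illustration.

```python
import xml.etree.ElementTree as ET


def mixed_children(elem):
    """Return text runs and child elements in document order.

    Hypothetical helper: ElementTree stores mixed content as the parent's
    .text plus each child's .tail, so we stitch those back together.
    """
    children = []
    if elem.text:
        children.append(elem.text)
    for child in elem:
        children.append(child)
        if child.tail:
            children.append(child.tail)
    return children


p = ET.fromstring("<p>This is <strong>very</strong> important</p>")
parts = mixed_children(p)

# Show strings as-is and elements by tag name, for readability.
kinds = [part if isinstance(part, str) else "<%s>" % part.tag for part in parts]
print(kinds)                   # ['This is ', '<strong>', ' important']

# Rough equivalent of the XPath string(.) value of the element.
print("".join(p.itertext()))   # This is very important
```

The result is less pleasant than the Amara version, since the element names do not become attributes, but it shows that interleaving text and elements needs nothing more exotic than a list.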
So there. Mixed content easily handled. Imagine my disappointment at the despairing responses of Paul Downey and
even Elliotte Harold:
Personally I'd stay away from data binding for use cases like this.
Dealing with mixed content is hardly the only problem. You also have to
deal with repeated elements, omitted elements, and order. Child elements
just don't work well as fields. You can of course fix all this, but then
you end up with something about as complicated as DOM.
Data binding is a plausible solution for going from objects and classes
to XML documents and schemas; but it's a one-way ride. Going the other
direction: from documents and schemas to objects and classes is much
more complicated and generally not worth the hassle.
As I hope my Amara example shows, you do not need to end up with
anything nearly as complex as DOM, and it's hardly a one-way ride. I
think it should be made clear that a lot of the difficulties that seem
to stem from Java's own limitations are not general XML processing
problems, and thus I do not think they should properly inform a problem
such as the emphasis of an XML schema language. In fact, I've always
argued that it's the very marrying of XML technology to the
limitations of other technologies such as statically-typed OO languages
and relational DBMSes that results in horrors such as WXS and XQuery.
When designers focus on XML qua XML, as the RELAX NG folks did and the
XPath folks did, for example, the results tend to be quite superior.
Eric did point out
Amara in
the thread.
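To make the point about repeated elements, omitted elements, and order a bit more concrete, here is a toy dynamic binding in a few lines of Python. It is a sketch under my own assumptions, built on the standard library's xml.etree.ElementTree; the class name Bound and its all() method are hypothetical and are not Amara's API or implementation.

```python
import xml.etree.ElementTree as ET


class Bound:
    """Hypothetical wrapper that exposes child elements as attributes."""

    def __init__(self, elem):
        self._elem = elem

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        matches = self._elem.findall(name)
        if not matches:
            # Omitted element: behaves like any missing Python attribute.
            raise AttributeError(name)
        # Repeated element: attribute access yields the first occurrence.
        return Bound(matches[0])

    def all(self, name):
        """Every child element with this name, in document order."""
        return [Bound(e) for e in self._elem.findall(name)]

    def __str__(self):
        return "".join(self._elem.itertext())


doc = Bound(ET.fromstring(
    "<order><item>tea</item><note>rush</note><item>milk</item></order>"))

print(str(doc.item))                       # tea
print([str(i) for i in doc.all("item")])   # ['tea', 'milk']
print(hasattr(doc, "discount"))            # False
print([c.tag for c in doc._elem])          # ['item', 'note', 'item']
```

Nothing here requires declaring fields up front, which is where I suspect much of the pain in the statically typed bindings comes from.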
An interesting side note—a question about non-XHTML use cases of
mixed content (one even needs to ask?!) led once again to mention of the
most widely underestimated XML modeling problem of all time: the
structure of personal names. Peter
Gerstbach
provided the reminder this time. I've done my
bit in
the past.
[Uche Ogbuji]