Learn how to invent XML languages, then do so

There has been a lot of chatter about Tim Bray's piece "Don’t Invent XML Languages". Good. I'm all for anything that makes people think carefully about XML language design and problems of semantic transparency (communicated meaning of XML structure). I'm all for it even though I generally disagree with Tim's conclusions. Here are some quick thoughts on Tim's essay, and some of the responses I've seen.

Here’s a radical idea: don’t even think of making your own language until you’re sure that you can’t do the job using one of the Big Five: XHTML, DocBook, ODF, UBL, and Atom.—Bray

This is a pretty biased list, and happens to make sense for the circles in which he moves. Even though I happen to move in much the same circles, the first thing I'd say is that there could hardly ever be an authoritative "big 5" list of XML vocabs. There is too much debate and diversity, and that's too good a thing to sweep under the rug. MS Office XML or ODF? OAGIS or UBL? RSS 2.0 or Atom? Sure, I happen to plump for the latter of each pair, as Tim does, but things are not so clear-cut for the average punter. (I didn't mention TEI versus DocBook because that's much less of a head-to-head battle.)

I made my own list in "A survey of XML standards: Part 3—The most important vocabularies" (IBM developerWorks, 2004). It goes:

  • XHTML
  • DocBook
  • XSL-FO
  • SVG
  • VoiceXML
  • MathML
  • SMIL
  • RDF
  • XML Topic Maps

And in that article I admit I'm "just scratching the surface". The list predates the first full releases of Atom and ODF, or they would have been on it. I should also mention XBEL, which is, I think, not as widely trumpeted, but just about as important as those other entrants. BTW, see the full cross-reference of my survey of XML standards.

Designing XML Languages is hard. It’s boring, political, time-consuming, unglamorous, irritating work. It always takes longer than you think it will, and when you’re finished, there’s always this feeling that you could have done more or should have done less or got some detail essentially wrong.—Bray

This is true. It's easy to be flip and say "sure, that's true of programming, but we're not being advised to write no more programs". But then I think this difficulty is even more true of XML design than of programming, and it's worth reminding people that a useful XML vocabulary is not something you toss off in a spare hour. Simon St.Laurent has always been a sound analyst of the harm done by programmers who take shortcuts and abuse markup in order to suit their conventions. The lesson, however, should be to learn best practices of markup design rather than to become a helpless spectator.

If you’re going to design a new language, you’re committing to a major investment in software development. First, you’ll need a validator. Don’t kid yourself that writing a schema will do the trick; any nontrivial language will have a whole lot of constraints that you can’t check in a schema language, and any nontrivial language needs an automated validator if you’re going to get software to interoperate.

If people would just use decent schema technology, this point would be very much weakened. Schema designers rarely see beyond plain W3C XML Schema or RELAX NG. Too bad. RELAX NG plus Schematron (with XPath 1.0/XSLT 1.0 drivers) covers a huge number of constraints. Add in EXSLT 1.0 drivers for Schematron and you can cover probably 95+% of Atom's constraints (probably more, actually). Throw in user-defined extensions and you have a very powerful and mostly declarative validation engine. We should do a better job of rendering such goodness to XML developers, rather than scaring them away with duct-tape-validator bogeymen.
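To make the kind of constraint at issue concrete, here is a minimal sketch of one co-occurrence rule from RFC 4287: an atom:entry with no atom:content must carry at least one atom:link whose rel is "alternate" (a link with no rel attribute counts as alternate). In a real setup this would be a declarative Schematron assertion layered on a RELAX NG grammar; the Python below is only a stand-in to show the logic, and the function name and sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed: the third entry breaks the rule.
SAMPLE_FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>Has content</title>
    <content type="text">Body text</content>
  </entry>
  <entry>
    <title>Link only</title>
    <link rel="alternate" href="http://example.org/2"/>
  </entry>
  <entry>
    <title>Neither content nor alternate link</title>
    <link rel="related" href="http://example.org/3"/>
  </entry>
</feed>
"""


def entries_missing_content_or_alternate(feed_xml):
    """Return the titles of entries with neither content nor an alternate link."""
    root = ET.fromstring(feed_xml)
    failures = []
    for entry in root.findall("atom:entry", ATOM_NS):
        has_content = entry.find("atom:content", ATOM_NS) is not None
        # A link without a rel attribute is interpreted as rel="alternate".
        has_alternate = any(
            link.get("rel", "alternate") == "alternate"
            for link in entry.findall("atom:link", ATOM_NS)
        )
        if not (has_content or has_alternate):
            title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
            failures.append(title)
    return failures


print(entries_missing_content_or_alternate(SAMPLE_FEED))
```

A grammar alone has trouble saying "either this child or that child with this attribute value"; a rule layer says it in one assertion, which is the point of pairing the two.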

Yes, XHTML is semantically weak and doesn’t really grok hierarchy and has a bunch of other problems. That’s OK, because it has a general-purpose class attribute and ignores markup it doesn’t know about and you can bastardize it eight ways from center without anything breaking. The Kool Kids call this “Microformats”...

This understated bit is, I think, the heart of Tim's argument. The problem is that I still haven't been able to figure out why Microformats have any advantage in semantic transparency over new vocabularies. Despite the fuzzy claims of μFormatters, a microformat requires just as much specification as a small, standalone format to be useful. It didn't take me long kicking around XOXO to solve a real-world problem before this became apparent to me.

Some interesting reactions to the piece

Dare Obasanjo. Dare indirectly brought up that Ian Hickson had argued against inventing XML vocabularies in 2003. I remember violently and negatively reacting to the idea that everyone should stick to XHTML and its elite companions. Certainly such limitations make sense for some, but the general case is more nuanced (thank goodness). Side note: another pioneer of the pessimistic side of this argument is Mark Pilgrim (http://www.xml.com/pub/au/164). Needless to say I disagree with many of his points as well.

I've always considered it a gross hack to think that instead of having an HTML web page for my blog and an Atom/RSS feed, instead I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it instead. However given that one of the inventors of XML (Tim Bray) is now advocating this approach, I wonder if I'm simply clinging to old ways and have become the kind of intellectual dinosaur I bemoan.—Obasanjo

Dare is, I think, about as stubborn and tart as I am, so I'm amazed to see him doubting his convictions in this way. Please don't, Dare. You're quite correct. Microformats are just a hair away from my pet reductio ad absurdum: <tag type="title"> rather than just <title>. I still haven't heard a decent argument for such periphrasis. And I don't see how the fact that tag is semantically anchored does anything special for the stepchild identifier title in the microformats scenario.

BTW, there is a priceless quote in comments to Dare:

OK, so they're saying: don't create new XML languages - instead, create new HTML languages. Because if you can't get people to [separate presentation from data], hijack the presentation!—"Steve"

Wot he said. With bells on.

Danny Ayers.

I think most XML languages have been created by one of three processes - translating from a legacy format; mapping directly from the domain entities to the syntax; creating an abstract model from the domain, then mapping from that to the XML. The latter two of these are really on a greyscale: a language designer probably has the abstract entities and relationships in mind when creating the format, whether or not they have been expressed formally.—Ayers

I've had my tiffs with RDF gurus lately, but this is the sort of point you can trust an RDF guru to nail, and Danny does so. XML languages are, like all languages, about expression. The farther the expression lies from the abstraction being expressed (the model), the more expensive the maintenance. Punting to an existing format that might have some vague ties to the problem space is a much worse economic bet than the effort of designing a sound and true format for that problem space.

To slightly repurpose another Danny quote towards XML,

...in most cases it’s probably best to initially make up afresh a new representation that matches the domain model as closely as possible(/appropriate). Only then start looking to replacing the new terms with established ones with matching semantics. But don’t see reusing things as more important than getting an (appropriately) accurate model.—Ayers

Ned Batchelder. He correctly identifies that Tim Bray's points tend to be most applicable to document-style XML. I've long since come to the conclusion (again with a lot of influence from Simon St.Laurent) that XML is too often the wrong solution for programmer-data-focused formats (including software configuration formats). Yeah, of course I've already elaborated in the Python context.

[Uche Ogbuji]

via Copia

Misalignments with the planets

John Clark alerted me that Copia has been missing from Planet XML. I noticed it was also missing from PlanetPython. The Planet XML problem turned out to be because the XSLT used to convert the Atom feed to RSS 1.0 was written for Atom 0.3, and so stopped working when I upgraded Copia to Atom 1.0 late last year. I updated the XSLT and that's sorted out (as an unintended result I pwned that planet for a day or two). Planet Python uses a category feed in Atom from Copia, and I think the problem is that the version of Planet used in this aggregator does not yet support Atom 1.0. Planet XML uses its own aggregation software and has supported Atom 1.0 for a while.
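One well-known way a stylesheet written for Atom 0.3 can go quiet on an Atom 1.0 feed is the namespace change: 0.3 uses http://purl.org/atom/ns# and 1.0 uses http://www.w3.org/2005/Atom, so every namespace-qualified match written for the old format simply selects nothing. Whether that was the whole story in the Planet XML case is an assumption here. The sketch below shows the effect in Python rather than XSLT; the two one-entry feeds and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_03_NS = "http://purl.org/atom/ns#"
ATOM_10_NS = "http://www.w3.org/2005/Atom"

# Hypothetical minimal feeds, one per Atom version.
FEED_03 = (
    '<feed version="0.3" xmlns="http://purl.org/atom/ns#">'
    "<entry><title>Old</title></entry></feed>"
)
FEED_10 = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><title>New</title></entry></feed>"
)


def atom_version(feed_xml):
    """Guess the Atom version from the namespace of the root element."""
    root = ET.fromstring(feed_xml)
    if root.tag == "{%s}feed" % ATOM_10_NS:
        return "1.0"
    if root.tag == "{%s}feed" % ATOM_03_NS:
        return "0.3"
    return None


def count_entries(feed_xml, namespace):
    """Count entry children using a lookup bound to one specific namespace."""
    root = ET.fromstring(feed_xml)
    return len(root.findall("{%s}entry" % namespace))


print(atom_version(FEED_03), atom_version(FEED_10))
# A lookup bound to the 0.3 namespace finds nothing in a 1.0 feed.
print(count_entries(FEED_10, ATOM_03_NS), count_entries(FEED_10, ATOM_10_NS))
```

Nothing errors out in this failure mode, which is why it tends to be noticed only when a feed silently vanishes from an aggregator.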

There have been moves to update (see this message, for example). Now that FeedParser 4.0 is out with Atom 1.0 support, I expect most planets will start to correct their Atom deficiencies.

Meanwhile, John and I have been working with Sylvain Hellegouarch on yet another planet, using our own aggregation software. More on that later.

[Uche Ogbuji]

via Copia

"Process Atom 1.0 with XSLT"

"Process Atom 1.0 with XSLT"

Learn XSLT techniques for processing Atom documents. In this tutorial, author Uche Ogbuji shows how with real-world use cases. (free registration required)

Atom 1.0 is [the] Internet Engineering Task Force (IETF) standard for Web feeds -- information updates on Web site contents. Since Atom is an XML format, XSLT is a powerful tool for processing it. In this tutorial, Uche Ogbuji looks at XSLT techniques for processing Atom documents, addressing real-life use cases.

This tutorial shows you how to:

  • Navigate the basic structure of Atom 1.0 documents using XPath expressions
  • Use these expressions to drive XSLT transformations of Atom source files
  • Deal with the complications of text and markup embedded in Atom files

You will also learn how to use XSLT templates to generate valid Atom files, and how to check the validity of the results.
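For a flavor of the "text and markup embedded in Atom files" complication, here is a rough sketch in Python rather than XSLT, and not taken from the tutorial itself. An Atom text construct carries a type attribute of "text", "html" or "xhtml", and the way you extract its value depends on which one you have. The helper name and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical feed exercising all three text construct types.
SAMPLE = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Plain title</title>
  <entry>
    <title type="html">Less &lt;em&gt;plain&lt;/em&gt;</title>
    <summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">Some <b>bold</b> text</div></summary>
  </entry>
</feed>
"""


def text_construct_value(elem):
    """Return the string carried by an Atom text construct.

    text  -> the character data as-is
    html  -> the markup as a string (escaped in the source, unescaped by the parser)
    xhtml -> the concatenated text of the wrapper div, with tags dropped
    """
    kind = elem.get("type", "text")  # "text" is the default when type is absent
    if kind == "xhtml":
        div = elem.find(XHTML + "div")
        return "".join(div.itertext()) if div is not None else ""
    return elem.text or ""


root = ET.fromstring(SAMPLE)
entry = root.find(ATOM + "entry")
print(text_construct_value(root.find(ATOM + "title")))
print(text_construct_value(entry.find(ATOM + "title")))
print(text_construct_value(entry.find(ATOM + "summary")))
```

Dropping the tags in the xhtml case is a simplification chosen for the sketch; a real transform would more likely copy the child nodes of the div through to the output.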

A companion piece to my recent XML.com article "Handling Atom Text and Content Constructs", this is a task-driven tutorial, taking a more deliberate pace and focusing on XSLT.

developerWorks has had a lot to say about Atom lately, courtesy James Snell (who is also writing a lot of useful Atom extension drafts).

How do you celebrate Atom's promotion to RFC 4287? Why, by cooking up even more reading material, I guess.

[Uche Ogbuji]

via Copia

AJAX and the Back button

Sylvain and I have discussed recently his discomfort with Web browser state of the art in the age of AJAX (to use a grand term, even though I strongly believe that AJAX is nothing but an incremental gathering of conventions rather than anything new and special). He has gathered his thoughts in a blog posting "The chicken and egg problem". I posted a comment, but I thought I might copy the comment here as well.

[Let me summarize] in brief my reasons for thinking that the current system is not broken, and that we do not need to change anything fundamental about browsers.

First of all the basic semantic of "link history" in a Web browser has not changed since the Mosaic days for a very good reason: it is empirical to HTTP, REST and all that. At each point a browser is at a particular resource, and it moves from one resource to another according to actuation of simple REST verbs. Within each resource the browser can do all sorts of complex things, including showing animations (Shockwave, SVG, etc.), providing mini-applications to the user (Java applets, Flash, AJAX, etc.) and more, but the resource has not changed. The boundary of resource is defined by the service provider, and the browser simply reflects that in the history, URL bar and other features. I don't think the back and forward buttons should be overloaded for any operation within a resource. They should not be used as hot buttons in Flash apps or in AJAX apps. This violates the layering that is so important to the success of the Web.

If service providers want to provide navigation within a particular resource, they should do so within the application, and not at the REST level. I want my Front office app to have an "Undo" button (which makes much more sense than "Back"). [Why do I need chameleon browser chrome when I can just do <xforms:button id="undo"><xforms:caption>Undo</xforms:caption>...</xforms:button>?] When I click browser "Back" I want that to exit the application and go to the previous resource.

IMO people think they have trouble with the back button and AJAX because they do not appreciate protocol layering very well, and because the AJAX tools do not yet help in this understanding. I think a better understanding of this layering and better tools are what's needed, not a major redesign of the browser idiom.

[Uche Ogbuji]

via Copia

Agile Web #2: "Handling Atom Text and Content Constructs"

"Handling Atom Text and Content Constructs"

Uche Ogbuji's Agile Web column returns with a look at handling some of the trickier issues in the Atom Syndication Format, which has recently become RFC 4287, an internet standard.

Second article in my new column is out. In this one I focus on Atom text and content constructs. I spent more time on the Atom examples and less on the sample processing code, but I thought more of the former would be especially useful. I've been working with and writing about Atom a lot lately, and in fact I have an IBM developerWorks tutorial for Atom processing in XSLT in production. It should be live some time today.

Joe Gregorio has been working the other half of the Atom pie (old joke for folks who've been following Atom), and he has a very timely new article out: "Catching Up with the Atom Publishing Protocol".

And once again, if you'd like to discuss Atom (syntax or publishing protocol), please do join us on the #atom channel on irc.freenode.net.

[Uche Ogbuji]

via Copia

XSLT for converting from OPML to XBEL and XOXO

In all this Web feed hacking I've been working with my list originally exported from Lektora in OPML format. I wrote XSLT to convert from OPML to XBEL and XOXO. In the case of XOXO I really couldn't figure out any common conventions for Web feeds so I made up my own for now. The resulting XBEL looks a lot easier to work with, so I'm proposing extensions for feed URL / site URL coupling in the renewal of XBEL. I figured my XSLT might be useful to others, so here are the links:

Going from XBEL to OPML, I've been using Dan MacTough's XSLT. (He also has an XBEL to XHTML transform). I sometimes have to tweak the resulting attributes to deal with xmlUrl/url and title/text type OPML madness.

I've also posted my Web feed list in XBEL form. It uses old school XBEL 1.0, and not any of the metadata additions I'm hoping to see in 1.2. As such, it's only a list of Web feeds and doesn't include the corresponding Weblog home pages.
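For readers who would rather see the shape of the OPML to XBEL mapping than read a stylesheet, here is a minimal sketch in Python. It is not the XSLT mentioned above, just an illustration of the same idea under a few assumptions: the feed URL may show up as either xmlUrl or url, the label as either title or text, and nested outlines are flattened into a single list of XBEL 1.0 bookmarks. The sample OPML and the function name are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML showing the attribute variations seen in the wild.
SAMPLE_OPML = """\
<opml version="1.1">
  <head><title>Feeds</title></head>
  <body>
    <outline text="Example A" xmlUrl="http://example.org/a/atom.xml" htmlUrl="http://example.org/a/"/>
    <outline title="Example B" url="http://example.org/b/rss.xml"/>
    <outline text="A folder with no feed URL"/>
  </body>
</opml>
"""


def opml_to_xbel(opml_xml):
    """Build a flat XBEL 1.0 tree with one bookmark per OPML outline that has a feed URL."""
    opml = ET.fromstring(opml_xml)
    xbel = ET.Element("xbel", version="1.0")
    # iter() walks nested outlines too, so any folder structure is flattened.
    for outline in opml.iter("outline"):
        href = outline.get("xmlUrl") or outline.get("url")
        if not href:
            continue  # a grouping outline or one without a feed; skip it
        label = outline.get("title") or outline.get("text") or href
        bookmark = ET.SubElement(xbel, "bookmark", href=href)
        ET.SubElement(bookmark, "title").text = label
    return xbel


xbel = opml_to_xbel(SAMPLE_OPML)
print(ET.tostring(xbel, encoding="unicode"))
```

Note that htmlUrl is dropped on the floor: plain XBEL 1.0 has one href per bookmark, which is exactly the feed URL / site URL coupling gap described above.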

[Uche Ogbuji]

via Copia

XML Bookmark Exchange Language (XBEL) gets a proper home

XML Bookmark Exchange Language (XBEL)

The Python XML SIG has had some really great times in its history. One of the highlights is the development of XML Bookmark Exchange Language (XBEL). In September of 1998, just as I was joining the group, they were developing this bookmarks exchange language that's still used in more browsers and bookmark management projects than any other particular format. The XML-SIG has fallen on quiet times, and one of the side effects of this is that additional work on XBEL has been neglected.

Earlier this year we agreed on the SIG to give XBEL its own home on SourceForge, but no one stepped up to make it happen, until John L. Clark got to it last week (thanks, John).

XBEL's new home is http://sourceforge.net/projects/xbel/. The old home is still up, but I think we should move it to http://xbel.sourceforge.net/, with some updates and maybe a design update (maybe make the page XHTML). We'll be discussing such things on the new XBEL mailing list, so please come join us. The main goal is to add more features to XBEL needed for its original role in browser bookmarks exchange, but I'm also interested in making it a useful format for general Web resource lists such as feed lists (e.g. a superior alternative to OPML).

John wrote up a good summary of recent discussions of XBEL.

I'll have more on our efforts summarized here on Copia as we progress.

[Uche Ogbuji]

via Copia

Follow-up Copia housecleaning

Of course chores lead to more chores. After last week's round of tweaks to Copia I got a suggestion from Aristotle to rearrange the entry-specific titles, and I've done so. I got a bit more info from Tom Passin about possible encoding problems that has only deepened my bafflement.

I also noticed there has been some confusion over last week's birth announcement. It came from Chimezie, not me (congrats, brother!). On Copia the author is specified for each entry, but previously there wasn't any such useful distinction being made in the Atom 0.3 (I'll be working on an Atom 1.0 flavor for PyBlosxom soon) or RSS 1.0 feeds. I've fixed that, but I've done so in a way I'm not sure all feed sinks will process correctly. In the Atom feed there is a top-level

<author>
  <name>Uche and Chimezie Ogbuji</name>
  <url>http://copia.ogbuji.net/blog/</url>
  <email>uche@ogbuji.net</email>
</author>

And then for each entry a more specific author, for example:

<title>Chikaora Zion Credell Ogbuji</title>
...
<author>
  <name>chimezie</name>
</author>

I hope that helps. I made some other tweaks to the feeds, and this does seem to have had the unfortunate side-effect of pushing everything back onto the front page of Planet XML. My apologies to Planet XML readers (including me: I'd hoped to catch up after the holidays and found only Copia entries).

Copia already tells you the author of each entry, in the info line at the end of the entry.
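What a feed sink is expected to do with this arrangement can be sketched in a few lines of Python: use the entry's own author elements when present, and fall back to the feed-level ones otherwise. This is an illustration rather than the behavior of any particular aggregator. It assumes the Atom 1.0 namespace by default (pass the 0.3 namespace for an older feed), ignores atom:source, and uses a sample feed whose second entry is invented to show the fallback.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Sample feed: names taken from the snippets above; the second entry,
# which has no author of its own, is hypothetical.
SAMPLE_FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Copia</title>
  <author><name>Uche and Chimezie Ogbuji</name></author>
  <entry>
    <title>Chikaora Zion Credell Ogbuji</title>
    <author><name>chimezie</name></author>
  </entry>
  <entry>
    <title>Follow-up Copia housecleaning</title>
  </entry>
</feed>
"""


def effective_authors(feed_xml, ns=ATOM_NS):
    """Map each entry title to its author names, falling back to the feed's authors."""

    def q(local):
        # Build a namespace-qualified tag name in ElementTree's {uri}local form.
        return "{%s}%s" % (ns, local)

    def names(parent):
        return [a.findtext(q("name"), default="") for a in parent.findall(q("author"))]

    root = ET.fromstring(feed_xml)
    feed_authors = names(root)
    result = {}
    for entry in root.findall(q("entry")):
        title = entry.findtext(q("title"), default="")
        # Entry-level authors win; the feed-level ones apply only when there are none.
        result[title] = names(entry) or feed_authors
    return result


for title, authors in effective_authors(SAMPLE_FEED).items():
    print(title, "->", ", ".join(authors))
```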

[Uche Ogbuji]

via Copia

Copia housecleaning

I've finally had some time today, as I prepare for the holidays, to fix some things on Copia that have been broken for too long. Some of the highlights, especially concerning issues mentioned by readers (thanks, guys), are:

RSS 1.0 feed body fix. Added rss:description field for the RSS 1.0 feed, which fixes missing post bodies in readers such as Bloglines, which don't support content:encoded. I do truncate the field to 500 characters, according to the recommendation in the spec.

Single entry view title fix. Added entry titles for single entry pages. Before today, if you viewed this entry through the perma-link, the title would just say "Copia"; now it says "Copia ✏Copia housecleaning". I've wanted to do this for a while, but I was having the devil of a time figuring out how to do it with PyBlosxom. A scolding from Dan Connolly forced me to chase down a fix. For other PyBlosxom users the trick is to use the comments plug-in, copy the head.* flavor file to comment-head.*, and then update it to use the $title variable, which is the title of the entry itself ($blog_title is the title of the entire blog). In my case the updated HTML header template looks like:

<title>$blog_title &#x270F;$title</title>

I did get a report that Copia is incorrectly sending the header `Content-Type: text/html;charset=ISO-8859-1`, but when I check using the LiveHTTPHeaders extension for Firefox on Linux it reports the correct charset=UTF-8 from the server. If anyone else can corroborate this issue, please leave a comment with the specific URL from which you noticed the error, your platform and browser, and the HTTP sniffing tool you were using. Thanks.
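If you capture the raw header value with whatever sniffing tool you prefer, pulling out the declared charset is easy to script. Here is a small sketch using only the Python standard library; it parses a header string you supply and does not fetch anything from the network. The function name is made up.

```python
from email.message import Message


def declared_charset(content_type_value):
    """Return the charset parameter of a Content-Type value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_charset()


# The value from the report, and the one seen in LiveHTTPHeaders.
print(declared_charset("text/html;charset=ISO-8859-1"))
print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
```

Comparing the result for the same URL from two different clients or networks is one way to tell a server-side misconfiguration from something an intermediary is rewriting.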

[Uche Ogbuji]

via Copia