Planet Atom's Information Pipeline

The hosting of Planet Atom has moved (perhaps temporarily) over to Athena: Christopher Schmidt's excellent hosting service.
Metacognition is hosted there. In addition, the planetatom source was extended to support additional representational modes for the aggregated Atom feed: GRDDL, RDFa, Atom OWL RDF/XML (via content negotiation), and Exhibit.

The latter was the subject of my presentation at XTech 2007. As I mentioned during my session, you can go to http://planetatom.net/exhibit to see the live faceted-browsing of the aggregated Atom feed. An excerpt from the Planet Atom front page describes the nature of the project:

Planet Atom focuses Atom streams from authors with an affinity for syndication and Atom-specific issues. This site was developed by Sylvain Hellegouarch, Uche Ogbuji, John L. Clark, and Chimezie Ogbuji

I wrote previously (in my XML 2006 paper) on this methodology of splicing out multiple (disjoint) representations from a single XML source and the Exhibit mode is yet another facet: specifically for quick, cross-browser, filtering of the aggregated feed.

Planet Atom Pipeline

The Planet Atom pipeline as a whole is pretty interesting. First, an XBEL bookmark document is used as the source for aggregation. RESTful caching minimizes load on the sources during aggregation. The aggregation groups the entries by calendar groups (months and days). The final aggregated feed is then sent through several XML pipelines to generate the JSON which drives the Exhibit view, an HTML version of the front page, an XHTML version of the front page (one of these two is served to the client, depending on the kind of agent that requested the front page), and an RDF/XML serialization of the feed expressed in Atom OWL.
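The first and third steps can be sketched in a few lines of Python. This is only an illustration, not the planetatom source: the XBEL document, the feed URLs, and the function names `feed_urls` and `group_by_calendar` are all made up for the example, and entries are reduced to (updated, title) pairs.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from datetime import datetime

# Hypothetical XBEL source; the real bookmark file is not shown in the post.
XBEL = """<xbel version="1.0">
  <bookmark href="http://example.org/a/atom.xml"><title>Feed A</title></bookmark>
  <folder>
    <bookmark href="http://example.org/b/atom.xml"><title>Feed B</title></bookmark>
  </folder>
</xbel>"""


def feed_urls(xbel_text):
    """Return the href of every bookmark, including those nested in folders."""
    root = ET.fromstring(xbel_text)
    return [bookmark.get("href") for bookmark in root.iter("bookmark")]


def group_by_calendar(entries):
    """Group (updated, title) pairs into {(year, month): {day: [titles]}}."""
    groups = defaultdict(lambda: defaultdict(list))
    for updated, title in entries:
        stamp = datetime.strptime(updated, "%Y-%m-%dT%H:%M:%SZ")
        groups[(stamp.year, stamp.month)][stamp.day].append(title)
    return groups


# Illustrative entries only; the second and third are invented.
entries = [
    ("2006-12-10T00:13:34Z", "The Darfur Wall"),
    ("2006-12-10T06:57:54Z", "Another entry"),
    ("2006-11-02T12:00:00Z", "An older entry"),
]
calendar = group_by_calendar(entries)
```

Fetching and caching are left out on purpose; they depend on HTTP details the post does not describe.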

Note in the diagram that a Microformat approach could have been used instead to embed the Atom OWL statements. RDFa was used instead as it was much easier to encode the statements in a common language and not contend with adding profiles for each Microformat used. Elias's XTech 2007 presentation touched a bit on this contrast between the two approaches. In either case, GRDDL is used to extract the RDF.

These representations are stored statically at the server and served appropriately via a simple CherryPy module. As mentioned earlier, the XHTML front page now embeds the Atom OWL assertions about the feed (as well as assertions about the sources, their authors, and the Planet Atom developers) in RDFa and includes hooks for a GRDDL-aware agent to extract a faithful rendition of the feed in RDF/XML. The same XML pipeline which extracts the RDF/XML from the aggregated feed is identified as a GRDDL transform. So, the RDF can either be fetched via content negotiation or by explicit GRDDL processing.
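For readers unfamiliar with content negotiation, here is a rough sketch of the selection step. It is not the CherryPy module itself: the file names in `REPRESENTATIONS` are invented, and it assumes the decision is driven purely by the request's Accept header, which the post does not confirm.

```python
# Hypothetical file names; the post does not name the stored files.
REPRESENTATIONS = {
    "application/rdf+xml": "index.rdf",
    "application/xhtml+xml": "index.xhtml",
    "text/html": "index.html",
}
DEFAULT = "text/html"


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    ranges = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((fields[0].lower(), q))
    # sorted() is stable, so ties keep the order the client sent.
    return sorted(ranges, key=lambda pair: -pair[1])


def choose_representation(accept_header):
    """Return (media_type, file_name) for the best match we have stored."""
    for media_type, q in parse_accept(accept_header or ""):
        if q > 0 and media_type in REPRESENTATIONS:
            return media_type, REPRESENTATIONS[media_type]
    return DEFAULT, REPRESENTATIONS[DEFAULT]
```

An RDF-aware agent sending `Accept: application/rdf+xml` would get the Atom OWL serialization, while a browser falls through to one of the two front pages. Wildcard ranges such as `*/*` are simply ignored here and land on the default.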

Unfortunately, the RDFa is broken. RDFa can be extracted by an explicit parser (which is how Elias Torres's Python-based RDFa parser, his recent work on Operator, and Ben Adida's RDFa bookmarklets work) or via XSLT as part of a GRDDL mechanism. Due to a quirk in the way RDFa uses namespace declarations (which may or may not be a necessary evil), the various vocabularies used in the resulting RDF/XML are not properly expanded from the CURIEs to their full URI form. I mentioned this quirk to Steven Pemberton.

As it happens, the stylesheet which transforms the aggregated Atom feed into the XHTML host document defines the namespace declarations:

xmlns:dc="http://purl.org/dc/elements/1.1/" 
  xmlns:foaf="http://xmlns.com/foaf/0.1/" 
  xmlns:aOwl="http://bblfish.net/work/atom-owl/2006-06-06/#" 
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

However, since there are no elements which use QNames formed from these declarations, they are not included in the XSLT result! This trips up the RDFa -> RDF/XML transformation (written by Fabien Gandon, a fellow GRDDL WG member) and results in RDF statements where the URIs are simply the CURIEs as originally expressed in RDFa. This seems to be a problem only for XSLT processors which explicitly strip out the unused namespace declarations. They have a right to do this, as it has no effect on the underlying infoset. Something for the RDF-in-XHTML task group to consider, especially for scenarios such as this where the XHTML+RDFa is not hand-crafted but produced from an XML pipeline.
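The failure mode is easy to reproduce in miniature. The helper below is a hypothetical stand-in for whatever CURIE resolution the transform performs, not Fabien's stylesheet: it expands a prefix only when that prefix is bound, and otherwise passes the CURIE through untouched.

```python
def expand_curie(curie, namespaces):
    """Expand prefix:local to a full URI, or return it unchanged if unbound."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return curie


declared = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

# With the declarations in scope, the CURIE resolves.
with_ns = expand_curie("foaf:name", declared)

# After an XSLT processor drops the "unused" declarations, nothing is bound,
# so the CURIE leaks into the RDF as if it were the URI.
without_ns = expand_curie("foaf:name", {})
```

Here `with_ns` is the full FOAF URI and `without_ns` is still the bare string `foaf:name`, which is the symptom described above.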

[Uche Ogbuji]

via Copia

Atom Feed Semantics

Not a lot of people outside the core Semantic Web community actually want to create RDF, but extracting it from what's already there can be useful for a wide variety of projects. (RSS and Atom are first and relatively easy steps in that direction.)

Terminal dump

chimezie@Zion:~/devel/grddl-hg$ python GRDDL.py --debug --output-format=n3 --zone=https: --ns=aowl=http://bblfish.net/work/atom-owl/2006-06-06/# --ns=iana=http://www.iana.org/assignments/relation/ --ns=some-blog=http://example.org/2003/12/13/  https://sommer.dev.java.net/atom/2006-06-06/transform/atom-grddl.xml
binding foaf to http://xmlns.com/foaf/0.1/
binding owl to http://www.w3.org/2002/07/owl#
binding iana to http://www.iana.org/assignments/relation/
binding rdfs to http://www.w3.org/2000/01/rdf-schema#
binding wot to http://xmlns.com/wot/0.1/
binding dc to http://purl.org/dc/elements/1.1/
binding aowl to http://bblfish.net/work/atom-owl/2006-06-06/#
binding rdf to http://www.w3.org/1999/02/22-rdf-syntax-ns#
binding some-blog to http://example.org/2003/12/13/
Attempting a comprehensive glean of  https://sommer.dev.java.net/atom/2006-06-06/transform/atom-grddl.xml
@@fetching:  https://sommer.dev.java.net/atom/2006-06-06/transform/atom-grddl.xml
@@ignoring types: ('application/rdf+xml', 'application/xml', 'text/xml', 'application/xhtml+xml', 'text/html')
applying transformation https://sommer.dev.java.net/atom/2006-06-06/transform/atom2turtle_xslt-1.0.xsl
@@fetching:  https://sommer.dev.java.net/atom/2006-06-06/transform/atom2turtle_xslt-1.0.xsl
@@ignoring types: ('application/xml',)
Parsed 22 triples as Notation 3
Attempting a comprehensive glean of  http://www.w3.org/2005/Atom

Via atom2turtle_xslt-1.0.xsl and Atom OWL, the GRDDL result document:

@prefix aowl: <http://bblfish.net/work/atom-owl/2006-06-06/#>.
@prefix iana: <http://www.iana.org/assignments/relation/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix some-blog: <http://example.org/2003/12/13/>.
[ a aowl:Feed;
     aowl:author [ a aowl:Person;
             aowl:name "John Doe"];
     aowl:entry [ a aowl:Entry;
             aowl:id "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a"^^<http://www.w3.org/2001/XMLSchema#anyURI>;
             aowl:link [ a aowl:Link;
                     aowl:rel iana:alternate;
                     aowl:to [ aowl:src some-blog:atom03]];
             aowl:title "Atom-Powered Robots Run Amok";
             aowl:updated "2003-12-13T18:30:02Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>];
     aowl:id "urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6"^^<http://www.w3.org/2001/XMLSchema#anyURI>;
     aowl:link [ a aowl:Link;
             aowl:rel iana:alternate;
             aowl:to [ aowl:src <http://example.org/>]];
     aowl:title "Example Feed";
     aowl:updated "2003-12-13T18:30:02Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>].

Planet Atom's feed

@prefix : <http://bblfish.net/work/atom-owl/2006-06-06/#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix iana: <http://www.iana.org/assignments/relation/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
[] a :Feed ;
   :id "http://planetatom.net/"^^xsd:anyURI ;
   :title "Planet Atom" ;
   :updated "2006-12-10T06:57:54.166890Z"^^xsd:dateTime ;
   :generator [ a :Generator ;
                :uri <> ;
                :generatorVersion "" ;
                :name """atomixlib""" ] ;
   :entry [ a :Entry ;
            :title "The Darfur Wall" ;
            :author [ a :Person ; :name "James Tauber" ] ;
            :link [ a :Link ;
                    :rel iana:alternate ;
                    :to [ :src <http://jtauber.com/blog/2006/12/10/the_darfur_wall> ] ] ;
            :updated "2006-12-10T00:13:34Z"^^xsd:dateTime ;
            :published "2006-12-10T00:13:34Z"^^xsd:dateTime ;
            :id "http://jtauber.com/blog/2006/12/10/the_darfur_wall"^^xsd:anyURI ] .

[Uche Ogbuji]

via Copia

A universal feed -> RDF mapping for Emeka

I found a nice mapping from Universal Feed Parser to RDF (SKOS, DC, Atom OWL) that Emeka will employ:

Each entry is an instance of (atomOwl:Entry,rss:item)

  • The URL of the feed -> an instance of atomOwl:Feed
  • Feed - atomOwl:entry -> entries
  • entry (link or id as URI) - rdfs:label,skos:prefLabel,dc:title -> entry.title
  • entry - dc:description,atomOwl:summary,rdfs:comment -> entry.summary
  • entry,feed - dc:creator, foaf:maker -> foaf:Person
  • entry.author_detail.name -> foaf:name
  • entry.author_detail.email -> foaf:mbox
  • entry.author_detail.href is the URL of the author
  • entries.tags -> skos:Collection
  • entries.tags.label -> skos:prefLabel
  • entries.tags.scheme + entries.tags.term (URI resolution) -> URI of skos:Concept
  • entry - dc:created,dc:date,atomOwl:published -> entry.published
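A rough sketch of how part of this mapping could look in code, assuming entries arrive as plain dictionaries shaped like Universal Feed Parser output. This is not Emeka's code: `entry_triples` is a hypothetical helper, predicates are kept as CURIE strings rather than full URIs, the `_:author` label and the `mailto:` prefix on `foaf:mbox` are my own additions, and the tag-to-SKOS rules are left out for brevity.

```python
def entry_triples(feed_url, entry):
    """Yield (subject, predicate, object) tuples for one parsed entry."""
    subject = entry.get("link") or entry.get("id")
    yield (feed_url, "rdf:type", "atomOwl:Feed")
    yield (feed_url, "atomOwl:entry", subject)
    yield (subject, "rdf:type", "atomOwl:Entry")
    # Each parsed field fans out to the predicates listed in the mapping.
    literal_maps = {
        "title": ("rdfs:label", "skos:prefLabel", "dc:title"),
        "summary": ("dc:description", "atomOwl:summary", "rdfs:comment"),
        "published": ("dc:created", "dc:date", "atomOwl:published"),
    }
    for key, predicates in literal_maps.items():
        if key in entry:
            for predicate in predicates:
                yield (subject, predicate, entry[key])
    author = entry.get("author_detail")
    if author:
        # Use the author's URL when present; otherwise a made-up local label.
        person = author.get("href") or "_:author"
        for predicate in ("dc:creator", "foaf:maker"):
            yield (subject, predicate, person)
        yield (person, "rdf:type", "foaf:Person")
        if "name" in author:
            yield (person, "foaf:name", author["name"])
        if "email" in author:
            # Assumption: emit the mailbox as a mailto: URI.
            yield (person, "foaf:mbox", "mailto:" + author["email"])


# Sample values borrowed from the Planet Atom feed excerpt.
sample = {
    "link": "http://jtauber.com/blog/2006/12/10/the_darfur_wall",
    "title": "The Darfur Wall",
    "published": "2006-12-10T00:13:34Z",
    "author_detail": {"name": "James Tauber"},
}
triples = list(entry_triples("http://planetatom.net/", sample))
```

Feeding the tuples into a real RDF store would mean replacing the CURIE strings with proper URI and literal terms.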

Chimezie Ogbuji

via Copia

Learn how to invent XML languages, then do so

There has been a lot of chatter about Tim Bray's piece "Don’t Invent XML Languages". Good. I'm all for anything that makes people think carefully about XML language design and problems of semantic transparency (communicated meaning of XML structure). I'm all for it even though I generally disagree with Tim's conclusions. Here are some quick thoughts on Tim's essay, and some of the responses I've seen.

Here’s a radical idea: don’t even think of making your own language until you’re sure that you can’t do the job using one of the Big Five: XHTML, DocBook, ODF, UBL, and Atom.—Bray

This is a pretty biased list, and happens to make sense for the circles in which he moves. Even though I happen to move in much the same circles, the first thing I'd say is that there could hardly ever be an authoritative "big 5" list of XML vocabs. There is too much debate and diversity, and that's too good a thing to sweep under the rug. MS Office XML or ODF? OAGIS or UBL? RSS 2.0 or Atom? Sure I happen to plump for the latter three, as Tim does, but things are not so clear cut for the average punter. (I didn't mention TEI or DocBook because it's much less of a head to head battle).

I made my own list in "A survey of XML standards: Part 3—The most important vocabularies" (IBM developerWorks, 2004). It goes:

  • XHTML
  • DocBook
  • XSL-FO
  • SVG
  • VoiceXML
  • MathML
  • SMIL
  • RDF
  • XML Topic Maps

And in that article I admit I'm "just scratching the surface". The list predates first full releases of Atom and ODF, or they would have been on it. I should also mention XBEL, which is, I think, not as widely trumpeted, but just about as important as those other entrants. BTW, see the full cross-reference of my survey of XML standards.

Designing XML Languages is hard. It’s boring, political, time-consuming, unglamorous, irritating work. It always takes longer than you think it will, and when you’re finished, there’s always this feeling that you could have done more or should have done less or got some detail essentially wrong.—Bray

This is true. It's easy to be flip and say "sure, that's true of programming, but we're not being advised to write no more programs". But then I think this difficulty is even more true of XML design than of programming, and it's worth reminding people that a useful XML vocabulary is not something you toss off in the spare hour. Simon St.Laurent has always been a sound analyst of the harm done by programmers who take shortcuts and abuse markup in order to suit their conventions. The lesson, however, should be to learn best practices of markup design rather than to become a helpless spectator.

If you’re going to design a new language, you’re committing to a major investment in software development. First, you’ll need a validator. Don’t kid yourself that writing a schema will do the trick; any nontrivial language will have a whole lot of constraints that you can’t check in a schema language, and any nontrivial language needs an automated validator if you’re going to get software to interoperate.

If people would just use decent schema technology, this point would be very much weakened. Schema designers rarely see beyond plain W3C XML Schema or RELAX NG. Too bad. RELAX NG plus Schematron (with XPath 1.0/XSLT 1.0 drivers) covers a huge number of constraints. Add in EXSLT 1.0 drivers for Schematron and you can cover probably 95+% of Atom's constraints (probably more, actually). Throw in user-defined extensions and you have a very powerful and mostly declarative validation engine. We should do a better job of rendering such goodness to XML developers, rather than scaring them away with duct-tape-validator bogeymen.
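To make "constraints that you can't check in a schema language" concrete, take Atom's rule that a feed must carry an author unless every one of its entries does. A grammar alone struggles with that co-occurrence, but as a Schematron assertion it is roughly the one-line test `atom:author or not(atom:entry[not(atom:author)])`. The Python below only spells out the same logic for readers who don't know Schematron; the function name is made up, and it is not meant as a substitute for the declarative approach.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def feed_has_required_authors(feed_text):
    """True if the feed has an author or every entry carries its own."""
    root = ET.fromstring(feed_text)
    if root.find(ATOM + "author") is not None:
        return True
    # No feed-level author: every entry must supply one.
    return all(
        entry.find(ATOM + "author") is not None
        for entry in root.findall(ATOM + "entry")
    )
```

The point stands either way: one assertion in a rule-based schema covers what would otherwise be a hand-written check like this one.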

Yes, XHTML is semantically weak and doesn’t really grok hierarchy and has a bunch of other problems. That’s OK, because it has a general-purpose class attribute and ignores markup it doesn’t know about and you can bastardize it eight ways from center without anything breaking. The Kool Kids call this “Microformats”...

This understated bit is, I think, the heart of Tim's argument. The problem is that I still haven't been able to figure out why Microformats have any advantage in Semantic transparency over new vocabularies. Despite the fuzzy claims of μFormatters, a microformat requires just as much specification as a small, standalone format to be useful. It didn't take me long kicking around XOXO to solve a real-world problem before this became apparent to me.

Some interesting reactions to the piece

Dare Obasanjo. Dare indirectly brought up that Ian Hickson had argued against inventing XML vocabularies in 2003. I remember violently and negatively reacting to the idea that everyone should stick to XHTML and its elite companions. Certainly such limitations make sense for some, but the general case is more nuanced (thank goodness). Side note: another pioneer of the pessimistic side of this argument is Mark Pilgrim (http://www.xml.com/pub/au/164). Needless to say I disagree with many of his points as well.

I've always considered it a gross hack to think that instead of having an HTML web page for my blog and an Atom/RSS feed, instead I should have a single HTML page with <div class="rss:item"> or <h3 class="atom:title"> embedded in it instead. However given that one of the inventors of XML (Tim Bray) is now advocating this approach, I wonder if I'm simply clinging to old ways and have become the kind of intellectual dinosaur I bemoan.—Obasanjo

Dare is, I think, about as stubborn and tart as I am, so I'm amazed to see him doubting his convictions in this way. Please don't, Dare. You're quite correct. Microformats are just a hair away from my pet reductio ad absurdum: <tag type="title"> rather than just <title>. I still haven't heard a decent argument for such periphrasis. And I don't see how the fact that tag is semantically anchored does anything special for the stepchild identifier title in the microformats scenario.

BTW, there is a priceless quote in comments to Dare:

OK, so they're saying: don't create new XML languages - instead, create new HTML languages. Because if you can't get people to [separate presentation from data], hijack the presentation!—"Steve"

Wot he said. With bells on.

Danny Ayers.

I think most XML languages have been created by one of three processes - translating from a legacy format; mapping directly from the domain entities to the syntax; creating an abstract model from the domain, then mapping from that to the XML. The latter two of these are really on a greyscale: a language designer probably has the abstract entities and relationships in mind when creating the format, whether or not they have been expressed formally.—Ayers

I've had my tiffs with RDF gurus lately, but this is the sort of point you can trust an RDF guru to nail, and Danny does so. XML languages are, like all languages, about expression. The farther the expression lies from the abstraction being expressed (the model), the more expensive the maintenance. Punting to an existing format that might have some vague ties to the problem space is a much worse economic bet than the effort of designing a sound and true format for that problem space.

To slightly repurpose another Danny quote towards XML,

...in most cases it’s probably best to initially make up afresh a new representation that matches the domain model as closely as possible(/appropriate). Only then start looking to replacing the new terms with established ones with matching semantics. But don’t see reusing things as more important than getting an (appropriately) accurate model.—Ayers

Ned Batchelder. He correctly identifies that Tim Bray's points tend to be most applicable to document-style XML. I've long since come to the conclusion (again with a lot of influence from Simon St.Laurent) that XML is too often the wrong solution for programmer-data-focused formats (including software configuration formats). Yeah, of course I've already elaborated in the Python context.

[Uche Ogbuji]

via Copia

Misalignments with the planets

John Clark alerted me that Copia has been missing from Planet XML. I noticed it was also missing from PlanetPython. The Planet XML problem turned out to be because the XSLT used to convert the Atom feed to RSS 1.0 was written for Atom 0.3, and so stopped working when I upgraded Copia to Atom 1.0 late last year. I updated the XSLT and that's sorted out (as an unintended result I pwned that planet for a day or two). Planet Python uses a category feed in Atom from Copia, and I think the problem is that the version of Planet used in this aggregator does not yet support Atom 1.0. Planet XML uses its own aggregation software and has supported Atom 1.0 for a while.
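The stylesheet isn't shown here, but one likely reason a transform written for Atom 0.3 goes silent on an Atom 1.0 feed is that the two versions use different XML namespaces, so patterns written against the old one match nothing. A small sketch of the effect, with made-up one-line feeds and a hypothetical `atom_version` helper:

```python
import xml.etree.ElementTree as ET

ATOM_03 = "http://purl.org/atom/ns#"
ATOM_10 = "http://www.w3.org/2005/Atom"


def atom_version(feed_text):
    """Return '0.3', '1.0', or None based on the root element's namespace."""
    root = ET.fromstring(feed_text)
    if root.tag == "{%s}feed" % ATOM_03:
        return "0.3"
    if root.tag == "{%s}feed" % ATOM_10:
        return "1.0"
    return None


# Minimal illustrative feeds, not real Copia output.
old = '<feed version="0.3" xmlns="http://purl.org/atom/ns#"><title>t</title></feed>'
new = '<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>'

# A path written against the 0.3 namespace finds nothing in a 1.0 feed.
titles_seen_by_old_path = ET.fromstring(new).findall("{%s}title" % ATOM_03)
```

An aggregator that checks the namespace up front can at least report an unsupported version instead of quietly dropping the feed.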

There have been moves to update (see this message, for example). Now that FeedParser 4.0 is out with Atom 1.0 support, I expect most planets will start to correct their Atom deficiencies.

Meanwhile, John and I have been working with Sylvain Hellegouarch on yet another planet, using our own aggregation software. More on that later.

[Uche Ogbuji]

via Copia

"Process Atom 1.0 with XSLT"

Learn XSLT techniques for processing Atom documents. In this tutorial, author Uche Ogbuji shows how with real-world use cases. (free registration required)

Atom 1.0 is [the] Internet Engineering Task Force (IETF) standard for Web feeds -- information updates on Web site contents. Since Atom is an XML format, XSLT is a powerful tool for processing it. In this tutorial, Uche Ogbuji looks at XSLT techniques for processing Atom documents, addressing real-life use cases.

This tutorial shows you how to:

  • Navigate the basic structure of Atom 1.0 documents using XPath expressions
  • Use these expressions to drive XSLT transformations of Atom source files
  • Deal with the complications of text and markup embedded in Atom files

You will also learn how to use XSLT templates to generate valid Atom files, and how to check the validity of the results.

A companion piece to my recent XML.com article "Handling Atom Text and Content Constructs", this is a task-driven tutorial, taking a more deliberate pace and focusing on XSLT.

developerWorks has had a lot to say about Atom lately, courtesy James Snell (who is also writing a lot of useful Atom extension drafts).

How do you celebrate Atom's promotion to RFC 4287? Why, by cooking up even more reading material, I guess.

[Uche Ogbuji]

via Copia

Follow-up Copia housecleaning

Of course chores lead to more chores. After last week's round of tweaks to Copia I got a suggestion from Aristotle to rearrange the entry-specific titles, and I've done so. I got a bit more info from Tom Passin about possible encoding problems that has only deepened my bafflement.

I also noticed there has been some confusion over last week's birth announcement. It came from Chimezie, not me (congrats, brother!). On Copia the author is specified for each entry, but previously there wasn't any such useful distinction being made in the Atom 0.3 (I'll be working on an Atom 1.0 flavor for PyBlosxom soon) or RSS 1.0 feeds. I've fixed that, but I've done so in a way I'm not sure all feed sinks will process correctly. In the Atom feed there is a top-level

<author>
  <name>Uche and Chimezie Ogbuji</name>
  <url>http://copia.ogbuji.net/blog/</url>
  <email>uche@ogbuji.net</email>
</author>

And then for each entry a more specific author, for example:

<title>Chikaora Zion Credell Ogbuji</title>
...
<author>
  <name>chimezie</name>
</author>

I hope that helps. I made some other tweaks to the feeds, and this does seem to have had the unfortunate side-effect of pushing everything back onto the front page of Planet XML. My apologies to Planet XML readers (including me: I'd hoped to catch up after the holidays and found only Copia entries).

Copia already tells you the author of each entry, in the info line at the end of the entry.
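For anyone writing a feed sink, the intended reading is that an entry's own author wins and the top-level author is only a fallback. Here is a sketch of that resolution, assuming the Atom 0.3 namespace and a cut-down feed; the second entry and the function name `authors_by_entry` are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://purl.org/atom/ns#"}  # Atom 0.3, as in the snippets above

# Cut-down illustrative feed; the second entry is hypothetical.
FEED = """<feed version="0.3" xmlns="http://purl.org/atom/ns#">
  <author><name>Uche and Chimezie Ogbuji</name></author>
  <entry>
    <title>Chikaora Zion Credell Ogbuji</title>
    <author><name>chimezie</name></author>
  </entry>
  <entry>
    <title>An entry without its own author</title>
  </entry>
</feed>"""


def authors_by_entry(feed_text):
    """Map each entry title to its own author name, else the feed-level one."""
    root = ET.fromstring(feed_text)
    feed_author = root.findtext("a:author/a:name", namespaces=NS)
    result = {}
    for entry in root.findall("a:entry", NS):
        title = entry.findtext("a:title", namespaces=NS)
        own = entry.findtext("a:author/a:name", namespaces=NS)
        result[title] = own or feed_author
    return result
```

With this reading the birth announcement is attributed to chimezie, and only entries lacking an author element inherit the joint byline.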

[Uche Ogbuji]

via Copia

Copia housecleaning

I've finally had some time today, as I prepare for the holidays, to fix some things on Copia that have been broken for too long. Some of the highlights, especially concerning issues mentioned by readers (thanks, guys), are:

RSS 1.0 feed body fix. Added rss:description field for the RSS 1.0 feed, which fixes missing post bodies in readers such as Bloglines which don't support content:encoded. I do truncate the field to 500 characters, according to the recommendation in the spec.
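The truncation itself is trivial, but here is a minimal sketch for anyone doing the same in their own flavor code. It is not the code running on Copia: the function name is made up, and backing up to a word boundary is my own assumption rather than anything the spec asks for.

```python
def truncate_description(text, limit=500):
    """Cut text to at most `limit` characters, preferring a word boundary."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the last space so a word is not split, if there is one.
    space = cut.rfind(" ")
    return cut[:space] if space > 0 else cut
```

If the body contains markup, it should be stripped or escaped before truncating, or the cut may land inside a tag.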

Single entry view title fix. Added entry titles for single entry pages. Before today, if you viewed this entry through the permalink, the title would just say "Copia"; now it says "Copia ✏Copia housecleaning". I've wanted to do this for a while, but I was having the devil of a time figuring out how to do it with PyBlosxom. A scolding from Dan Connolly forced me to chase down a fix. For other PyBlosxom users the trick is to use the comments plug-in, copy the head.* flavor file to comment-head.*, and then update it to use the $title variable, which is the title of the entry itself ($blog_title is the title of the entire blog). In my case the updated HTML header template looks like:

<title>$blog_title &#x270F;$title</title>

I did get a report that Copia is incorrectly sending the Content-Type header `text/html;charset=ISO-8859-1`, but when I check using the LiveHTTPHeaders extension for Firefox on Linux it reports the correct charset=UTF-8 from the server. If anyone else can corroborate this issue, please leave a comment with the specific URL from which you noticed the error, your platform and browser, and the HTTP sniffing tool you were using. Thanks.
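If you capture the header with some other tool, a quick way to pull out the declared charset for comparison is the Python standard library's header parsing. This is just a convenience sketch; `declared_charset` is a made-up name, and it works on a header value you have already captured rather than fetching anything itself.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()
```

Passing the reported value `text/html;charset=ISO-8859-1` and the value I see, `text/html; charset=UTF-8`, makes the mismatch explicit.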

[Uche Ogbuji]

via Copia