Processing "Web 2.0" using XSLT document() variants? No thanks.

Mark Nottingham has written an intriguing piece "XSLT for the Rest of the Web". It's drummed up some interest, some of which has even leaked into the 4Suite mailing list thanks to the energetic Sylvain Hellegouarch. Mark says:

I’ve raved before about how useful the XSLT document() function is, once you get used to it. However, the stars have to be aligned just so to use it; the Web site can’t use cookies for anything important, and the content you’re interested in has to be available in well-formed XML.

He goes on to present a set of extension functions he's created for libxslt. They are basically smarter document() functions that can do fancy Web things, including HTTP POST, and using HTML Tidy to grab tag soup HTML as XHTML.

As I read through it, I must say my strong impression was "been there, done that, probably never looking back". Certainly no diss of Mark intended there. He's one of the sharper hackers I know. I guess we're just at different points in our thinking of where XSLT fits into the Web-savvy apps toolkit.

First of all, I think the Web has more dragons than you could easily tame with even the mightiest XSLT extension hackery. I think you need general-purpose programming language to wrangle "Web 2.0" without drowning in tears.

More importantly, if I ever needed XSLT's document() function to process anything more than it's spec'ed to, I would consider that a pretty strong indicator that it's time to rethink part of my application architecture.

You see, I used to be a devotee of XSLT all over the place, and XSLT extensions for just about every limitation of the language. Heck, I wrote a whole framework of such things into 4Suite Repository. I've since reformed. These days I take the pipeline approach to such processing, and I keep XSLT firmly in the narrow niche for which it was designed. I have more on this evolution of thinking in "Lifting XSLT into application domain with extension functions?".

But back to Mark's idea. I actually implemented 4Suite XSLT extensions to use HTTP POST and to tidy tag soup HTML into XHTML, but I wouldn't dream of using these extensions any more. Nowadays, I use Python to gather and prepare data into a model representation that I then hand over to XSLT for pure presentation processing. Complex logical tasks such as accessing Web data beyond trivially fetched XML are matters for the model layer, and not the presentation logic. For example, if I need to tidy something, I tidy it at the Python level and put what I need of the resulting XHTML into the model XML before passing it to XSLT. I use Amara XML Toolkit with John Cowan's TagSoup for my tidying needs. I prefer TagSoup rather than tidy because I find it's faster and more robust.

Even if you use the libxml2 family of tools, I still think it's better to use libxml, and perhaps the libxml HTML parser to do the model processing and hand over resulting XML to libxslt in a separate step.

XSLT is pretty cool, but these days rather than reproduce all of Python's dozens of Web processing libraries therein, I plump for Python itself.

[Uche Ogbuji]

via Copia

"Tip: Computing word count in XML documents" pubbed

"Tip: Computing word count in XML documents"

XML is text and yet more than just text -- sometimes you want to work with just the content rather than the tags and other markup. In this tip, Uche Ogbuji demonstrates simple techniques for counting the words in XML content using XSLT with or without additional tools.

It was just a few weeks after I sent the manuscript to the editor that this thread started up on XML-DEV. Spooky timing.

[Uche Ogbuji]

via Copia

Dare's XLINQ examples in Amara

Dare's examples for XLINQ are interesting. They are certainly more streamlined than the usual C# and Java fare I see, but still a bit clunky compared to what I'm used to in Python. To be fair a lot of that is on the C# language, so I'd be interested in seeing what XLINK looks like from Python.NET or Boo.

The following is my translation from Dare's fragments into corresponding Amara fragments (compatible with the Amara 1.2 branch).

'1. Creating an XML document'

import amara
#Done in 2 chunks just to show the range of options
#Another way would be to start with amara.create_document
skel = '<!--XLinq Contacts XML Example--><?MyApp 123-44-4444?><contacts/>'
doc = amara.parse(skel)
  <name>Patrick Hines</name>
    <street1>123 Main St</street1>
    <city>Mercer Island</city>

'2. Creating an XML element in the "" namespace'

doc.xml_create_element(u'contacts', u'')

'3. Loading an XML element from a file'


'4. Writing out an array of Person objects as an XML file'

persons = {}
persons[u'Patrick Hines'] = [u'206-555-0144', u'425-555-0145']
persons[u'Gretchen Rivas'] = [u'206-555-0163']
for name in persons:
    for phone in persons[name]:
print doc.xml()

'5. Print out all the element nodes that are children of the <contact> element'

for c in contact.xml_child_elements():
    print c.xml()

'6. Print all the <phone> elements that are children of the <contact> element'

for c in contact.xml_xpath(u'phone'):
    print c.xml()

'7. Adding a <phone> element as a child of the <contact> element'


'8. Adding a <phone> element as a sibling of another <phone> element'

mobile = contacts.xml_create_element(u'phone', content=u'206-555-0168')
first =
contacts.xml_insert_after(first, mobile)

'9. Adding an <address> element as a child of the <contact> element'

contacts.xml_append_fragment("""  <address>
    <street1>123 Main St</street1>
    <city>Mercer Island</city>

'10. Deleting all <phone> elements under a <contact> element'

for p in contact.xml_remove_child(p)

'11. Delete all children of the <address> element which is a child of the <contact> element'

'12. Replacing the content of the <phone> element under a <contact> element'

#Not really necessary: just showing how to clear the content = u'425-555-0155'

'13. Alternate technique for replacing the content of the <phone> element under a <contact> element' = u'425-555-0155'

'14. Creating a contact element with attributes multiple phone number types'

#I'm sure it's clear by now how easy this would be with xml_append_fragment
#So here is the more analogous API approach
contact = contacts.xml_create_element(u'contact')
contact.xml_append(contact.xml_create_element(u'name', content=u'Patrick Hines'))
                               attributes={u'type': u'home'},
                               attributes={u'type': u'work'},

'15. Printing the value of the <phone> element whose type attribute has the value "home"'

print u'Home phone is:', contact.xml_xpath(u'phone[@type="home"]')

'16. Deleting the type attribute of the first <phone> element under the <contact> element'


'17. Transforming our original <contacts> element to a new <contacts> element containing a list of <contact> elements whose children are <name> and <phoneNumbers>'

new_contacts = doc.xml_create_element(u'contacts')
for c in
    for p in

'18. Retrieving the names of all the contacts from Washington, sorted alphabetically '

wash_contacts = contacts.xml_xpath(u'contact[address/state="WA"]')
names = [ unicode( for c in ]

[Uche Ogbuji]

via Copia

Solution: simple XML output "templates" for Amara

A few months ago in "Sane template-like output for Amara" I discussed ideas for making the Amara output API a little bit more competitive with full-blown templating systems such as XSLT, without adopting all the madness of template frameworks.

I just checked in the simplest patch that does the trick. Here is an example from the previous article:

Amara 1.0 code:

person_elem = newdoc.xml_element(
        attributes={u'name': unicode(}

Proposed Amara 1.2 code:

newdoc.xml_append_template("<person name='{}'/>")

What I actually checked into CVS today for Amara 1.2:

newdoc.xml_append_fragment("<person name='%s'/>"

That has the advantage of leaning as much as possible on an existing Python concept (formatted strings). As the method name indicates, this is conceptually no longer a template, but rather a fragment of XML in text form. The magic for Amara is in allowing one to dynamically create XML objects from such fragments. I think this is a unique capability (shared with 4Suite's MarkupWriter) for Python XML output APIs (I have no doubt you'll let me know if I'm wrong).

Also, I think the approach I settled on is best in light of the three "things to ponder" from the older article.

  • Security. Again I'm leaning on a well-known facility of Python, and not introducing any new holes. The original proposal would have opened up possible issues with tainted strings in the template expressions.
  • String or Unicode? I went with strings for the fragments. It's up to the developer to make sure that however he constructs the XML fragment, the result is a plain string and not a Unicode object.
  • separation of model and presentation. There is a very clear separation between Python operations to build a string XML fragment (these are usually the data model objects), and any transforms applied to the resulting XML binding objects (this is usually the separate presentation side). Sure a determined developer can write spaghetti, but I think that with xml_append_fragment it's possible and natural to have a clean separation. With most template systems, this is very hard to achieve.

One other thing to mention is that the dynamic incorporation of the new fragment into the XML binding makes this a potential building block for pipelined processing architecture.

def process_link(body, href, content):
    body.xml_append_fragment('%s'%(href, content))
    #Send the "a" element object that was just appended to
    #the next pipeline stage

def check_unique(a_node):
    if not a_node.href in g_link_dict:
        #index the href to the link text (a element text content)
        g_link_dict[a_node.href] = unicode(a_node)

[Uche Ogbuji]

via Copia

XHTML tutorial pubbed

"XHTML, step-by-step"

Start working with Extensible Hypertext Markup Language. In this tutorial, author Uche Ogbuji shows you how to use XHTML in practical Web sites.

Get started working with Extensible Hypertext Markup Language. XHTML is a language based on HTML, but expressed in well-formed XML. But XHTML is much more than just regularizing tags and characters -- XHTML can alter the way you approach Web design. This tutorial gives step-by-step instruction for developers familiar with HTML who want to learn how to use XHTML in practical Web sites.

In this tutorial

  • Tutorial introduction
  • Anatomy of an XHTML Web page
  • Understand the ground rules
  • Replace common HTML idioms
  • Some practical considerations
  • Wrap up

[Uche Ogbuji]

via Copia

Is RDF moving beyond the desperate hacker? And what of Microformats?

I've always taken a desperate hacker approach to RDF. I became a convert to the XML way of expressing documents right away, in 1997. As I started building systems that managed collections of XML documents I was missing a good, declarative means for binding such documents together. I came across RDF, and I was sold. I was never really a Semantic Web head. I used RDF more as a desperate hacker with problems in a fairly well-contained domain. At that time the Sem Web aspirations behind RDF didn't get in the way too badly, so all was well for me. My desperate hacker mindset is probably best summarized in this XML-DEV message from may, 2001.

I see RDF as an excellent modeling tool for closed systems. In my practice, most of the real "knowledge" is in the XML documents at the nodes, but RDF can provide important indexing and relationship expression between these nodes.

I go on to in that message expand on where RDF fits into the architecture of apps for my needs. I also mention a bit of wariness about how RDF's extravagant ambition (i.e. Sem Web) could affect my simple, practical needs.

I quickly found out on www-rdf-logic that in the discussion there, the assumption appear to be that in the semantic Web the RDF statements would carry a heavy burden of the "knowledge" in the system. I've started to think that this idea is a straw man set up by folks who would like RDF to be a fully-blown knowledge-representation language, but if "strong RDF" is indeed a cog in the SW wheel, I fear I must excuse myself from contributing to that discussion because It places me immediately out of my depth.

I've spent a lot of time with RDF, and for a while it was a big part of our consulting practice, but recently applications architecture and schema design (RELAX NG mostly, thank goodness) have been the biggest part of the day job. Honestly, I started to lose touch with where RDF was going. I knew there were some common-sense fixes to bugs in the 1999 specs, but I also knew there were some worrying injections of Sem Web think into the model core. Recently I've had some opportunity to catch up. SPARQL just doesn't fit my head, so a few of us in the Versa 1.0 gang, including Mike Olson and Chimezie, have started work towards Versa 2.0. Mike and Chime have kept up with the state of RDF, and in several discussions, I expressed what I felt were simple view of the RDF model and got in response what I thought were overblown claims about how the RDF model's semantics have been updated. In all cases when I checked the relevant parts of the latest RDF specs I found that Mike and Chime were right and it was rather the RDF model itself that was overblown.

I've developed an overall impression of dismay at the latest RDF model semantics specs. I've always had a problem with Topic Maps because I think that they complicate things in search of an unnecessary level of ontological purity. Well, it seems to me that RDF has done the same thing. I get the feeling that in trying to achieve the ontological purity needed for the Semantic Web, it's starting to leave the desperate hacker behind. I used to be confident I could instruct people on almost all of RDF's core model in an hour. I'm no longer so confident, and the reality is that any technology that takes longer than that to encompass is doomed to failure on the Web. If they think that Web punters will be willing to make sense of the baroque thicket of lemmas (yes, "lemmas", mi amici docte) that now lie at the heart of RDF, or to get their heads around such bizarre concepts as assigning identity to literal values, they are sorely mistaken. Now I hear the argument that one does not need to know hedge automata to use RELAX NG, and all that, but I don't think it applies in the case of RDF. In RDF, the model semantics are the primary reason for coming to the party. I don't see it as an optional formalization. Maybe I'm wrong about that and it's the need to write a query language for RDF (hardly typical for the Web punter) that is causing me to gurgle in the muck.

Assuming it were time for a desperate hacker such as me to move on (and I'm not necessarily saying that I am moving on), where would he go from here? I hear the chorus: microformats. But I see nothing but nasty pricklies down that road. IMO microformats are now where RDF was back in 1999 (actually more like 1998) in terms of practical use to the Web, but in making their specification nothing but a few notes scribbled in a WIki, they are purely syntactic, and offer no semantic anchor. As such, I'm not sure why it makes sense to think of microformats as different from XML ca. 1997. What's the news there? They certainly don't solve my desperate hacker need for indexing and expressing relationships across XML documents. I don't need the level of grounding that RDF seems to so slavishly be aiming for these days, but I need more than scattered Wiki notes.

GRDDL is the RDF community's bid to fix microformats up with some grounding. Funny thing is that in GRDDL they are re-discovering what the desperate hackers at Fourthought devised almost four years ago in "document definitions" to map XML syntax to RDF statements using XPath and XSLT. The desperate hacker in me feels at the same time vindicated, and left in the weeds. Sure GRDDL gets RDF some of what I've thought it's needed for ages, but it still does wed microformats to the present-day RDF model, which is just what I'm becoming uneasy about.

I'm more wandering around than getting anywhere in this entry, I freely admit. Working the grounding layer for XML is still what I consider to be my work of primary career interest. Lately, this work has led me more in the direction of schema annotations, as you can see in some of my recent articles on IBM developerWorks. Architectural forms are the closest thing the SGML old-heads gave us to syntax-semantic grounding (grounded to HyTime, of course), and AF were a creature of the schema. Perhaps it's high time we went back to learn that old-head lesson and quit fiddling around with brittle post-schema transformations.

As for the modeling system to use as the basis for grounding XML syntax, I don't know. I stick to RDF for now, but I'll have to see if it's possible to use it interoperably while still ignoring the more esoteric flourishes it's picked up lately. The Versa discussions at first gave me the impression that these flourishes are inevitable, but more recent threads have been a bit more encouraging.

I certainly hope that it doesn't take another rewind to RDF circa 2000 to satisfy the desperate hacker.

[Uche Ogbuji]

via Copia

Live Markdown Compilation via 4XSLT / 4Suite Repository

Related to uche's recent entry about PyBlosxom + CherryPy, I recently wrote a 4XSLT extension that compiles a 4Suite Repository RawFile (which holds a Markdown document) into an HTML 4.1 document on the fly. I'm using it to host a collaborative markdown-based Wiki.

The general idea to allow the Markdown document to reside in the repository and be editable by anyone (or specific users). The raw content of that document can be viewed with a different URL: . That is the actual location of the file, the previous URL is actually a REST service setup with the 4Suite Server instance running on metacognition that listens for requests with a leading /markdown and redirects the request to a stylesheet that compiles the content of the file and returns an HTML document.

The relevant section of the server.xml document is below:

         xslt-transform='/extensions/RenderMarkdown.xslt'   />

This makes use of a feature in the 4Suite Repository Server architecture that allows you to register URL patterns to XSLT transformations. In this case, all incoming requests for paths with a leading /markdown are interpreted as a request to execute the stylesheet /extensions/RenderMarkdown.xslt with a top-level path parameter which is the full path to the markdown document (/markdown-documents/RDFInterfaces.txt in this case). For more on these capabilities, see: The architecture of 4Suite Web applications.

The rendering stylesheet is below:

<?xml version="1.0" encoding="UTF-8"?>
        extension-element-prefixes="exsl md fcore"
        exclude-result-prefixes="fcore ftext exsl md xsl">
          doctype-public="-//W3C//DTD HTML 4.01 Transitional//EN" 
        <xsl:param name="path"/>
        <xsl:param name="title"/>
        <xsl:param name="css"/>
        <xsl:template match="/">        
            <title><xsl:value-of select="$title"/></title>         
            <link href="{$css}" type="text/css" rel="stylesheet"/>
            <xsl:copy-of select="md:renderMarkdown(fcore:get-content($path))"/>

This stylesheet makes use of a md:renderMarkdown extension function defined in the Python module below:

from pymarkdown import Markdown
    from Ft.Xml.Xslt import XsltElement,ContentInfo,AttributeInfo
    from Ft.Xml.XPath import Conversions
    from Ft.Xml import Domlette


    def RenderMarkdown(context, markDownString):
        dom = Domlette.NonvalidatingReader.parseString(str(rt),"urn:uuid:Blah")
        return [dom.documentElement]

    ExtFunctions = {
        (NS, 'renderMarkdown'): RenderMarkdown,

Notice that the stylesheet allows for the title and css to be specified as parameters to the original URL.

The markdown compilation mechanism is none other than the used by Copia.

For now, the Markdown documents can only be edited remotely by editors that know how to submit content over HTTP via PUT as well as handle HTTP authentication challenges if met with a 401 for a resource in the repository that isn't publicly available (in this day and age it's a shame there are only a few such editors - The one I use primarily is the Oxygen XML Editor).

I hope to later add a simple HTML-based form for live modification of the markdown documents which should complete the very simple framework for a markdown-based, 4Suite-enabled mini-Wiki.

Chimezie Ogbuji

via Copia

Python + XML = wary coexistence

There has been quite a bit of discussion triggered by my article "Python and XML: Should Python and XML Coexist?". This sort of thing always surprises me more than it should. I like to post code-heavy articles and leave the philosophy to the occasional entry, or to this very Weblog, but it seems that people respond more vocally to philosophy than to code. Perhaps I'll discuss with Kendall, my editor, what this suggests in terms of future directions for my Python/XML column.

Anyway, first I point to PJE's response. I used quotes from his Weblog as jumping-off points for my article.

Uche Ogbuji liberally quotes from and analyzes two of my XML-v.-Python rants, and actually gets it completely right. Since at least one of those rants has been cited as meaning I think XML is the spawn of Satan, I'm glad Uche read closely enough to get the context and nuance, without projecting things into it that I didn't say. Kudos!

I don't claim to know whom PJE speaks of when he refers to other commentary on his rant, but Martijn Faassen indicated his own response. I do think that Martijn missed some of PJE's intended nuance, but to be fair, it took me more than one reading to catch that nuance. I think that PJE could have saved himself a lot of misunderstanding, but hell, I've had my turn at thickly nuanced rant myself, so I see both sides. Looking more broadly at the landscape, Martijn puts succinctly what I've said in the past.

This disdain for XML technologies is very common among Python programmers.

But maybe that means something greater than petty rivalry. Mike Champion brought up my article on XML-DEV:

For some time now we've seen the JSON "fat-free alternative to XML" direction that some in the AJAX world are taking to address both XML's inefficiency and the mismatch with programming languages. Now I see that many in the Python community have a similar attitude toward XML and encourage its use only when necessary to exchange data with non-Python apps.

He followed with a list of thoughts, touching on the likely roles of JSON, Python, XML, and more, and I responded. To much to quote from the exchange. Read the originals yourself, if you like. I will mention the final thought in my response:

In many ways I think a vicious backlash from programming languages against XML is just what XML needs right now.

In saying that, I had in mind some of my other prosaic articles about the direction of XML, including:

I think that many XML folks have been working to encroach on the territory of languages such as Python, even if Python folks aren't always clear on this fact while complaining about XML. We'll just have to see how it all shakes out. I know what pattern of tool usage I'll stick to for now. Speaking of omni-tools, Dimitre Novatchev put in a plug for XSLT as general-purpose programming language, which he's also done here in Copia comments. I still think it's a bad idea to treat XSLT as anything other than a template language. XSLT in its place, Python (or Javascript, Ruby, or whatever) in its place.

In the comments on my article there are some interesting bits, including one correspondent's mention of the importance of open file formats, and the XML's role in this, followed bewilderingly by:

C++ is so powerful that with the right classes, many of the advantages of a scripting language are attainable.

Sounds like someone who badly needs to actually try Python.

[Uche Ogbuji]

via Copia

Extracting RDF from XML in 'Closed' vs 'Open Systems'

For some time, I had wanted to write a bit about 4Suite's Document Definitions - especially after first reading about the concept of Gleaning Resource Descriptions from Dialects of Languages (GRDDL). You see, the idea isn't so novel to me since I've been involved in 4Suite development for some time and familiar with the concept of a Document Definition. Unfortunately, 4Suite's Achilles heel is documentation (no pun intended), but I've managed to find a representative thread on the subject within the mailing list archives. In addition, I also included a decent definition (by Mike Brown) from his overview of the repository:

A DocumentDefinition is a resource that describes how to derive RDF statements from the XML -- deserialization guidelines, basically. Its content can either be XML or XSLT that follows certain guidelines. When the XmlDocument that is associated with this docdef is created, updated, or deleted, RDF statements will be updated automatically in the user model. This is really powerful, and is described in more detail here (free registration required). As an example, if the XML doc is XHTML, then you could write a docdef to generate a Dublin Core 'title' RDF statement from the /html/head/title element. Anytime the XML doc is updated, the RDF statements derived from it via the docdef will also be updated. These statements, being automatically managed, are stored in the "system" model, but there has been some discussion as to whether that is appropriate and how it might change in the future. Only one docdef can be associated with a document, but docdefs can import definitions from one another, if needed

The primary difference between GRDDL (as I understand the principle) and Document Definitions is that GRDDL is an attempt to provide a mechanism for extracting RDF from microformats (subsets of XHTML) 'in the wild.' The XML content transformed (via XSLT) is often embedded within presentation markup and perhaps constructed w/ little regard to validity (with respect to a governing schema). The value is in being able to harvest RDF content from sources designed with more human readability than machine readability in mind. The sheer number of such documents is a multiplicative factor to how much useful information can be extracted.

Document Definitions on the other hand are meant to work in a closed system where the XML vocabulary is self-contained and most often valid (with respect to a well known format) as well as well-formed (the requirement common to both scenarios). The different contexts are very significant and describe two completely divergent approaches to applying RDF to solve Knowledge Management problems.

There are some well known advantages to writing XML->RDF transforms for closed vocabularies / systems (portability, easing the RDF/XML serialization learning curve,etc..) and there are some that not as well known (IMHO). In particular, writing transforms for closed vocabularies essentially allows the XML vocabulary to behave as a communication medium between systems that 'speak XML' and an RDF datastore.

Consider Bill de hOra's issues with binding forms (HTML in his case) to RDF via the RDF/XML syntax. This is an irresolvable disaster and the culprit is the violent impedance mismatch between the XML and RDF data structures that manifests itself in the well documented horrors of RDF/XML as a persistent representation of an RDF graph.

Consider a more elegant architecture: Building an XForms UI on top of XML instances (associated with - but not necessarily validated by - a schema) and automatically transposed (by a transform written once) to a corresponding RDF graph. The strengths of both data formats are emphasized in this scenario and the impedance mismatch is completely resolved by pushing the onus from forms authoring to a well designed transform (written once only).

[Uche Ogbuji]

via Copia

XSLT 2.0 and push/pull

I just finally got a chance to read Bob DuCharme's article "Push, Pull, Next!", which starts by referring to my "Push vs pull XSLT". It shows how one might use XSLT 2.0's xsl:next-match to stay with push in some instances where pull becomes attractive. This instruction is similar in idea to XSLT 1.0's xsl:apply-imports, except that it doesn't require you to organize templates into separate, imported files. It also supports xsl:with-param, which is also available in XSLT 2.0's version of xsl:apply-imports. Bob wasn't clear enough in his article that XSLT 1.0 also has xsl:apply-imports, but that's clarified int he comments. One important aspect of the use of these instructions in XSLT 2.0 is that xsl:with-param becomes so much more useful in the new version now that default template rules no longer discard parameters. XSLT 2.0 did manage here to squash one of the bigger gotchas in XSLT 1.0.

[Uche Ogbuji]

via Copia