Solution: simple XML output "templates" for Amara

A few months ago in "Sane template-like output for Amara" I discussed ideas for making the Amara output API a little bit more competitive with full-blown templating systems such as XSLT, without adopting all the madness of template frameworks.

I just checked in the simplest patch that does the trick. Here is an example from the previous article:

Amara 1.0 code:

person_elem = newdoc.xml_element(
        attributes={u'name': unicode(}

Proposed Amara 1.2 code:

newdoc.xml_append_template("<person name='{}'/>")

What I actually checked into CVS today for Amara 1.2:

newdoc.xml_append_fragment("<person name='%s'/>"

That has the advantage of leaning as much as possible on an existing Python concept (formatted strings). As the method name indicates, this is conceptually no longer a template, but rather a fragment of XML in text form. The magic for Amara is in allowing one to dynamically create XML objects from such fragments. I think this is a unique capability (shared with 4Suite's MarkupWriter) for Python XML output APIs (I have no doubt you'll let me know if I'm wrong).

Also, I think the approach I settled on is best in light of the three "things to ponder" from the older article.

  • Security. Again I'm leaning on a well-known facility of Python, and not introducing any new holes. The original proposal would have opened up possible issues with tainted strings in the template expressions.
  • String or Unicode? I went with strings for the fragments. It's up to the developer to make sure that however he constructs the XML fragment, the result is a plain string and not a Unicode object.
  • separation of model and presentation. There is a very clear separation between Python operations to build a string XML fragment (these are usually the data model objects), and any transforms applied to the resulting XML binding objects (this is usually the separate presentation side). Sure a determined developer can write spaghetti, but I think that with xml_append_fragment it's possible and natural to have a clean separation. With most template systems, this is very hard to achieve.

One other thing to mention is that the dynamic incorporation of the new fragment into the XML binding makes this a potential building block for pipelined processing architecture.

def process_link(body, href, content):
    body.xml_append_fragment('%s'%(href, content))
    #Send the "a" element object that was just appended to
    #the next pipeline stage

def check_unique(a_node):
    if not a_node.href in g_link_dict:
        #index the href to the link text (a element text content)
        g_link_dict[a_node.href] = unicode(a_node)

[Uche Ogbuji]

via Copia

Nofollow-free Copia

I read this interesting post by Darryl VanDorp:

Hey people new to this whole blog thing. If you want me to actually leave a comment on that new blog of yours. Turn off that damn default nofollow in your shiny wordpress installation ... If I take the time to leave a comment give me some juice!

I quite agree. Yes, I know nofollow came about as a way to discourage Weblog spammers, but I don't believe it will ever work. It is just too easy to write robots that leave comment spam, and unless 100% of systems use nofollow, which is never likely, it will always be worth the spammers' while to keep trying on the off chance that they find sites without nofollow. Also, spammers are probably happy to have their messages on Weblogs, google juice or no.

Look, we don't want comment spam on Copia, period. It's not enough to deprive them of google juice. We work to keep the site spam-free. All comments go in an approval queue, and we have a lot of handy little tools to help eliminate comment spam, so we can batch approve the remaining goodness. I think this takes a lot less effort than the actual work of composing entries on the blog (and I'm a fast writer). What's more relevant, it's less effort than it takes for visitors to compose their comments. As a result of this effort there is no need to use nofollow on Copia, and we don't do so. I don't know whether commenters get significant juice from Copia, but what juice we do have to give we shall not stint our correspondents. (Not even if they are here to engage in the heresy of disputing our positions.)

I can't help thinking of the gardening analogy. If you want a nice garden, you have to roll up your sleeves and pull up your weeds. There is no quick fix to preventing weeds: the conditions that make your soil attractive for germination of desirable flowers also make it attractive to Uncle Darwin's rogues.

[Uche Ogbuji]

via Copia

Today's XML WTF

via Sam Ruby:

While [REXML] is certainly the most elegant Ruby XML API, it seems to accept a variety of ill-formed XML fragments, for example the following produces no error: [<div>at&t]

F'real? That is, not only missing end tag, but also unescaped ampersand?

It is just not frigging cool to be releasing anything called an XML parser or processor in 2005 that does not reject ill-formed XML. Folks, well-formedness is the entire point of XML. If that's an inconvenient fact for you, please be so kind as to use something other than XML. What is even more galling is this from the REXML home page:

REXML is an XML processor for the language Ruby. REXML is conformant (passes 100% of the Oasis non-validating tests), and includes full XPath support.

On Sam's evidence (and you don't get much more credible than Sam Ruby), this statement is quite false. The OASIS XML 1.0 tests have a whole section covering rejection of non-well-formed documents.

Sam goes on to say:

Peeking into the implementation of REXML, I see that it is riddled with regular expressions. Having a parser that doesn’t detect errors properly is one thing, but having a parser that incorrectly parses valid input is quite another. I’ve opened a ticket on one such problem.  Depending on how it is received, I may open others.

OK. Let's hope the REXML folks pay attention to Sam and get things right.

And before Python folks get all smug, it seems that such fast and loose interpretations of what "XML" means is hardly alien to the Python community. Here's a thread on the XML-SIG with a "list of packages handling XML 1.1". Any sensible person would expect these to be XML 1.1 parsers, but no, it turns out that the title is a bit of casuistry, and that at least 2 of the 4 listed packages accept ill-formed XML 1.1. It seems to me that pyparsing, Python's re library, Python's string methods, and any other Python software that does anything with strings should be added to such a list. The only way I could imagine such a list being redeemed is if entries that did not accept well-formed XML 1.1 at least offered warnings of ill-formedness, and could thus serve as tidy-like tools for fixing broken XML. This does not seem to be the case.

As I've said in the past, I don't claim that only XML parsers and processors should be used to work with XML. Heck, I use grep, wc, sed and the usual text tools all the time with XML documents. I do say that it is dishonest to call something an XML parser or processor unless you treat non-compliance as a bug. I guess it's the old social principle all over again. XML is hot, so it's voguish to be called an XML processor, yet it's all so tempting to shirk the required work.

[Uche Ogbuji]

via Copia

XHTML tutorial pubbed

"XHTML, step-by-step"

Start working with Extensible Hypertext Markup Language. In this tutorial, author Uche Ogbuji shows you how to use XHTML in practical Web sites.

Get started working with Extensible Hypertext Markup Language. XHTML is a language based on HTML, but expressed in well-formed XML. But XHTML is much more than just regularizing tags and characters -- XHTML can alter the way you approach Web design. This tutorial gives step-by-step instruction for developers familiar with HTML who want to learn how to use XHTML in practical Web sites.

In this tutorial

  • Tutorial introduction
  • Anatomy of an XHTML Web page
  • Understand the ground rules
  • Replace common HTML idioms
  • Some practical considerations
  • Wrap up

[Uche Ogbuji]

via Copia

Firing SAX events from a DOM tree in 4Suite

One nice thing about the briskly-moving 4Suite documentation project is that it is shining a clear light on places where we need to make the APIs more versatile. Adding convenience parse functions was one earlier result.

Saxlette has the ability to walk a Domlette tree, firing off events to a handler as if from a source document parse. This ability used to be too well, hidden, though, and I made an API addition to make it more readily available. This is the new Ft.Xml.Domlette.SaxWalker. The following example should show how easy it is to use:

from Ft.Xml.Domlette import SaxWalker
from Ft.Xml import Parse

XML = ""

class element_counter:
    def startDocument(self):
        self.ecount = 0

    def startElementNS(self, name, qname, attribs):
        self.ecount += 1

#First get a Domlette document node
doc = Parse(XML)
#Then SAX "parse" it
parser = SaxWalker(doc)
handler = element_counter()
#You can set any properties or features, or do whatever
#you would to a regular SAX2 parser instance here
parser.parse() #called without any argument
print "Elements counted:", handler.ecount

Again Saxlette and Domlette are fully implemented in C, so you get great performance from the SaxWalker.

[Uche Ogbuji]

via Copia

Python/XML column #37 (and out): Processing Atom 1.0

"Processing Atom 1.0"

In his final Python-XML column, Uche Ogbuji shows us three ways to process Atom 1.0 feeds in Python. [Sep. 14, 2005]

I show how to parse Atom 1.0 using minidom (for those who want no additional dependencies), Amara Bindery (for those who want an easier API) and Universal Feed Parser (with a quick hack to bring the support in UFP 3.3 up to Atom 1.0). I also show how to use DateUtil and Python 2.3's datetime to process Atom dates.

As the teaser says, we've come to the end of the column in its present form, but it's more of a transition than a termination. From the article:

And with this month's exploration, the Python-XML column has come to an end. After discussions with my editor, I'll replace this column with one with a broader focus. It will cover the intersection of Agile Languages and Web 2.0 technologies. The primary language focus will still be Python, but there will sometimes be coverage of other languages such as Ruby and ECMAScript. I think many of the topics will continue to be of interest to readers of the present column. I look forward to continuing my relationship with the audience.

It is too bad that I don't get to some of the articles that I had in the queue, including coverage of lxml pygenx, XSLT processing from Python, the role of PEP 342 in XML processing, and more. I can still squeeze some of these topics into the new column, I think, as long as I make an emphasis on the Web. I'll also try to keep up my coverage of news in the Python/XML community here on Copia.

Speaking of such news, I forgot to mention in the column that I'd found an interesting resource from John Shipman.

[F]or my relatively modest needs, I've written a more Pythonic module that uses minidom. Complete documentation, including the code of the module in 'literate programming' style, is at:

The relevant sections start with section 7, "".

[Uche Ogbuji]

via Copia

Is RDF moving beyond the desperate hacker? And what of Microformats?

I've always taken a desperate hacker approach to RDF. I became a convert to the XML way of expressing documents right away, in 1997. As I started building systems that managed collections of XML documents I was missing a good, declarative means for binding such documents together. I came across RDF, and I was sold. I was never really a Semantic Web head. I used RDF more as a desperate hacker with problems in a fairly well-contained domain. At that time the Sem Web aspirations behind RDF didn't get in the way too badly, so all was well for me. My desperate hacker mindset is probably best summarized in this XML-DEV message from may, 2001.

I see RDF as an excellent modeling tool for closed systems. In my practice, most of the real "knowledge" is in the XML documents at the nodes, but RDF can provide important indexing and relationship expression between these nodes.

I go on to in that message expand on where RDF fits into the architecture of apps for my needs. I also mention a bit of wariness about how RDF's extravagant ambition (i.e. Sem Web) could affect my simple, practical needs.

I quickly found out on www-rdf-logic that in the discussion there, the assumption appear to be that in the semantic Web the RDF statements would carry a heavy burden of the "knowledge" in the system. I've started to think that this idea is a straw man set up by folks who would like RDF to be a fully-blown knowledge-representation language, but if "strong RDF" is indeed a cog in the SW wheel, I fear I must excuse myself from contributing to that discussion because It places me immediately out of my depth.

I've spent a lot of time with RDF, and for a while it was a big part of our consulting practice, but recently applications architecture and schema design (RELAX NG mostly, thank goodness) have been the biggest part of the day job. Honestly, I started to lose touch with where RDF was going. I knew there were some common-sense fixes to bugs in the 1999 specs, but I also knew there were some worrying injections of Sem Web think into the model core. Recently I've had some opportunity to catch up. SPARQL just doesn't fit my head, so a few of us in the Versa 1.0 gang, including Mike Olson and Chimezie, have started work towards Versa 2.0. Mike and Chime have kept up with the state of RDF, and in several discussions, I expressed what I felt were simple view of the RDF model and got in response what I thought were overblown claims about how the RDF model's semantics have been updated. In all cases when I checked the relevant parts of the latest RDF specs I found that Mike and Chime were right and it was rather the RDF model itself that was overblown.

I've developed an overall impression of dismay at the latest RDF model semantics specs. I've always had a problem with Topic Maps because I think that they complicate things in search of an unnecessary level of ontological purity. Well, it seems to me that RDF has done the same thing. I get the feeling that in trying to achieve the ontological purity needed for the Semantic Web, it's starting to leave the desperate hacker behind. I used to be confident I could instruct people on almost all of RDF's core model in an hour. I'm no longer so confident, and the reality is that any technology that takes longer than that to encompass is doomed to failure on the Web. If they think that Web punters will be willing to make sense of the baroque thicket of lemmas (yes, "lemmas", mi amici docte) that now lie at the heart of RDF, or to get their heads around such bizarre concepts as assigning identity to literal values, they are sorely mistaken. Now I hear the argument that one does not need to know hedge automata to use RELAX NG, and all that, but I don't think it applies in the case of RDF. In RDF, the model semantics are the primary reason for coming to the party. I don't see it as an optional formalization. Maybe I'm wrong about that and it's the need to write a query language for RDF (hardly typical for the Web punter) that is causing me to gurgle in the muck.

Assuming it were time for a desperate hacker such as me to move on (and I'm not necessarily saying that I am moving on), where would he go from here? I hear the chorus: microformats. But I see nothing but nasty pricklies down that road. IMO microformats are now where RDF was back in 1999 (actually more like 1998) in terms of practical use to the Web, but in making their specification nothing but a few notes scribbled in a WIki, they are purely syntactic, and offer no semantic anchor. As such, I'm not sure why it makes sense to think of microformats as different from XML ca. 1997. What's the news there? They certainly don't solve my desperate hacker need for indexing and expressing relationships across XML documents. I don't need the level of grounding that RDF seems to so slavishly be aiming for these days, but I need more than scattered Wiki notes.

GRDDL is the RDF community's bid to fix microformats up with some grounding. Funny thing is that in GRDDL they are re-discovering what the desperate hackers at Fourthought devised almost four years ago in "document definitions" to map XML syntax to RDF statements using XPath and XSLT. The desperate hacker in me feels at the same time vindicated, and left in the weeds. Sure GRDDL gets RDF some of what I've thought it's needed for ages, but it still does wed microformats to the present-day RDF model, which is just what I'm becoming uneasy about.

I'm more wandering around than getting anywhere in this entry, I freely admit. Working the grounding layer for XML is still what I consider to be my work of primary career interest. Lately, this work has led me more in the direction of schema annotations, as you can see in some of my recent articles on IBM developerWorks. Architectural forms are the closest thing the SGML old-heads gave us to syntax-semantic grounding (grounded to HyTime, of course), and AF were a creature of the schema. Perhaps it's high time we went back to learn that old-head lesson and quit fiddling around with brittle post-schema transformations.

As for the modeling system to use as the basis for grounding XML syntax, I don't know. I stick to RDF for now, but I'll have to see if it's possible to use it interoperably while still ignoring the more esoteric flourishes it's picked up lately. The Versa discussions at first gave me the impression that these flourishes are inevitable, but more recent threads have been a bit more encouraging.

I certainly hope that it doesn't take another rewind to RDF circa 2000 to satisfy the desperate hacker.

[Uche Ogbuji]

via Copia

Convenience APIs for 4Suite Domlette parsing

I added some functions to make a baby step into Domlette parsing. I call these the brain-dead APIs (though we use more decorous terms officially). You can now get yourself a crisp DOM with as little effort as:

>>> from Ft.Xml import Parse
>>> doc = Parse("<hello>world</hello>")

And thence on to the usual fun stuff:

>>> print doc.xpath(u"string(hello)")

Parse also knows how to handle streams (file-like objects):

>>> doc = Parse(open("hello.xml"))

Do a help(Parse) to get a warning about using that function with XML that is not self-contained, and an example of how to parse such XML properly in 4Suite.

There is also ParsePath, which handles files paths and URLs:

>>> from Ft.Xml import ParsePath
>>> doc = ParsePath("hello.xml")
>>> doc = ParsePath("")

And what do you know? That last URL does not parse. My feed is ill-formed. ☠☠☠☠☠ (I love cursing in Unicode) Sigh. Gotta fix the PyBlosxom Atom generation. Maybe we need a touch of 4Suite (Amara, actually) there, as well.

[Uche Ogbuji]

via Copia

Kumite! Python vs. Javascript! script vs. XForms! declarativity vs. wizards!

Sylvain's question about Javascript-without-tears certainly kicked off a chain of interesting dialogue, for me. It also happens to dovetail with some interesting dialog elsewhere that started independently.

Kurt had an interesting take in his follow-up "Javascript and Python"

To be blunt, I would prefer to see a Python interpreter being standard issue in all browsers in addition to a Javascript one, as I believe the language to be much more expressive, more object oriented, more secure, easier to write, and in general better suited to contemporary needs. Unfortunately, browsers in particular are informed as much by business and political decisions, some, perhaps even most, based less upon the best technology for a problem and more based upon what will provide the best backward compatibility to insure that existing websites do not break or that download sizes remain under some critical value.

I like Python as well, but boy do I shudder to imagine the political conflagration that would ensue if Python were elevated in Web programming over peers such as Perl and Ruby. Javascript appears to have carved out a niche as the Switzerland of dynamic languages.

I would like to see diversity in Web scripting languages, so that others have choices besides Javascript, and as Kurt says, Mozilla does seem to be taking the lead in this. It's really cool to see the Mozilla Wiki entry "Breaking the grip JS has on the DOM". And you gotta love the opening lines:

We want to change the grip JS has on the DOM and on XUL. We will do this in 2 steps:

Ideally, the first step could be done without consideration for the second, in the assumption at the implementation should be truly language neutral.

But regardless of scripting language, there is the fact that XForms is out there looking to take on the very need for scripting in many of the browser use cases. Kurt says:

I am a major advocate of XML, precisely because it is much more difficult to isolate a language when a mapping is essentially an XSLT transformation away. For this reason, XForms is a very attractive model to be moving towards, certainly, and I look forward to the day that I can build XForms applications that work in all browsers equally. However, XForms is not even remotely widely implemented yet nor are there standard forms of declarative binding languages along the lines of XBL (sXBL is getting there, but so far there are perhaps two still very much test implementations in existence).

Chime predictably takes exception to this (in a response to Kurt). He's been a very involved early adopter of XForms, and I've been amazed to see how productive XForms has made him, so I take his point very seriously when he says:

I beg to differ. Though it may be true that it doesn't quite have the traction that Flash has at the moment, I wouldn't go as far as saying it isn't remotely, widely implemented. There are several very mature implementations...
[...] [re:eliminating the need for an imperative language] Once again, I have yet find myself in a situation where I needed javascript for UI-related capabilities that weren't covered by XForms event processing, instance binding, and other such [programmatic] components. The only time I did was when I had to encode XML content as base64 encoded binary (see: and had little to do with XForms but more with the means of remote communication (SOAP). I'm not suggesting that frameworks such as XForms will eliminate the need for an imperative language, but rather that the need will be more like the reverse of your 80/20 proposition.

To be fair, I think Kurt was saying that script is only needed for the 20% case, and not the 80% case. He just felt that declarative solutions architecture "is hideously inappropriate for the remaining 20%." I do agree with that, but I think that people (not Kurt) tend to exaggerate this fact as an argument against declarative programming.

Chime wraps up:

I must give the disclaimer that I'm not suggesting XForms will be a user interface / browser-based application building [panacea], but rather that the potential it has to eliminate the unportable, architecturally unsound code that often drive DHTML web-sites with minimal complexity is very much overlooked.

Based on what I've seen of XForms, I tend to agree. I actually think Chime and Kurt are more in agreement than they sound, except maybe on the matter of the maturity of XForms engines.

At almost the same time another script versus XML exchange was going on in Mark Birbeck's blog, and in particular "On Adobe and XForms via Declarative Programming, Wizards and Aspects". That article is well worth reading in its entirety, but I'll highlight what he says about Wizards: nearly all cases I find the 'wizard approach' is great to get you started, but then very quickly gets complex again. Anyone who uses Microsoft's Visual Studio, for example, will know that getting a C++ application up and running quickly, with support for multiple windows, toolbars, printing and file saving, is a snip. But then when you want to modify that code and move away from the wizard, you are very soon into normal C++ territory.

I tend to put this point even more strongly, as I do in my article "The worry about program wizards", but I'm always happy to have it reinforced.

In some ways I see wizards as the shoddy high street knock-off of declarative systems. Well designed declarative systems fully encapsulate modal aspects of the application in development, and they expose slots for ready extension in imperative implementation, if needed. Wizards, on the other hand, do focus on parameters in a way tantalizingly like declarative systems, but then ruin the entire plot by handing the programmer a hairball of imperative code that they have to hack at arbitrarily in order to complete the application. It's the difference between just plugging a device into a USB port to add capability to your PC, rather than having the motherboard thrust in your face so that you can find the right place to solder in the leads.

Chimezie Ogbuji

via Copia

Thinking XML #33: Serving up WordNet as XML

"Thinking XML: Serving up WordNet as XML"

Subtitle: Build the basic WordNet/XML facilities into a Web server framework
Synopsis: A few articles back, Uche Ogbuji discussed WordNet 2.0, a Princeton University project that aims to build a database of English words and lexical relationships between them. He showed how to extract XML serializations from the word database. In this article he continues the exploration, demonstrating code to serve up these WordNet/XML documents over Web protocols and showing you how to access these from XSLT.

This is the second part of a mini-series within the column. The previous article is "Querying WordNet as XML,", in which I present Python code for processing WordNet 2.0 into XML. This time I use CherryPy to expose the XML on the Web, either in human-readable or in raw form. This seems to be part of a nice trend of CherryPy on developerWorks. I hope people see this as yet another example of how easy and clean CherryPy is.

See other articles in the column. Comments here on Copia or on the column's official discussion forum. Next up in Thinking XML, RDF equivalents for the WordNet/XML.

[Uche Ogbuji]

via Copia