daily links, using Amara

I added a new feature on Copia: Every day there will be an automated posting with mine and Chime's links from the previous day. You can see, in the previous Copia entry to this one, an example of the results.

What I find coolest is how easy it was to write, and how easy the resulting code is to understand. It's just 35 lines (including 7 lines of imports), and it packs in some useful features I haven't found in other such scripts, including:

  • Full Unicode safety (naturally, I wouldn't have it any other way)
  • Support for multiple feeds, with tag by author
  • Tagging the PyBlosxom entry with the aggregated/unique tags from the entries

Here's the code. The only external requirement is Amara:

import os
import sets
import time
import codecs
import itertools
from datetime import date, timedelta

from amara import binderytools


#Change BASEDIR and FEEDS to customize
BASEDIR = '/srv/www/'
FEEDS = ['', '']

now = time.gmtime()
timestamp = unicode(time.strftime('%Y-%m-%dT%H:%M:%SZ', now))
targetdate = (date(*now[:3]) - timedelta(1)).isoformat()

#Using Amara.  Easy to just grab the RSS feed
docs = map(binderytools.bind_uri, FEEDS)
items = itertools.chain(*[ doc.RDF.item for doc in docs ])
current_items = [ item for item in items
                       if unicode( ]
if current_items:
    # Create a Markdown page with the daily bookmarks.
    dir = '%s/%s' % (BASEDIR, targetdate)
    if not os.path.isdir(dir):
        os.makedirs(dir)
    f = codecs.open('%s/%s/' % (BASEDIR, targetdate), 'w', 'utf-8')

    # Pyblosxom Title
    f.write(u' bookmarks for %s\n' % targetdate)

    tags = sets.Set()
    for item in current_items:
        tags.update([ li.resource[len(TAGBASE):] for li in ])
    f.write(u'#post_time %s\n'%(timestamp))
    f.write(u'<!--keywords:,%s -->\n'%(u','.join(tags)))

    for item in current_items:
        # List of links in Markdown.
        title = getattr(item, 'title', u'')
        href = getattr(item, 'link', u'')
        desc = getattr(item, 'description', u'')
        creator = getattr(item, 'creator', u'')
        f.write(u'* "[%s](%s)": %s *(from %s)*\n' % (title, href, desc, creator))


Or download

You can see how easily you can process RSS 1.0 in Amara. I don't think actual RDF parsing/processing is a bit necessary. That extra layer is the first thing that decided me against Matt Biddulph's module, in addition to his use of libxml for XML processing, which is also used in Roberto De Almeida's.
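The core of the approach (pick out yesterday's items, collect their tags, emit a Markdown list) can also be sketched with nothing but the standard library. This is a rough Python 3 illustration using ElementTree rather than Amara; the sample feed, the helper names `items_for_day` and `to_markdown`, and the use of a space-separated `dc:subject` for tags are all assumptions made for the example, not part of the script above.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

# Namespaces used by RSS 1.0 feeds carrying Dublin Core metadata
NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'rss': 'http://purl.org/rss/1.0/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

# Hypothetical sample feed; a real script would fetch each URL in FEEDS
SAMPLE_FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://example.org/a">
    <title>First link</title>
    <link>http://example.org/a</link>
    <dc:creator>uche</dc:creator>
    <dc:date>2005-10-03T14:00:00Z</dc:date>
    <dc:subject>python xml</dc:subject>
  </item>
  <item rdf:about="http://example.org/b">
    <title>Older link</title>
    <link>http://example.org/b</link>
    <dc:creator>chimezie</dc:creator>
    <dc:date>2005-10-01T09:30:00Z</dc:date>
    <dc:subject>rdf</dc:subject>
  </item>
</rdf:RDF>"""


def items_for_day(feed_text, targetdate):
    """Return (items, tags) for entries whose dc:date falls on targetdate."""
    root = ET.fromstring(feed_text)
    items, tags = [], set()
    for item in root.findall('rss:item', NS):
        if item.findtext('dc:date', '', NS).startswith(targetdate):
            items.append(item)
            # Assumption: tags are space-separated in dc:subject
            tags.update(item.findtext('dc:subject', '', NS).split())
    return items, tags


def to_markdown(items):
    """Render the items as a Markdown bullet list."""
    lines = []
    for item in items:
        title = item.findtext('rss:title', '', NS)
        href = item.findtext('rss:link', '', NS)
        creator = item.findtext('dc:creator', '', NS)
        lines.append('* [%s](%s) *(from %s)*' % (title, href, creator))
    return '\n'.join(lines)


# Pretend "today" is 2005-10-04, so the target is the previous day
targetdate = (date(2005, 10, 4) - timedelta(1)).isoformat()
items, tags = items_for_day(SAMPLE_FEED, targetdate)
markdown = to_markdown(items)
```

As in the Amara version, the feed is treated as plain XML; no RDF model is built at any point.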

[Uche Ogbuji]

via Copia

Today's XML WTF: Internal entities in browsers

This unnecessary screw-up comes from the Mozilla project, of all places. Mozilla's XML support is improving all the time, as I discuss in my article on XML in Firefox, but the developer resources seem to lag the implementation, and this often leads to needless confusion. One that I ran into recently could perhaps be given the summary: "not everything in the Mozilla FAQ is accurate". From the Mozilla FAQ:

In older versions of Mozilla as well as in old Mozilla-based products, there is no pseudo-DTD catalog and the use of entities (other than the five pre-defined ones) leads to an XML parsing error. There are also other XHTML user agents that do not support entities (other than the five pre-defined ones). Since non-validating XML processors are not required to support entities (other than the five pre-defined ones), the use of entities (other than the five pre-defined ones) is inherently unsafe in XML documents intended for the Web. The best practice is to use straight UTF-8 instead of entities. (Numeric character references are safe, too.)

See the part in bold. Someone either didn't read the spec, or is intentionally throwing up a spec distortion field. The XML 1.0 spec provides a table in section 4.4: "XML Processor Treatment of Entities and References" which tells you how parsers are allowed to treat entities, and it flatly contradicts the bogus Mozilla FAQ statement above.

The main reason for the "WTF" is the fact that the Mozilla implementation actually gets it right. That it should. It's based on Expat. AFAIK Expat has always got this right (I've been using Expat about as long as the Mozilla project has been), so I'm not sure what inspired the above error. Mozilla should be touting its correct and useful behavior, rather than giving bogus excuses to its competitors.
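The Expat behavior is easy to check outside a browser. The following is a minimal sketch in Python 3, assuming the standard library's ElementTree (which is built on Expat) as a stand-in for Mozilla's parser; the document is a cut-down, hypothetical version of the test case, with no external DTD identifier and no namespace.

```python
import xml.etree.ElementTree as ET

# An entity declared in the internal DTD subset, then referenced in
# element content and in an attribute value
DOC = """<!DOCTYPE html [
<!ENTITY internal "This is text placed as internal entity">
]>
<html>
  <body>
    <p>&internal;</p>
    <abbr title="&internal;">Titpaie</abbr>
  </body>
</html>"""

root = ET.fromstring(DOC)
paragraph_text = root.find('body/p').text
abbr_title = root.find('body/abbr').get('title')
```

Both `paragraph_text` and `abbr_title` come back with the replacement text, even though the parser is non-validating.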

This came up last week in the IBM developerWorks forum where a user was having problems with internal entities in XHTML. It turns out that he was missing an XHTML namespace (and based on my experimentation was probably serving up XHTML as text/html which is generally a no-no). It should have been a clear case of "Mozilla gets this right, and can we please get other browsers to fix their bugs?" but he found that FAQ entry and we both ended up victims of the red herring for a little while.

I didn't realize that the Mozilla implementation was right until I wrote a careful test case in preparation for my next Firefox/XML article. The following CherryPy code is a test server set-up for browser rendering of XHTML.

import cherrypy

INTENTITYXHTML = '''\
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
         "" [
<!ENTITY internal "This is text placed as internal entity">
]>
<html xmlns="" xml:lang="en-US">
  <head>
    <title>Using Entity in xhtml</title>
  </head>
  <body>
    <p>This is text placed inline</p>
    <abbr title="&internal;">Titpaie</abbr>
  </body>
</html>
'''

class root:
    @cherrypy.expose
    def text_html(self):
        cherrypy.response.headerMap['Content-Type'] = "text/html; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def text_xml(self):
        cherrypy.response.headerMap['Content-Type'] = "text/xml; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def app_xml(self):
        cherrypy.response.headerMap['Content-Type'] = "application/xml; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def app_xhtml(self):
        cherrypy.response.headerMap['Content-Type'] = "application/xhtml+xml; charset=utf-8"
        return INTENTITYXHTML

cherrypy.root = root()
cherrypy.config.update({'server.socketPort': 9999})
cherrypy.config.update({'logDebugInfoFilter.on': False})
cherrypy.server.start()

As an example, this code serves up a content type text/html when accessed through a URL such as http://localhost:9999/text_html. You should be able to work out the other URL to content type mappings from the code, even if you're not familiar with CherryPy or Python.

Firefox 1.0.7 handles all this very nicely. For text_xml, app_xml and app_xhtml you get just the XHTML rendering you'd expect, including the correct text in the attribute value with the mouse hovered over "Titpaie".

IE6 (Windows) and Safari 1.3.1 (OS X Panther) both have a lot of trouble with this.

IE6 in the text_xml and app_xml cases complains that it can't find In the app_xhtml case it treats the page as a download, which is reasonable, if not convenient.

Safari in the text_xml, app_xml and app_xhtml cases complains that the entity internal is undefined (??!!).

IE6, Safari and Mozilla in the text_html case all show the same output (looking, as it should, like busted HTML). That's just what you'd expect for a tag soup mode, and emphasizes that you should leave text_html out of your XHTML vocabulary.

All this confusion and implementation difference illustrates the difficulty for folks trying to deploy XHTML, and why it's probably not yet realistic to deploy XHTML without some sort of browser sniffing (perhaps by checking the Accept header, though it's well known that browsers are sometimes dishonest with this header). I understand that the MSIE7 team hopes to address such problems. I don't know whether to expect the same from Safari. My focus in research and experimentation has been on Firefox.
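For what it's worth, the Accept header check might look something like the sketch below. It is a deliberately simplified Python 3 illustration: the function name `choose_content_type` is hypothetical, it only looks for an explicit `application/xhtml+xml` entry with a non-zero quality value, and it ignores wildcards, so it is not a full implementation of HTTP content negotiation.

```python
def choose_content_type(accept_header):
    """Pick a media type for an XHTML page from an HTTP Accept header."""
    for media_range in accept_header.split(','):
        parts = [p.strip() for p in media_range.split(';')]
        if parts[0].lower() != 'application/xhtml+xml':
            continue
        # Look for an explicit quality value; the default is 1.0
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition('=')
            if name.strip().lower() == 'q':
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            return 'application/xhtml+xml; charset=utf-8'
    # Fall back to tag soup for browsers that do not claim XHTML support
    return 'text/html; charset=utf-8'
```

A browser that lists `application/xhtml+xml` gets the XML media type; one that sends only `*/*` gets `text/html`. Given that browsers are not always honest in this header, a real deployment would probably need extra rules on top.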

One final note is that Mozilla does not support external parsed entities. This is legal (and some security experts claim even prudent). The relevant part of the XML 1.0 spec is section 4.4.3:

When an XML processor recognizes a reference to a parsed entity, in order to validate the document, the processor MUST include its replacement text. If the entity is external, and the processor is not attempting to validate the XML document, the processor MAY, but need not, include the entity's replacement text. If a non-validating processor does not include the replacement text, it MUST inform the application that it recognized, but did not read, the entity.

I would love Mozilla to adopt the idea in the next spec paragraph:

Browsers, for example, when encountering an external parsed entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand.

That would be very useful. I wonder whether it would be possible through a Firefox plug-in (probably not: I guess it would require very tight Expat integration for plug-ins).

[Uche Ogbuji]

via Copia

RSS feeds for 4Suite (etc.) mailing lists

Jeremy Kloth set up RSS content feeds for Fourthought-hosted mailing lists, including the 4Suite and EXSLT lists (all on Mailman). The list information page for all the lists has an RSS link in the header, so it should be picked up by most news readers. For convenience, though, here are the main lists and the corresponding feeds:

[Uche Ogbuji]

via Copia

Dare's XLINQ examples in Amara

Dare's examples for XLINQ are interesting. They are certainly more streamlined than the usual C# and Java fare I see, but still a bit clunky compared to what I'm used to in Python. To be fair, a lot of that is down to the C# language, so I'd be interested in seeing what XLINQ looks like from Python.NET or Boo.

The following is my translation from Dare's fragments into corresponding Amara fragments (compatible with the Amara 1.2 branch).

'1. Creating an XML document'

import amara
#Done in 2 chunks just to show the range of options
#Another way would be to start with amara.create_document
skel = '<!--XLinq Contacts XML Example--><?MyApp 123-44-4444?><contacts/>'
doc = amara.parse(skel)
  <name>Patrick Hines</name>
    <street1>123 Main St</street1>
    <city>Mercer Island</city>

'2. Creating an XML element in the "" namespace'

doc.xml_create_element(u'contacts', u'')

'3. Loading an XML element from a file'


'4. Writing out an array of Person objects as an XML file'

persons = {}
persons[u'Patrick Hines'] = [u'206-555-0144', u'425-555-0145']
persons[u'Gretchen Rivas'] = [u'206-555-0163']
for name in persons:
    for phone in persons[name]:
print doc.xml()

'5. Print out all the element nodes that are children of the <contact> element'

for c in contact.xml_child_elements():
    print c.xml()

'6. Print all the <phone> elements that are children of the <contact> element'

for c in contact.xml_xpath(u'phone'):
    print c.xml()

'7. Adding a <phone> element as a child of the <contact> element'


'8. Adding a <phone> element as a sibling of another <phone> element'

mobile = contacts.xml_create_element(u'phone', content=u'206-555-0168')
first =
contacts.xml_insert_after(first, mobile)

'9. Adding an <address> element as a child of the <contact> element'

contacts.xml_append_fragment("""  <address>
    <street1>123 Main St</street1>
    <city>Mercer Island</city>

'10. Deleting all <phone> elements under a <contact> element'

for p in contact.xml_remove_child(p)

'11. Delete all children of the <address> element which is a child of the <contact> element'

'12. Replacing the content of the <phone> element under a <contact> element'

#Not really necessary: just showing how to clear the content = u'425-555-0155'

'13. Alternate technique for replacing the content of the <phone> element under a <contact> element' = u'425-555-0155'

'14. Creating a contact element with attributes multiple phone number types'

#I'm sure it's clear by now how easy this would be with xml_append_fragment
#So here is the more analogous API approach
contact = contacts.xml_create_element(u'contact')
contact.xml_append(contact.xml_create_element(u'name', content=u'Patrick Hines'))
                               attributes={u'type': u'home'},
                               attributes={u'type': u'work'},

'15. Printing the value of the <phone> element whose type attribute has the value "home"'

print u'Home phone is:', contact.xml_xpath(u'phone[@type="home"]')

'16. Deleting the type attribute of the first <phone> element under the <contact> element'


'17. Transforming our original <contacts> element to a new <contacts> element containing a list of <contact> elements whose children are <name> and <phoneNumbers>'

new_contacts = doc.xml_create_element(u'contacts')
for c in
    for p in

'18. Retrieving the names of all the contacts from Washington, sorted alphabetically '

wash_contacts = contacts.xml_xpath(u'contact[address/state="WA"]')
names = [ unicode( for c in ]
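
For comparison, the query in item 18 can be sketched with the standard library's ElementTree instead of Amara. This is a Python 3 illustration only; the sample contacts data is hypothetical, and the filter is written as a list comprehension rather than an XPath predicate, because ElementTree supports only a limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data in the shape of the contacts document
CONTACTS = """<contacts>
  <contact>
    <name>Patrick Hines</name>
    <address><city>Mercer Island</city><state>WA</state></address>
  </contact>
  <contact>
    <name>Gretchen Rivas</name>
    <address><city>Seattle</city><state>WA</state></address>
  </contact>
  <contact>
    <name>Scott MacDonald</name>
    <address><city>Portland</city><state>OR</state></address>
  </contact>
</contacts>"""

contacts = ET.fromstring(CONTACTS)
# Keep only contacts whose address/state child reads "WA"
wash_contacts = [c for c in contacts.findall('contact')
                 if c.findtext('address/state') == 'WA']
names = sorted(c.findtext('name') for c in wash_contacts)
```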

[Uche Ogbuji]

via Copia

Solution: simple XML output "templates" for Amara

A few months ago in "Sane template-like output for Amara" I discussed ideas for making the Amara output API a little bit more competitive with full-blown templating systems such as XSLT, without adopting all the madness of template frameworks.

I just checked in the simplest patch that does the trick. Here is an example from the previous article:

Amara 1.0 code:

person_elem = newdoc.xml_element(
        attributes={u'name': unicode(}

Proposed Amara 1.2 code:

newdoc.xml_append_template("<person name='{}'/>")

What I actually checked into CVS today for Amara 1.2:

newdoc.xml_append_fragment("<person name='%s'/>"

That has the advantage of leaning as much as possible on an existing Python concept (formatted strings). As the method name indicates, this is conceptually no longer a template, but rather a fragment of XML in text form. The magic for Amara is in allowing one to dynamically create XML objects from such fragments. I think this is a unique capability (shared with 4Suite's MarkupWriter) for Python XML output APIs (I have no doubt you'll let me know if I'm wrong).

Also, I think the approach I settled on is best in light of the three "things to ponder" from the older article.

  • Security. Again I'm leaning on a well-known facility of Python, and not introducing any new holes. The original proposal would have opened up possible issues with tainted strings in the template expressions.
  • String or Unicode? I went with strings for the fragments. It's up to the developer to make sure that however he constructs the XML fragment, the result is a plain string and not a Unicode object.
  • separation of model and presentation. There is a very clear separation between Python operations to build a string XML fragment (these are usually the data model objects), and any transforms applied to the resulting XML binding objects (this is usually the separate presentation side). Sure a determined developer can write spaghetti, but I think that with xml_append_fragment it's possible and natural to have a clean separation. With most template systems, this is very hard to achieve.
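
One practical consequence of leaning on formatted strings is that the values substituted into a fragment need to be escaped before parsing. The sketch below shows the idea in Python 3 with the standard library; `append_fragment` is a hypothetical helper built on ElementTree to mimic the general shape of the technique, not Amara's `xml_append_fragment`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def append_fragment(parent, fragment):
    """Parse an XML fragment given as text and append it to parent."""
    element = ET.fromstring(fragment)
    parent.append(element)
    return element


people = ET.Element('people')
name = 'AT&T <labs>'
# quoteattr escapes the value and supplies the surrounding quotes,
# so markup characters in the data cannot break the fragment
person = append_fragment(people, '<person name=%s/>' % quoteattr(name))
```

Without the `quoteattr` call, the ampersand in `name` would make the fragment ill-formed and the parse would fail, which is the safe outcome but not a friendly one.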

One other thing to mention is that the dynamic incorporation of the new fragment into the XML binding makes this a potential building block for pipelined processing architecture.

def process_link(body, href, content):
    body.xml_append_fragment('<a href="%s">%s</a>' % (href, content))
    #Send the "a" element object that was just appended to
    #the next pipeline stage

def check_unique(a_node):
    if not a_node.href in g_link_dict:
        #index the href to the link text (a element text content)
        g_link_dict[a_node.href] = unicode(a_node)

[Uche Ogbuji]

via Copia

Today's XML WTF

via Sam Ruby:

While [REXML] is certainly the most elegant Ruby XML API, it seems to accept a variety of ill-formed XML fragments, for example the following produces no error: [<div>at&t]

F'real? That is, not only missing end tag, but also unescaped ampersand?

It is just not frigging cool to be releasing anything called an XML parser or processor in 2005 that does not reject ill-formed XML. Folks, well-formedness is the entire point of XML. If that's an inconvenient fact for you, please be so kind as to use something other than XML. What is even more galling is this from the REXML home page:

REXML is an XML processor for the language Ruby. REXML is conformant (passes 100% of the Oasis non-validating tests), and includes full XPath support.

On Sam's evidence (and you don't get much more credible than Sam Ruby), this statement is quite false. The OASIS XML 1.0 tests have a whole section covering rejection of non-well-formed documents.
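
As a point of reference, here is what rejection looks like with an Expat-based parser. This is a small Python 3 sketch using the standard library's ElementTree; the helper name `is_well_formed` is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if an Expat-based parser accepts text as XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


results = {
    'missing end tag, bare ampersand': is_well_formed('<div>at&t'),
    'escaped and closed': is_well_formed('<div>at&amp;t</div>'),
}
```

The first input is Sam's example and is reported as an error; the second is the corrected form and parses cleanly.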

Sam goes on to say:

Peeking into the implementation of REXML, I see that it is riddled with regular expressions. Having a parser that doesn’t detect errors properly is one thing, but having a parser that incorrectly parses valid input is quite another. I’ve opened a ticket on one such problem.  Depending on how it is received, I may open others.

OK. Let's hope the REXML folks pay attention to Sam and get things right.

And before Python folks get all smug, it seems that such fast and loose interpretations of what "XML" means are hardly alien to the Python community. Here's a thread on the XML-SIG with a "list of packages handling XML 1.1". Any sensible person would expect these to be XML 1.1 parsers, but no, it turns out that the title is a bit of casuistry, and that at least 2 of the 4 listed packages accept ill-formed XML 1.1. It seems to me that pyparsing, Python's re library, Python's string methods, and any other Python software that does anything with strings should be added to such a list. The only way I could imagine such a list being redeemed is if entries that did not accept well-formed XML 1.1 at least offered warnings of ill-formedness, and could thus serve as tidy-like tools for fixing broken XML. This does not seem to be the case.

As I've said in the past, I don't claim that only XML parsers and processors should be used to work with XML. Heck, I use grep, wc, sed and the usual text tools all the time with XML documents. I do say that it is dishonest to call something an XML parser or processor unless you treat non-compliance as a bug. I guess it's the old social principle all over again. XML is hot, so it's voguish to be called an XML processor, yet it's all so tempting to shirk the required work.

[Uche Ogbuji]

via Copia

Firing SAX events from a DOM tree in 4Suite

One nice thing about the briskly-moving 4Suite documentation project is that it is shining a clear light on places where we need to make the APIs more versatile. Adding convenience parse functions was one earlier result.

Saxlette has the ability to walk a Domlette tree, firing off events to a handler as if from a source document parse. This ability used to be too well hidden, though, and I made an API addition to make it more readily available. This is the new Ft.Xml.Domlette.SaxWalker. The following example should show how easy it is to use:

from Ft.Xml.Domlette import SaxWalker
from Ft.Xml import Parse

XML = ""

class element_counter:
    def startDocument(self):
        self.ecount = 0

    def startElementNS(self, name, qname, attribs):
        self.ecount += 1

#First get a Domlette document node
doc = Parse(XML)
#Then SAX "parse" it
parser = SaxWalker(doc)
handler = element_counter()
parser.setContentHandler(handler)
#You can set any properties or features, or do whatever
#you would to a regular SAX2 parser instance here
parser.parse() #called without any argument
print "Elements counted:", handler.ecount

Again Saxlette and Domlette are fully implemented in C, so you get great performance from the SaxWalker.
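
For readers without 4Suite at hand, the same idea (walking an in-memory tree and firing SAX events at a handler) can be approximated in pure Python. The sketch below is a Python 3 illustration over minidom; the `walk` function is hypothetical, covers only documents, elements and text, and is of course far slower than the C implementation behind SaxWalker.

```python
from xml.dom import minidom, Node
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesNSImpl


def walk(node, handler):
    """Fire SAX2 namespace-aware events for a minidom subtree."""
    if node.nodeType == Node.DOCUMENT_NODE:
        handler.startDocument()
        for child in node.childNodes:
            walk(child, handler)
        handler.endDocument()
    elif node.nodeType == Node.ELEMENT_NODE:
        name = (node.namespaceURI, node.localName)
        attrs, qnames = {}, {}
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            key = (attr.namespaceURI, attr.localName)
            attrs[key] = attr.value
            qnames[key] = attr.name
        handler.startElementNS(name, node.tagName,
                               AttributesNSImpl(attrs, qnames))
        for child in node.childNodes:
            walk(child, handler)
        handler.endElementNS(name, node.tagName)
    elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        handler.characters(node.data)


class ElementCounter(ContentHandler):
    def startDocument(self):
        self.ecount = 0

    def startElementNS(self, name, qname, attrs):
        self.ecount += 1


doc = minidom.parseString('<a><b/><c x="1">text</c></a>')
handler = ElementCounter()
walk(doc, handler)
```

After the walk, `handler.ecount` holds the number of elements in the tree, just as in the SaxWalker example.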

[Uche Ogbuji]

via Copia

Python/XML column #37 (and out): Processing Atom 1.0

"Processing Atom 1.0"

In his final Python-XML column, Uche Ogbuji shows us three ways to process Atom 1.0 feeds in Python. [Sep. 14, 2005]

I show how to parse Atom 1.0 using minidom (for those who want no additional dependencies), Amara Bindery (for those who want an easier API) and Universal Feed Parser (with a quick hack to bring the support in UFP 3.3 up to Atom 1.0). I also show how to use DateUtil and Python 2.3's datetime to process Atom dates.
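
To give a flavor of the minidom approach, here is a short sketch in Python 3 (so not the article's own code). The sample feed and the helper names `text_of` and `parse_atom_date` are assumptions for illustration, and the date parsing handles only the UTC "Z" form without fractional seconds, whereas Atom date constructs also allow numeric offsets.

```python
from datetime import datetime, timezone
from xml.dom import minidom

ATOM_NS = 'http://www.w3.org/2005/Atom'

# Hypothetical sample feed for illustration
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <updated>2005-09-14T18:30:02Z</updated>
  <entry>
    <title>First entry</title>
    <link href="http://example.org/2005/09/14/first"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2005-09-14T18:30:02Z</updated>
  </entry>
</feed>"""


def text_of(parent, local_name):
    """Return the text of the first Atom descendant named local_name."""
    node = parent.getElementsByTagNameNS(ATOM_NS, local_name)[0]
    return ''.join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)


def parse_atom_date(value):
    """Parse the UTC 'Z' form of an Atom date construct."""
    parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ')
    return parsed.replace(tzinfo=timezone.utc)


doc = minidom.parseString(FEED)
entries = []
for entry in doc.getElementsByTagNameNS(ATOM_NS, 'entry'):
    link = entry.getElementsByTagNameNS(ATOM_NS, 'link')[0]
    entries.append({
        'title': text_of(entry, 'title'),
        'href': link.getAttribute('href'),
        'updated': parse_atom_date(text_of(entry, 'updated')),
    })
```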

As the teaser says, we've come to the end of the column in its present form, but it's more of a transition than a termination. From the article:

And with this month's exploration, the Python-XML column has come to an end. After discussions with my editor, I'll replace this column with one with a broader focus. It will cover the intersection of Agile Languages and Web 2.0 technologies. The primary language focus will still be Python, but there will sometimes be coverage of other languages such as Ruby and ECMAScript. I think many of the topics will continue to be of interest to readers of the present column. I look forward to continuing my relationship with the audience.

It is too bad that I won't get to some of the articles that I had in the queue, including coverage of lxml, pygenx, XSLT processing from Python, the role of PEP 342 in XML processing, and more. I can still squeeze some of these topics into the new column, I think, as long as I put an emphasis on the Web. I'll also try to keep up my coverage of news in the Python/XML community here on Copia.

Speaking of such news, I forgot to mention in the column that I'd found an interesting resource from John Shipman.

[F]or my relatively modest needs, I've written a more Pythonic module that uses minidom. Complete documentation, including the code of the module in 'literate programming' style, is at:

The relevant sections start with section 7, "".

[Uche Ogbuji]

via Copia

Live Markdown Compilation via 4XSLT / 4Suite Repository

Related to Uche's recent entry about PyBlosxom + CherryPy, I recently wrote a 4XSLT extension that compiles a 4Suite Repository RawFile (which holds a Markdown document) into an HTML 4.01 document on the fly. I'm using it to host a collaborative Markdown-based Wiki.

The general idea is to allow the Markdown document to reside in the repository and be editable by anyone (or by specific users). The raw content of that document can be viewed with a different URL: . That is the actual location of the file; the previous URL is actually a REST service set up with the 4Suite Server instance running on metacognition, which listens for requests with a leading /markdown and redirects each request to a stylesheet that compiles the content of the file and returns an HTML document.

The relevant section of the server.xml document is below:

         xslt-transform='/extensions/RenderMarkdown.xslt'   />

This makes use of a feature in the 4Suite Repository Server architecture that allows you to register URL patterns to XSLT transformations. In this case, all incoming requests for paths with a leading /markdown are interpreted as a request to execute the stylesheet /extensions/RenderMarkdown.xslt with a top-level path parameter which is the full path to the markdown document (/markdown-documents/RDFInterfaces.txt in this case). For more on these capabilities, see: The architecture of 4Suite Web applications.

The rendering stylesheet is below:

<?xml version="1.0" encoding="UTF-8"?>
        extension-element-prefixes="exsl md fcore"
        exclude-result-prefixes="fcore ftext exsl md xsl">
          doctype-public="-//W3C//DTD HTML 4.01 Transitional//EN" 
        <xsl:param name="path"/>
        <xsl:param name="title"/>
        <xsl:param name="css"/>
        <xsl:template match="/">        
            <title><xsl:value-of select="$title"/></title>         
            <link href="{$css}" type="text/css" rel="stylesheet"/>
            <xsl:copy-of select="md:renderMarkdown(fcore:get-content($path))"/>

This stylesheet makes use of a md:renderMarkdown extension function defined in the Python module below:

from pymarkdown import Markdown
from Ft.Xml.Xslt import XsltElement,ContentInfo,AttributeInfo
from Ft.Xml.XPath import Conversions
from Ft.Xml import Domlette


def RenderMarkdown(context, markDownString):
    dom = Domlette.NonvalidatingReader.parseString(str(rt),"urn:uuid:Blah")
    return [dom.documentElement]

ExtFunctions = {
    (NS, 'renderMarkdown'): RenderMarkdown,
}

Notice that the stylesheet allows for the title and css to be specified as parameters to the original URL.

The Markdown compilation mechanism is none other than the one used by Copia.

For now, the Markdown documents can only be edited remotely by editors that know how to submit content over HTTP via PUT, and that can handle HTTP authentication challenges if met with a 401 for a resource in the repository that isn't publicly available (in this day and age it's a shame there are only a few such editors; the one I use primarily is the Oxygen XML Editor).

I hope to later add a simple HTML-based form for live modification of the markdown documents which should complete the very simple framework for a markdown-based, 4Suite-enabled mini-Wiki.

Chimezie Ogbuji

via Copia