XML recursive directory listing, part 2

In part 1 I started to talk about dueling iterations for the use-case of using Python's os.walk() to emit a nested XML representation of a directory listing. I presented a working but unsatisfactory approach and left off until part 2. Eric Gaumer wasted no time covering one of the key angles, so go read his follow-up.

It's the classic approach of turning recursion into iteration by managing one's own stack, which adds a lot more flexibility at the expense of a bit more opaque code. In this case it's not so bad because there is the old os.path.walk() standby that subsumes the recursive call-back. Eric uses a closure, though he doesn't need to (it's a good choice, though, if just for modularity).
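
For readers who want to see the stack-managing idea spelled out, here is a minimal sketch. It is not Eric's code: it assumes Python 3 and uses the standard library's xml.sax.saxutils.XMLGenerator as a stand-in for a streaming writer, and write_listing is a name made up for illustration.

```python
import os
from xml.sax.saxutils import XMLGenerator


def write_listing(root, out):
    """Stream nested <directory>/<file> elements for root, without recursion."""
    writer = XMLGenerator(out, encoding="utf-8")
    writer.startDocument()
    # The stack holds pending work: open a directory, or close one already opened.
    stack = [("open", root)]
    while stack:
        action, path = stack.pop()
        if action == "close":
            writer.endElement("directory")
            continue
        subdirs, files = [], []
        for name in sorted(os.listdir(path)):
            # Assumption: no symlink cycles, since isdir() follows symlinks.
            if os.path.isdir(os.path.join(path, name)):
                subdirs.append(name)
            else:
                files.append(name)
        writer.startElement("directory", {"name": path})
        for name in files:
            writer.startElement("file", {"name": name})
            writer.endElement("file")
        # Push the close marker first so it is popped after all subdirectories.
        stack.append(("close", path))
        for name in reversed(subdirs):
            stack.append(("open", os.path.join(path, name)))
    writer.endDocument()
```

Calling write_listing("foo", sys.stdout) should stream the listing as it goes. The close marker is what replaces the return from a recursive call.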

Another place to turn for a bit of assistance is the XML API. 4Suite's MarkupWriter is a streaming output API, and so you pretty much have to process the files in the order in which you'll write their output. It would be neat if it supported modes or bookmarks, where you could move a "cursor" around to produce different sections of output. I know some tools in other languages have such facilities, and I've often considered adding these to MarkupWriter, using the power of Python's generators. Maybe this discussion will spur me on to doing so.

But there is also the fall-back of a node-based output API. I discussed the contrast between stream and node-based XML writers in "Proper XML output with new APIs in 4Suite and Amara". The following is equivalent code using Amara:

import os
import sys
from amara import binderytools

root = sys.argv[1]

doc = binderytools.create_document()
name = unicode(root)
doc.xml_append(
    doc.xml_element(u'directory', attributes={u'name': name})
)
dirs = {root: doc.directory}

for cdir, subdirs, files in os.walk(root):
    cdir_elem = dirs[cdir]
    name = unicode(cdir)
    for f in files:
        name = unicode(f)
        cdir_elem.xml_append(
            doc.xml_element(u'file', attributes={u'name': name})
            )
    for subdir in subdirs:
        # Join with cdir, not root, so the key matches what os.walk yields later
        full_subdir = os.path.join(cdir, subdir)
        name = unicode(full_subdir)
        subdir_elem = doc.xml_element(u'directory',
                                      attributes={u'name': name})
        cdir_elem.xml_append(subdir_elem)
        dirs[full_subdir] = subdir_elem

print doc.xml(indent=u"yes")  #Print it

It's not actually as much of a simplification as I'd thought it would be while working it out in my head. It's certainly more linear, but the need to track the mapping from directory name to directory element node adds back the cognitive load saved by eliminating the recursion. Ah well, it's another example.
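
If you don't have Amara handy, the same node-based idea can be sketched with the standard library. This assumes Python 3 and xml.etree.ElementTree in place of Amara, and build_listing is a hypothetical name. The point to notice is that each element is keyed by the full path that os.walk will report for that directory later on.

```python
import os
import xml.etree.ElementTree as ET


def build_listing(root):
    """Return a <directory> element tree mirroring the directory tree at root."""
    root_elem = ET.Element("directory", {"name": root})
    # Map each directory path (as os.walk reports it) to its element node.
    dirs = {root: root_elem}
    for cdir, subdirs, files in os.walk(root):
        cdir_elem = dirs[cdir]
        for f in sorted(files):
            ET.SubElement(cdir_elem, "file", {"name": f})
        # Sorting in place is allowed in top-down mode and fixes the visit order.
        subdirs.sort()
        for subdir in subdirs:
            full_subdir = os.path.join(cdir, subdir)
            dirs[full_subdir] = ET.SubElement(
                cdir_elem, "directory", {"name": full_subdir}
            )
    return root_elem
```

ET.tostring(build_listing("foo"), encoding="unicode") should then give the serialized document.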

Meanwhile, Dave Pawson had taken off with the example from yesterday and turned it into a full-fledged command-line utility, dirlist.py. It's long, so I posted it for download rather than in-line. Dave Pawson has more on his blog. Interesting journey, but thanks to Python, he was happy with the result.

[Uche Ogbuji]

via Copia

4Suite 1.0b1 via yum?

Dave Pawson was asking how to grab 4Suite using yum. I have yet to post a follow-up based on Dave's earlier question (thanks to Eric Gaumer for carrying on the thread in some of the direction I'd planned), and I'll try to get back to that topic today. Anyway, Dave and I weren't really successful getting 4Suite 1.0b1 via yum. I'm posting here for reference to our journey, and in the hopes that someone can help.

I use apt rather than yum, so I had to remember the right yum mojo again, but I started by looking at what I had on my system:

# rpm -q 4Suite
4Suite-1.0-3

OK. That's odd. 4Suite 1.0 is still in beta, so that's a strange version number. So I found out the real version number:

# rpm -ql 4Suite | grep "Xml/__packageInfo__.py$" | xargs grep "^version"
version = '1.0a3'

Ah. I see now. They omitted the "a" part. Well, it's one 4Suite release behind—not bad, but there are so many improvements in 4Suite 1.0b1 that you should really get the latest.

I went looking on Google and found a promising candidate, 4Suite-1.0-8.b1.i386. This looks like it's in fedora-devel, so I tried looking at how to add that repository. I found help on aaltonen.us, where you can find the following yum repo spec:

[development] 
name=Fedora Core $releasever - Development Tree
#baseurl=http://download.fedora.redhat.com/pub/fedora/linux/core/development/$basearch/
mirrorlist=http://fedora.redhat.com/download/mirrors/fedora-core-rawhide
enabled=1
gpgcheck=1

I handed this off to Dave to try out (turned out the magic incantation is yum install 4Suite.i386). But the resulting chain of dependencies was way too far out on the bleeding edge. Dave was seeing updates to the likes of "perl, python, libxml, mysql kde, gnome, k3b the list goes on!":

I can't see that this is a true dependency from 4suite Uche?
Error: Missing Dependency: libdb_cxx-4.2.so is needed by package openoffice.org-libs
Error: Missing Dependency: libedataserver.so.3 is needed by package openoffice.org
Error: Missing Dependency: libebook.so.8 is needed by package openoffice.org
Error: Missing Dependency: gcc = 3.4.3-22.fc3 is needed by package gcc-g77

Oops. Ouch. The problem with the RPMs seems to be that Fedora Core is still testing the transition from 4Suite 1.0a3 to 1.0b1, and that's quite understandable. I look forward to seeing the more recent version in Fedora Core base.

At this point I advised David to ditch yum, just use the .src.rpm from the official 4Suite download and use rpmbuild to make himself a package. That also turned out to be a dead end: the spec file in the 1.0b1 release appears to be borked. Our fault. Ay ay ay. One of those days. I'll make sure it's fixed before the next release.

In the end Dave installed 4Suite from source, using "setup.py install", and all was well. I should have just told him to do that from the start.

Meanwhile, some notes from the fedora-devel 4Suite-1.0-8.b1 RPM.

The description is way out of date. I think it's 2 years old or more. For one thing 4Suite hasn't included 4DOM in aeons. I suggest the Fedora maintainers take the description from 4Suite.org.

Also, it requires "PyXML >= 0.7", but we dropped that requirement in the 4Suite 1.0b1 release.

Finally, it says "python-abi=2.4" is required. I suppose that might be FC3 maintainer preference, but I did want to mention that Python 2.2.3 is sufficient (though we do recommend 2.3.5).

[Uche Ogbuji]

via Copia

Python/XML community:

xmldiff 0.6.7
Picket

Xmldiff is a utility for extracting differences between two xml files. It returns a set of primitives to apply on source tree to obtain the destination tree.

LogiLab's Xmldiff is interesting for several reasons, including the fact that it uses XUpdate to represent the XML differences. You can then use 4Suite's command-line XUpdate tool (or any other tool you like) to "patch" XML files with the diff. See Sylvain Thénault's announcement.

Picket is a CherryPy XSLT filter developed by Sylvain Hellegouarch.

The Picket filter is a simple CherryPy filter for processing XSLT as a template language. It uses 4Suite to do the job.

Nice. Preliminary inspection seems to recommend it as a good example of 4XSLT in server architecture in general. It makes good use of the API, and even implements processor object pooling (helps performance). As the CherryPy tutorial says,

A filter is an object that has a chance to work on a request as it goes through the usual CherryPy processing chain.

[Uche Ogbuji]

via Copia

XML recursive directory listing, part 1

Dave Pawson asked for help with using Python's os.walk() to emit a nested XML representation of a directory listing. The semantics of os.walk make this a bit awkward, and I have a good deal to say on the matter, but I first wanted to post some code for David and others with such a need before diving into fuller discussion of the matter. Here's the code.

import os
import sys

root = sys.argv[1]

from Ft.Xml import MarkupWriter
writer = MarkupWriter(indent=u"yes")

def recurse_dir(path):
    for cdir, subdirs, files in os.walk(path):
        writer.startElement(u'directory', attributes={u'name': unicode(cdir)})
        for f in files:
            writer.simpleElement(u'file', attributes={u'name': unicode(f)})
        for subdir in subdirs:
            recurse_dir(os.path.join(cdir, subdir))
        writer.endElement(u'directory')
        break

writer.startDocument()
recurse_dir(root)
writer.endDocument()

Save it as dirwalker.py or whatever. The following is sample usage (in UNIXese):

$ mkdir foo
$ mkdir foo/bar
$ touch foo/a.txt
$ touch foo/b.txt
$ touch foo/bar/c.txt
$ touch foo/bar/d.txt
$ python dirwalker.py foo/
<?xml version="1.0" encoding="UTF-8"?>
<directory name="foo/">
  <file name="a.txt"/>
  <file name="b.txt"/>
  <directory name="foo/bar">
    <file name="c.txt"/>
    <file name="d.txt"/>
  </directory>
</directory>
$ rm -rf foo
$

Notice that the code is really preempting the recursiveness of os.walk in order to impose its own recursion. This is the touchy issue I want to expand on. Check in later on today...
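
To make the preemption explicit, here is a small sketch of the same recursion where the one-level listing is taken with next() instead of a for loop plus break. It assumes Python 3, with the standard library's XMLGenerator standing in for MarkupWriter; the sorting is my addition to get a stable order.

```python
import os
from xml.sax.saxutils import XMLGenerator


def recurse_dir(path, writer):
    """Emit one <directory> element for path, recursing into subdirectories."""
    # os.walk would descend on its own; taking only its first triple gives
    # just the listing of `path`, so the recursion below stays in our hands.
    first = next(os.walk(path), None)
    if first is None:
        # Nothing to list (for example, the path does not exist).
        return
    cdir, subdirs, files = first
    writer.startElement("directory", {"name": cdir})
    for f in sorted(files):
        writer.startElement("file", {"name": f})
        writer.endElement("file")
    for subdir in sorted(subdirs):
        recurse_dir(os.path.join(cdir, subdir), writer)
    writer.endElement("directory")
```

A caller would create the writer with XMLGenerator(sys.stdout, encoding="utf-8") and wrap the recurse_dir call in startDocument() and endDocument(), much as the original does.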

[Uche Ogbuji]

via Copia

A couple of Amara/CherryPy Demos

As I've mentioned, I've been playing with Amara/4Suite and CherryPy. Luis Miguel Morillas has been as well. We're both taking things slowly, pursuing it from different angles.

Luis has a "Web-based docbook browser and processor using CherryPy and Amara". It's a very simple script for rendering as Web content an index and chapters of Mark Pilgrim's Dive into Python book as XML and XML+CSS (which seems to be creeping into the mainstream?).

I also have a demo as part of Amara, cherrypy-xml-inspector.py, which allows you to "inspect" an XML document through a Web form using CherryPy and Amara. You can load any document off the Web and then enter an Amara expression, such as "doc.html.head.title", and get the result.

[Uche Ogbuji]

via Copia

Some 4Suite repository extensions

If you just want to try out some handy XSLT extension modules for 4Suite's repository and skip all the blather, just scroll to the bottom of this item...

Akara is an extensible information gathering and presentation framework implemented in 4Suite.

As I describe it on the site:

In simple terms, you put notes into Akara (like a notepad). You put FAQ entries in (like a FAQ wizard). You put links and comments on those links (like a Web log or bookmark manager). You put discussion logs in (like mailing list archives and instant messaging logs). You put code examples, articles, proposals, specifications, stories and reviews in (like a content manager). You put it all where it's convenient for the moment (like a Wiki). You can later on reorganize things relatively easily (like, ummm... like what?). You can see an example of Akara in action on my Akara site on XML processing in Python.

I never really got it mature enough for release, in part because it's the project that finally left me gob-smacked with the sense that although 4Suite's core libraries are super-useful, the server framework is rather rickety and could do with a lot less wheel reinvention (I've discussed this matter with regard to my recent advocacy of CherryPy as a protocol server backbone for 4Suite after 1.0).

Anyways I'm rebuilding Akara to be a proof of concept of 4Suite repository/CherryPy integration. It's going slowly due to workload, and since many of the Akara XSLT extension modules are useful independently from Akara, I'm posting them here for now. They are:

  • cachetool.py—an extension for caching results of common and slow XSLT templates. I use this heavily to cache the XML results of Versa queries in 4Suite. It stores and manages the caches as XML resources in the repository, with a given time-to-live. There is also a method to invalidate a cached value.
  • calwidget.py—a widget that inserts an XHTML calendar into the XSLT output
  • emailftext.py—a widget for reading UNIX mailboxes and using XSLT dispatch to process the items, and to send messages. Not vetted for security
  • feedtools.py—an extension for RSS aggregation. Uses Mark Pilgrim's Universal Feed Parser to read a list of feeds given by URL and then write the result to the XSLT output as a consolidated RSS 1.0 feed. You probably want to use this together with cachetool.py so it's not retrieving feeds on every request.
  • akaraftext.py—parses Akara markup (a wiki-like language) and inserts XHTML into the output stream
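
To give a feel for the time-to-live idea behind cachetool.py, here is a minimal sketch in plain Python 3. It is not the extension's code: cachetool.py keeps its caches as XML resources in the repository, while this sketch uses an in-memory dict, and TTLCache and its methods are names made up for illustration.

```python
import time


class TTLCache:
    """Cache computed values for a fixed time-to-live, with manual invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expiry time, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() if absent or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop the cached value for key, if any."""
        self._entries.pop(key, None)
```

The slow operation (a Versa query, say, or a round of feed retrieval) goes in as the compute callable, so it only runs when the cached copy is missing or stale.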

[Uche Ogbuji]

via Copia

CherryPy 2.0

CherryPy

After several months of hard work the first stable release of CherryPy2 is finally available. Downloads are available here and the ChangeLog can be viewed here.

Remi Delon announced the 2.0 release of CherryPy. It's my favorite entry in the Python Web frameworks sweepstakes. It's very simple to learn and use, and it just makes sense. Very few surprising conventions. My own endorsement is among the many testimonials CherryPy has picked up.

I'm also pulling for CherryPy to form the heart of the protocol server for the next generation of 4Suite. As I said on the CherryPy discussion board:

I have a nefarious agenda: I regret our having reinvented some wheels in 4Suite, and most especially the Web framework wheel. To be fair to us, the likes of CherryPy were not available at the time and it was pretty much Zope, Webware, mod_python or bust, and we didn't like any of those options. But now we're saddled with really not-that-great re-implementations of HTTP [server] framework, session management, etc, all too tightly coupled into the XML database for my liking. I'd like to move to a more open architecture that decouples core XML libraries from XML DBMS from protocol framework (with CherryPy ideally as the latter). That way, [someone] could get CherryPy, and if they liked, a simple XML processing plug in, and if they liked, an XML DB plug in, and so on. If I can get [something] working sweet as sugar with CherryPy, I bet I could convince my fellow 4Suite developers to leave the Web frameworks to the dedicated Web frameworks projects.

I've been plugging slowly away with these ideas, but it's been hard to get to it with all the other items in the work queue. Perhaps this announcement will spur me to get something into shape.

[Uche Ogbuji]

via Copia

XsltRenderer for PyBlosxom

XsltRenderer.py

The XSLT renderer is a renderer that takes any output from the default renderer (i.e. after processing using flavors), and applies an XSLT, using the result. If you have handy XSLTs that start with XML vocabularies such as Atom or XHTML, you can create such output just as you currently do in PyBlosxom, by applying your flavor of choice. You can then use the XSLTs you have to produce additional types of output quite easily.

I used this to add an RSS 1.0 feed here on Copia. It starts with the Atom flavor, and uses Antonio Cavedoni's XSLT to translate the Atom to RSS 1.0. Taking this approach in general made things a lot easier for me. I went from nothing to RSS 1.0 feed in about two hours of hacking, which included a lot of learning about the plug-in API. Based on the hassles that other PyBlosxom feed projects seemed to be working through, I think the detour through XSLT was quite valuable. It made it easier to lean on the shoulders of others outside the PyBlosxom community.

Anyway some other sample uses:

  • A very simple "plain text" feed. Start with the Atom flavor and then use a tiny (5 lines or so) XSLT to strip all tags
  • An alternative, image-free view. Start with a flavor that generates XHTML, then apply a tiny (5 lines or so) XSLT to strip all images

There is one configuration option for this plug-in: xslt_trigger_suffix. See the module header doc for more info.

[Uche Ogbuji]

via Copia

Notes on Porting an XSLT/HTML Application to XForms/SOAP

Motivation

Several years ago, I wrote a front-end to 4Suite that fulfilled the following requirements:

  • Default Content management
  • System management
  • Ad hoc RDF Query interface

It was to be written as a set of XSLT stylesheets which generated HTML pages composed mostly of HTML forms. The 4Suite repository consists of user-specified content (XML/RDF and otherwise) as well as system resources: XML documents that provide repository functionality such as user management, servers, XML-to-RDF mapping, etc. The idea was to build a special user interface for managing these system documents. The less satisfactory alternative is to modify the raw XML, which requires an understanding of the structure of these documents as they are being modified and introduces the possibility of creating invalid documents that break their expected functionality.

At the time, I thought that XSLT alone was the perfect means to do this because of the whole slew of extensions available for managing resources in the underlying 4Suite repository. Mike Brown wrote a very concise overview of what the 4Suite repository is, available here. There is also a useful overview on the architecture of 4Suite's XSLT-based web application framework.

In the end, this project (called the 4Suite Dashboard) became very difficult to maintain because of the spaghetti-like nature of the XSLT. There are two factors in this:

  • At the time when I wrote it, I was less adept at using XSLT to its strengths
  • The cumbersome, ad-hoc processing of form data - which was the primary component of the user interface

As a result, it has slowly lagged behind the rest of 4Suite and is essentially unusable because of the inordinate amount of effort that would be necessary to refactor it to a more maintainable state. Motivated mostly by the great success we have had in the Cleveland Clinic Foundation with research regarding the use of XForms as the primary means of serving user-interfaces to a semi-structured, metadata-driven database, I decided to port the old Dashboard code to an XForms/SOAP-based solution.

The XForm 'sweet-spot'

The primary motivating factor was the idea that with XForms you kill several birds with one stone:

  1. You move a majority of XML processing to the client
  2. You reduce the complexity of request processing by piggy-backing on the XForms approach to Web applications
  3. You get an overall cleaner architecture by separating what the user sees from the actual processing (by the application) of the user's actions within the user-interface
  4. Session management: SOAP messages can be submitted within an existing session.

The result was a cleaner, leaner application that was much easier to implement, given my better appreciation for XSLT as the framework for an application as well as my familiarity with XForms. Below is a high-level diagram of the main components:

XDashboard Architecture

Secure, remote service invocations

One of the goals of the port was to demonstrate the submission of session-managed SOAP messages. By having a session created at the server when a request to manage a resource came, the session id can be passed along with the resulting XForm so all subsequent service requests will authenticate at the repository using this session id (generated at the server). Since the session is specific to the 4Suite user that requested the XDashboard screen (an HTTP authentication request is sent when the application is originally loaded, requiring a valid 4Suite user to enter their credentials), service requests on resources not available to the user will fail with an appropriate SOAP Fault detailing the server-side security violation.

Base64 encoded XML content from an XForm

The other interesting thing I was able to demonstrate was the usefulness of submitting XML strings as base64 encoded content via SOAP. One of the primary arguments against SOAP as a remote procedure protocol is its use of a verbose syntax (XML) as the medium for communication. Now imagine a SOAP message whose purpose is to modify the content of an existing XML resource. The instinctive first solution would probably be to submit the XML document as a fragment within the SOAP envelope like so:

[SOAP:Envelope]
   [SOAP:Body]
     [foo:setContent]
       [path] .. path to document [/path]
       [src]... new document as a fragment ...[/src]
     [/foo:setContent]
   [/SOAP:Body]
[/SOAP:Envelope]

But imagine the extra processing that the SOAP endpoint must contend with when you consider the SOAP message as a whole. 4Suite's SOAP server allows content to be submitted as plain text or as Base64 encoded content. In addition, the XForm's upload component is restricted to nodes with the following datatypes:

  • xsd:anyURI
  • xsd:base64Binary
  • xsd:hexBinary

The result of this is that for nodes bound to xsd:base64Binary, an XForms processor is responsible for Base64 encoding data selected via the xforms:upload component for submission, which simplifies the problem for the case where the XML content you wish to submit is uploaded from a file on the local filesystem of the client. However, the previous dashboard allowed XML content for an arbitrary resource in the repository to be submitted from a textarea. In the XForms scenario, this created the requirement of having a JavaScript function do this encoding explicitly and bind the encoding to text collected from a textarea.
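
As a rough illustration of the Base64 approach, outside of XForms entirely, here is a Python 3 sketch that wraps an XML string as Base64 text inside a setContent message and unwraps it again. The envelope namespace is the SOAP 1.1 one; the foo namespace URI and both function names are placeholders I made up, and this is not the 4Suite SOAP server's actual interface.

```python
import base64
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
FOO_NS = "http://example.com/foo"  # hypothetical namespace for the service


def build_set_content(path, xml_text):
    """Return a SOAP message carrying xml_text as Base64 text, not as a fragment."""
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    call = ET.SubElement(body, "{%s}setContent" % FOO_NS)
    ET.SubElement(call, "path").text = path
    encoded = base64.b64encode(xml_text.encode("utf-8")).decode("ascii")
    ET.SubElement(call, "src").text = encoded
    return ET.tostring(envelope, encoding="unicode")


def extract_content(message):
    """Return (path, xml_text) from a message built by build_set_content."""
    envelope = ET.fromstring(message)
    call = envelope.find("{%s}Body/{%s}setContent" % (SOAP_NS, FOO_NS))
    decoded = base64.b64decode(call.findtext("src")).decode("utf-8")
    return call.findtext("path"), decoded
```

The endpoint's parser only ever sees an opaque run of Base64 characters in src, so the submitted document is not parsed as part of the envelope.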

Ironically, at the time when I was dealing with this problem there was an ongoing thread in the W3C's www-forms list about the ins/outs of encoding XML content as strings for submission from an XForm.

The XDashboard Services

The following is the list of services setup and used by the application (with an accompanying description of what each does):

  • addAcl (Add an acl identifier to the acl key. If the specified key does not exist on the resource this function has the same functionality of setAcl)
  • createContainer (Creates a 4Suite repository Container)
  • createDocument (Creates a document with the given document definition name, path, type, and source. if the type is not specified it will attempt to infer the type based on IMT and the src)
  • createRawFile (Creates a raw file resource, using the specified, path, internet media type, and source string.)
  • delete (Delete this object from the repository)
  • fetchResource (Fetch a resource from the system. path is a string)
  • getContent (Get the string content of this resource)
  • removeAcl (Remove the acl identifier from the acl key. If the specified aclKey is not present on the resource or the specified aclIdent is not in the key, then do nothing. Remember that entries inherited from the parent are not included in the acl on the actual resource. In order to override inherited permissions, set to no access rather than trying to remove)
  • setAcl (Replace the aclKey with a single entry of aclIdent)
  • setContent (Set the string content of this resource)
  • setImt (Sets the Internet Media Type of the raw file resource)
  • setPassword (Change the password of the specified user to the SHA-1 hashed value)
  • setServerRunning (change the state of a 4Suite repository server to running/stopped. This operation is executed as an XUpdate at the server-side modifying the Server document to reflect the correct state. The XUpdate document is serialized from the client and submitted to the repository).
  • xUpdate (Allows XML content to be updated with the XUpdate protocol. updateSrc is a string that represents the XUpdate to be applied to the resource. extraNss is a dict of additional namespace bindings to pass to the XUpdate processor, if necessary)

Compliance Notes

As FormsPlayer is probably the most mature of all the XForms processors available (the list is growing), it was the targeted XForms processor for this application. For the most part, this doesn't introduce any issues of non-compliance with XForms as everything was done using mostly XForms 1.0, a little XForms 1.1 (xforms:duplicate action, primarily), and 2 FormsPlayer specific capabilities.

It must be mentioned, however, that FormsPlayer is an Internet Explorer plugin solution to XForms. The tradeoff, essentially, is browser compatibility for the full complement of XForms functionality that comes with FormsPlayer. Below is a briefing of the deviations from pure XForms 1.0:

XForms 1.1 constructs

xforms:duplicate was used for the copying of nodes from a source to an origin

Forms Player specific capabilities

  • xforms:setvalue was used for deep-copying nodes to a target location without destroying existing childnodes (the limitation of its XForms 1.0 counterpart: xforms:copy)
  • fp:serialise function was used to facilitate the retrieval of a node's XML representation in order to Base64 encode for submission
  • fp:HTMLserialise was used for debugging purposes
  • Forms Player's inline capability was used in order to access external javascript functions from within XPath expressions

Resources

This application relies on the most recent version of 4Suite's SOAP server (can be retrieved from CVS). A listing of the most recent version of this SOAP server can be found here. The XDashboard is bundled as a 4Suite repository application and so must be installed to a running repository using the 4ss install command. It should be sufficient to unpack the tar / zip ball and run 4ss install against the setup.xml file.

The XDashboard application can be downloaded as a tgz archive or zip archive from:

Bear in mind, this application is a proof of concept / demo, so it's likely to have undiscovered typos/bugs.

This demo/application makes use of and refers to the following third-party resources:

  • Mike Brown and Jeni Tennison's tree-view.xslt - For rendering a decorated view of an XML document
  • Micah Dubinko's XForms Relax NG schema
  • MSXML's default IE XML DHTML stylesheet [see]: - For rendering a decorated view of an XML document
  • xslt2xform's 'Powered by XForm' logo - Pending a winner to Micah's logo contest

[Chimezie Ogbuji]

via Copia

Copia gets RSS 1.0, courtesy XSLT

I added RSS 1.0 feeds to Copia yesterday. It was a fun Sunday evening hack project, and even though parts of PyBlosxom still make me scratch my head, I'm in even more awe of its hackability.

It seems RSS 1.0 has always been a sketchy area for PyBlosxom. Because RSS 1.0 essentially needs 2 modes, the item list and the item details, you can't emit it linearly, and so you can't use a PyBlosxom flavor, as with RSS 0.91 or Atom. I found some discussion of RSS 1.0 from time to time in the archives, including this Perl port, but nothing readily usable, so I had to roll my own.
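
To show what I mean by two modes, here is a simplified Python 3 sketch that builds an RSS 1.0 skeleton with the standard library's ElementTree. It is not PyBlosxom or Copia code, build_rss10 is a made-up name, and the element layout is pared down to the bare minimum. The thing to notice is that the entry list is walked twice: once for the table of contents inside the channel, and once for the item details.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"

# Register readable prefixes (this affects ElementTree serialization globally).
ET.register_namespace("rdf", RDF)
ET.register_namespace("rss", RSS)


def rdf(name):
    return "{%s}%s" % (RDF, name)


def rss(name):
    return "{%s}%s" % (RSS, name)


def build_rss10(channel_url, title, entries):
    """Build a minimal RSS 1.0 document; entries is a list of (url, title) pairs."""
    root = ET.Element(rdf("RDF"))
    channel = ET.SubElement(root, rss("channel"), {rdf("about"): channel_url})
    ET.SubElement(channel, rss("title")).text = title
    ET.SubElement(channel, rss("link")).text = channel_url
    ET.SubElement(channel, rss("description")).text = title
    seq = ET.SubElement(ET.SubElement(channel, rss("items")), rdf("Seq"))
    # Mode 1: the item list, a table of contents inside the channel.
    for url, _ in entries:
        ET.SubElement(seq, rdf("li"), {rdf("resource"): url})
    # Mode 2: the item details, as siblings of the channel.
    for url, entry_title in entries:
        item = ET.SubElement(root, rss("item"), {rdf("about"): url})
        ET.SubElement(item, rss("title")).text = entry_title
        ET.SubElement(item, rss("link")).text = url
    return ET.tostring(root, encoding="unicode")
```

A template that sees each entry exactly once, in order, has no natural place to put that first pass.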

But what does an XML head do when faced with such a task? He runs a simple equation: existing atom flavor for PyBlosxom + plenty of Atom to RSS 1.0 XSLT transforms out there = decision to implement an XsltRenderer for PyBlosxom. The XsltRenderer can take a flavor's output and run it through an XSLT to produce the final output.

More on that in a follow-up item, but for now, I've added the whole site feed to the feed discovery convention in the HTML headers, and also to the right hand listing. There are also topic-specific feeds for all keywords.

So, for example, here is the RSS 1.0 Python feed, and here is the XML Atom feed. Note: we use lowercase for keywords on Copia.

[Uche Ogbuji]

via Copia