Updating Metacognition Software Stack

Metacognition was down for some time as I updated the software stack that it runs on (namely 4Suite and RDFLib). There were some core changes:

  • 4Suite repository was reinitialized using the more recent persistence drivers.
  • I added a separate section for archived publications (I use Google Reader's label service for Copia archives)
  • I switched to using RDFLib as the primary RDF store (using a recently written mapping between N3/FOL and a highly efficient SQL schema) and the filesystem for everything else
  • I added the SIOC ontology to the core index of ontologies
  • Updated my FOAF graph and Emeka's DOAP graph

Earlier, I wrote up the mapping that RDFLib's FOPLRelationalModel was based on in a formal notation (using MathML, so it will only be viewable in a browser that supports it, such as Firefox).

Overall, it's considerably more responsive and better organized. Generally, I hope for it to serve two purposes: 1) to organize my thoughts on, and software related to, applying Semantic Web Technologies to 'closed' content management systems, and 2) to serve as a (markdown-powered) whiteboard and playground for tools and demos for advocacy of best practices in problem solving with these technologies.

Below is a marked-up diagram of some of these ideas.

[Diagram: Metacognition roadmap]

The publications are stored in a single XML file and are rendered (at request time on the server) using a pre-compiled XSLT stylesheet against a cached Domlette. Internally, the document is mapped into RDF persistence via an XSLT document definition associated with it, so all modifications are synced into an RDF/XML equivalent.
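
As a rough illustration of that rendering path (a sketch under my own assumptions, not Metacognition's actual code; the file names are placeholders), the 4Suite idiom looks something like this:

# Sketch: render a cached Domlette document with a pre-compiled XSLT
# processor, roughly the pattern described above. 'publications.xml' and
# 'publications.xslt' are placeholder names.
from Ft.Lib import Uri
from Ft.Xml import InputSource
from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Xml.Xslt import Processor

SOURCE_URI = Uri.OsPathToUri('publications.xml')

# Done once: parse the publications document and prime the processor
# with the compiled stylesheet.
cached_doc = NonvalidatingReader.parseUri(SOURCE_URI)
processor = Processor.Processor()
processor.appendStylesheet(
    InputSource.DefaultFactory.fromUri(Uri.OsPathToUri('publications.xslt')))

# Done per request: transform the already-parsed (cached) Domlette instance.
html = processor.runNode(cached_doc, SOURCE_URI)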

Mostly as an academic exercise - since the 4Suite repository (currently) doesn't support document definitions that output N3 and the content-type of XSLT transform responses is limited to HTML/XML/Text - I wrote an equivalent publications-to-n3.xslt. The output is here.

Chimezie Ogbuji

via Copia

"Tip: Remove sensitive content from your XML samples with XSLT"

"Tip: Remove sensitive content from your XML samples with XSLT"

Do you need to share samples of your XML code, but can't disclose the data? For example, you might need to post a sample of your XML code with a question in order to get advice on a problem. In this tip, Uche Ogbuji shows how to use XSLT to remove sensitive content while retaining the basic XML structure.

I limited this article to erasing rather than obfuscating sensitive content, which can be done with XSLT 1.0 alone. With EXSLT (or even XSLT 2.0) you can do some degree of obfuscation, allowing you to possibly preserve elements of character data that are important to the problem under discussion. Honestly, though, I prefer to solve this problem with even more flexible tools. As a bonus, the following is a bit of 4Suite/SAX code that uses a SAX filter to obfuscate character data by adding a random shift to the ordinal of each character in the Unicode alphanumeric class. This way, if exotic characters were part of the problem you're demonstrating, they'd be left alone. It's easy to use the code as a template; usually all you have to change is the obfuscate function or the obfuscate_filter class to fine-tune the workings.

import re
import random
from xml.sax import make_parser, saxutils
from Ft.Xml import CreateInputSource, Sax

RANDOM_AMP = 15
ALPHANUM_PAT = re.compile('\w', re.UNICODE)

def obfuscate(old):
    def mutate(c):
        return unichr(ord(c.group())+random.randint(-RANDOM_AMP,RANDOM_AMP))
    return ALPHANUM_PAT.subn(mutate, old)[0]

class obfuscate_filter(saxutils.XMLFilterBase):
    def characters(self, content):
        saxutils.XMLFilterBase.characters(self, obfuscate(content))
        return

if __name__ == "__main__":
    XML = "http://cvs.4suite.org/viewcvs/*checkout*/Amara/demo/labels1.xml"
    parser = make_parser(['Ft.Xml.Sax'])
    filtered_parser = obfuscate_filter(parser)
    handler = Sax.SaxPrinter()
    filtered_parser.setContentHandler(handler)
    filtered_parser.parse(CreateInputSource(XML))

This code uses recent fixes and capabilities I checked into 4Suite CVS last week. I think all the needed details to understand the code are in the SAX section of the updated 4Suite docs, which John Clark has already posted.

[Uche Ogbuji]

via Copia

Optimizing XML to RDF mappings for Content Management Persistence

I recently refactored the 4Suite repository's persistence layer to make it more responsive with large sets of data. The 4Suite repository's persistence stack – which consists of a set of core APIs for the various first-class resources – is the heart and soul of a framework that leverages XML and RDF in tandem as a platform for content management. Essentially, the changes minimized the number of redundant RDF statements mirrored into the system graph (an RDF graph where provenance statements about resources in the virtual filesystem are persisted) from the metadata XML documents associated with every resource in the repository.

The ability to mirror RDF content from XML documents in a controlled manner is core to the repository and the way it manages its virtual filesystem. This mapping is made possible by a mechanism called document definitions. Document definitions are mappings (persisted as documents in the 4Suite repository) from controlled XML vocabularies to corresponding RDF statements. Every resource has a small 'metadata' XML document associated with it that captures ACL data as well as the system-level provenance typically associated with filesystems.

For example, the metadata document for the root container of the 4Suite instance running on my laptop is:

<?xml version="1.0" encoding="utf-8"?>
<ftss:MetaData 
  xmlns:ftss="http://xmlns.4suite.org/reserved" 
  path="/" 
  document-definition="http://schemas.4suite.org/4ss#xmldocument.null_document_definition"   
  type="http://schemas.4suite.org/4ss#container" creation-date="2006-03-26T00:35:02Z">
  <ftss:Acl>
    <ftss:Access ident="owner" type="execute" allowed="1"/>  
    <ftss:Access ident="world" type="execute" allowed="1"/> 
    <ftss:Access ident="super-users" type="execute" allowed="1"/>  
    <ftss:Access ident="owner" type="read" allowed="1"/>
    <ftss:Access ident="world" type="read" allowed="1"/>    
    <ftss:Access ident="super-users" type="read" allowed="1"/>  
    <ftss:Access ident="owner" type="write user model" allowed="1"/>
    <ftss:Access ident="super-users" type="write user model" allowed="1"/>  
    <ftss:Access ident="owner" type="change permissions" allowed="1"/>  
    <ftss:Access ident="super-users" type="change permissions" allowed="1"/>
    <ftss:Access ident="owner" type="write" allowed="1"/> 
    <ftss:Access ident="super-users" type="write" allowed="1"/> 
    <ftss:Access ident="owner" type="change owner" allowed="1"/> 
    <ftss:Access ident="super-users" type="change owner" allowed="1"/>
    <ftss:Access ident="owner" type="delete" allowed="1"/>
    <ftss:Access ident="super-users" type="delete" allowed="1"/>
  </ftss:Acl>
  <ftss:LastModifiedDate>2006-03-26T00:36:51Z</ftss:LastModifiedDate>
  <ftss:Owner>super-users</ftss:Owner>
  <ftss:Imt>text/xml</ftss:Imt>
  <ftss:Size>419</ftss:Size>
</ftss:MetaData>

Each ftss:Access element under ftss:Acl represents an entry in the ACL associated with the resource the metadata document is describing. All the ACL accesses enforced by the persistence layer are documented here.

Certain metadata are not reflected into RDF, either because they are queried more often than others and require a prompt response, or because they are never queried separately from the resource they describe. In either case, querying a small XML document (associated with an already identified resource) is much more efficient than dispatching a query against an RDF graph in which statements about every resource in the repository are asserted.

ACLs are an example; they are persisted only as XML content. The persistence layer interprets and performs ACL operations against the XML content via XPath/XUpdate evaluations.
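
For instance, a read-access check against the metadata document above can be a single XPath evaluation. The following is a hedged sketch of that idea (not the repository's actual implementation; metadata.xml stands in for the cached metadata document):

# Sketch: an ACL check as an XPath evaluation over a metadata document
# like the one shown above; the repository's real code differs in detail.
from Ft.Lib import Uri
from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Xml.XPath import Compile, Context

FTSS_NS = 'http://xmlns.4suite.org/reserved'

HAS_WORLD_READ = Compile("boolean(/ftss:MetaData/ftss:Acl/ftss:Access"
                         "[@ident='world'][@type='read'][@allowed='1'])")

doc = NonvalidatingReader.parseUri(Uri.OsPathToUri('metadata.xml'))
context = Context.Context(doc, processorNss={'ftss': FTSS_NS})
print HAS_WORLD_READ.evaluate(context)   # true for the document above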

Prior to the change, all of the other properties embedded in the metadata document (listed below) were being reflected into RDF redundantly and inefficiently:

  • @type
  • @creation-date
  • @document-definition
  • ftss:LastModifiedDate
  • ftss:Imt
  • ftss:Size
  • ftss:Owner
  • ftss:TimeToLive

Not too long ago, I hacked up an OWL ontology describing these system-level RDF statements (and wrote a piece on it).

Most of the inefficiency was due to the fact that a pre-parsed Domlette instance of the metadata document for each resource was already being cached by the persistence layer. However, the corresponding APIs for these properties (getLastModifiedDate, for example) were implemented as queries against the mirrored RDF content. Modifying these methods to evaluate pre-compiled XPaths against the cached DOM instances proved to be several orders of magnitude more efficient, especially against a repository with a large number of resources in the virtual filesystem.
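
The change amounts to replacing an RDF query with something like the following (a sketch along the same lines as the ACL example above; the actual methods live in the repository's persistence stack):

# Sketch: a getLastModifiedDate-style accessor as a pre-compiled XPath
# evaluated against the cached metadata Domlette instance.
from Ft.Xml.XPath import Compile, Context

FTSS_NS = 'http://xmlns.4suite.org/reserved'

# Compiled once, reused for every call
LAST_MODIFIED_EXPR = Compile('string(/ftss:MetaData/ftss:LastModifiedDate)')

def getLastModifiedDate(cached_metadata_doc):
    context = Context.Context(cached_metadata_doc,
                              processorNss={'ftss': FTSS_NS})
    return LAST_MODIFIED_EXPR.evaluate(context)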

Of all the above 'properties', only @type (which was being mirrored as rdf:type statements in RDF), @document-definition, and ftss:TimeToLive were being queried independently of the resources they are associated with. For example, the repository periodically monitors the system RDF graph for ftss:TimeToLive statements whose values are less than the current datetime (which indicates their TTL has expired). Expired resources cannot be determined by XPath evaluations against the metadata XML documents, since XPath is scoped to a single document by design. If the metadata documents were persisted in a native XML store, then the same query could be dispatched (as an XQuery) across all the metadata documents to identify those whose TTL had expired. But I digress...
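
For illustration, the shape of that periodic TTL scan is roughly as follows. I've written it with RDFLib rather than the repository's own 4RDF system-graph API, and the predicate URI is my guess based on the 4ss# URIs above:

# Illustration only: scan an RDF graph for TimeToLive values that have
# passed. Written against RDFLib; the predicate URI is an assumption.
from datetime import datetime
from rdflib import Namespace

FTSS = Namespace('http://schemas.4suite.org/4ss#')

def expired_resources(system_graph):
    now = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ')
    # ISO 8601 UTC timestamps compare correctly as plain strings
    return [ subject
             for subject, ttl in system_graph.subject_objects(FTSS['TimeToLive'])
             if unicode(ttl) < now ]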

The @document-definition attribute associates the resource (an XML document in this case) with a user-defined mapping (expressed as an XSLT transform or a set of XPath-to-RDF-statement templates) which couples its content with corresponding RDF statements. This presents an interesting scenario: if a document definition changes (document definitions are themselves first-class resources in the repository), then all the documents which refer to it must have their RDF statements re-mapped using the new document definition.

Note that such coupling only works in a controlled, closed system; it isn't possible where the mappings from XML to RDF are uncontrolled (à la GRDDL) and operate in a distributed context.

At any rate, the @document-definition property was yet another example of system metadata that had to be mirrored into the system RDF graph since document definitions need to be identified independently from the resources that register them.

In the end, only the metadata properties that had to be queried in this fashion were mirrored into RDF. I found this refactoring very instructive in identifying some caveats to be aware of when modeling large scale data sets as XML and RDF interchangeably. This very small architectural modification yielded quite a significant performance boost for the 4Suite repository, which (as far as I can tell) is the only content-management system that leverages XML and RDF as interchangeable representation formats in such an integrated fashion.

[Uche Ogbuji]

via Copia

Adding feeds to Liferea on the command line

Despite the kind help of the Rojo people I still can't get the service to import my updated feed lists ('An error has occurred...Failed to import: null...We apologize for the inconvenience.'), so I'm still reading my Web feeds on Liferea for now. One nice bonus with Liferea is the ability to add feeds from the command line (or really, any program), courtesy of GNOME's DBUS. Thanks to Aristotle for the tip, pointing me to 'a key message on liferea-help'. I've never used DBUS before, so I may be sketchy on some details, but I got it to work for me pretty easily.

I start with a simple script to report on added feed entries. It automatically handles feed lists in OPML or XBEL (I use the latter for managing my feed lists, and Liferea uses the former to manage its feed list).

import amara
import sets

old_feedlist = '/home/uogbuji/.liferea/feedlist.opml'
new_feedlist = '/home/uogbuji/devel/uogbuji/webfeeds.xbel'

def get_feeds(feedlist):
    doc = amara.parse(feedlist)
    #try OPML first
    feeds = [ unicode(f) for f in doc.xml_xpath(u'//outline/@xmlUrl') ]
    if not feeds:
        #then try XBEL
        feeds = [ unicode(f) for f in doc.xml_xpath(u'//bookmark/@href') ]
    return feeds

old_feeds = sets.Set(get_feeds(old_feedlist))
new_feeds = sets.Set(get_feeds(new_feedlist))

added = new_feeds.difference(old_feeds)
for a in added: print a

I then send a subscription request for each new item as follows:

$ dbus-send   --dest=org.gnome.feed.Reader /org/gnome/feed/Reader \
  org.gnome.feed.Reader.Subscribe \
  "string:http://feeds.feedburner.com/DrMacrosXmlRants"

The first time I got an error "Failed to open connection to session message bus: Unable to determine the address of the message bus". I did an apropos dbus and found dbus-launch. I added the suggested stanza to my .bash_profile:

if test -z "$DBUS_SESSION_BUS_ADDRESS" ; then
    ## if not found, launch a new one
    eval `dbus-launch --sh-syntax --exit-with-session`
    echo "D-BUS per-session daemon address is: $DBUS_SESSION_BUS_ADDRESS"
fi

After running dbus-launch, the dbus-send worked and Liferea immediately popped up a properties dialog box for the added feed, sticking it into the feeds tree at whatever point I happened to be browsing last in Liferea (not sure I like that choice of location). Simple drag & drop to put it where I want. Neat.
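
To close the loop, the diff script and the dbus-send call can be glued together in a few lines. This is a sketch of my own, not part of the scripts above; it assumes the added set computed earlier:

# Sketch: subscribe Liferea to each newly added feed by shelling out to
# dbus-send (uses the 'added' set from the script above).
import subprocess

def subscribe(feed_url):
    subprocess.call([
        'dbus-send', '--dest=org.gnome.feed.Reader',
        '/org/gnome/feed/Reader', 'org.gnome.feed.Reader.Subscribe',
        'string:%s' % feed_url,
    ])

for a in added:
    subscribe(a)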

[Uche Ogbuji]

via Copia

Recipe for freezing 4Suite or Amara apps (cross-platform)

Updated based on user experience.

Recently a user mentioned having trouble freezing an Amara app. This question comes up every six months or so, it seems, so I decided to make sure I have a recipe for easy reference. I also wanted to make sure that successful freezing would not require any changes in 4Suite before the next release. I started with the most recent success report I could find, by Roman Yakovenko. Actually, his recipe ran perfectly well as is. All I'm doing here is expanding on it.

Recipe: freezing 4Suite or Amara apps

Grab cxFreeze. I used the 3.0.1 release, which I built from source on Fedora Core 4 Linux with Python 2.4.1. Updated: I've updated freezehack.py to work with cxFreeze 3.0.2, thanks to Luis Miguel Morillas.

Grab freezehack.py, which was originally put together by Roman. Add it to your PYTHONPATH.

Add import freezehack to your main Python module for the executable to be created. Update: actually, based on Mike Powers' experience, you might have to add this import to every module that imports amara or Ft.

Freeze your program as usual. Run FreezePython.exe (or FreezePython on UNIX).

See the following sample session:

$ cat main.py
import freezehack
import amara
diggfeed = amara.parse("http://www.digg.com/rss/index.xml")
print diggfeed.rss.channel.item.title

$ FreezePython --install-dir dist --target-name testexe main.py
[SNIP]
Frozen binary dist/testexe created.

$ ./dist/testexe
Guess-the-Google - Simple but addictive game

In order to share the executable you have to copy the whole dist directory to the target machine, but that's all you should need to do. Python, 4Suite, Amara and any other such dependencies are bundled automatically.

Now back to the release.

[Uche Ogbuji]

via Copia

Merging Atom 1.0 feeds with Python

mergeatom.py

At the heart of Planet Atom is the mergeatom module. I've updated mergeatom a lot since I first released it. It's still a simple Python utility for merging multiple Atom 1.0 feeds into an aggregated feed. Some of the features:

  • Reads in a list of Atom URLs, files or content strings to be merged into a given target document
  • Puts out a complete, merged Atom document (duplicates by atom:id are suppressed).
  • Collates the entries according to date, allowing you to limit the total (this dedup-and-collate step is sketched below). WARNING: entries from the original Atom feed may be deleted as a result of ID duplicate removal or entry count limits.
  • Allows you to set the sort order of resulting entries
  • Uses atom:source elements, according to the spec, to retain key metadata from the originating feeds
  • Normalizes XML namespaces prefixes for output Atom elements (uses atom:*)
  • Allows you to limit contained entries to a date range
  • Handles base URI fixup intelligently (base URIs on feed elements are migrated onto copied entries so that contained relative links remain correct)

It requires atomixlib 0.3.0 or more recent, and Amara 1.1.6 or more recent.
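
By way of illustration only, here is a bare-bones rendering of the dedup-and-collate step in Amara bindery idioms. This is not mergeatom's actual API; it skips atom:source handling, base URI fixup and namespace normalization entirely:

# Illustrative sketch of the dedup-and-collate step, not mergeatom itself.
import amara

def naive_merge(feed_sources, max_entries=25):
    seen = {}
    for source in feed_sources:          # URLs, files or content strings
        doc = amara.parse(source)
        for entry in doc.feed.entry:
            entry_id = unicode(entry.id)
            if entry_id not in seen:     # suppress atom:id duplicates
                seen[entry_id] = (unicode(entry.updated), entry)
    # ISO 8601 timestamps collate correctly as plain strings
    collated = sorted(seen.values(), reverse=True)
    return [ entry for (updated, entry) in collated[:max_entries] ]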

[Uche Ogbuji]

via Copia

Planet Atom

Planet Atom is now live.

Planet Atom focuses Atom streams from authors with an affinity for syndication and Atom-specific issues. This site was developed by Sylvain Hellegouarch, Uche Ogbuji, and John L. Clark. Please visit the Planet Atom development site if you are interested in the source code. The complete list of sources is maintained in XBEL format (with some experimental extensions); please contact one of the site developers if you want to suggest a modification to this list.

John, Sylvain and I have been working at this on and off for over a month now (we've all been swamped with other things—the actual development of the site was fairly straightforward). Planet Atom is built as an aggregation of Atom 1.0 feeds into one larger feed (with entries collated, trimmed, etc.). It's built on 4Suite (for XSLT processing), CherryPy (for Web serving), Amara (for Atom feed slicing and dicing), atomixlib (for building the aggregate feed) and dateutil (for date wrangling), with Python and XML as the twin foundations, of course. Thanks to folks on the #atom and #swhack IRC channels for review and feedback.

[Uche Ogbuji]

via Copia

ElementTree in Python stdlib

The big news in Python/XML in December was the checking in of ElementTree into Python's standard library, which means the package will be part of Python 2.5. The python-dev summary has a good account of the move. I'm surprised it took so long. Even though I've been involved in at least one ElementTree versus 4Suite battle (I also wrote the earliest and possibly only comprehensive treatment of ElementTree), I've always appreciated that ElementTree's relative weightlessness makes it a decent option for Pythoneers who really don't care much about all the intricacies of XML and just have to reluctantly deal with a mess of angle brackets. That's the sort of library that ends up in the stdlib, and for good reason. And for sure, developers have needed a stdlib alternative to SAX and DOM for ages.

A very useful side effect of this move is a step towards merging PyXML into Python SVN, and eliminating the _xmlplus hack that was used to have PyXML installs shadow the stdlib originals. XML-SIG has argued about this for years, but there was never a stimulus for anyone to actually get something done about it. To quote from my article "Gems from the Mines: 2002 to 2003":

In early 2003, Martijn Faassen kicked off a long discussion on the future of PyXML by complaining about the "_xmlplus hack" that PyXML uses to serve as a drop-in replacement for the Python built-in XML libraries. After he reiterated the complaint, the discussion turned to a very serious one that underscored the odd in-between status of PyXML, and in what ways the PyXML package continued to be relevant as a stand-alone. Most of these issues are still not resolved, so this thread is an important airing of considerations that affect many Python/XML users to this day.

Congrats to Fredrik on this culmination of his hard work. And here's to the continued diversity of developer options in the very complex field of XML processing.

[Uche Ogbuji]

via Copia

Expat 2.0 (featured in 4Suite CVS)

Expat 2.0 has made it to the world, after a long incubation. Expat is, of course, the very popular XML parser in C originally developed by James Clark. The first I learned of this development was an announcement by Jeremy Kloth. His announcement also mentioned that current 4Suite CVS includes Expat 2.0, and that it's probably the first outside project to do so. In fact, the most recent 4Suite beta release included an Expat CVS snapshot that was for all practical purposes 2.0.

This is a very important milestone as it will allow Expat development to move on to more innovative pastures. And of course it adds the essential—support for AmigaOS (I'm going to get hate mail from my Amiga booster friends from college).

[Uche Ogbuji]

via Copia

Recipe: fast scan of an XML file for one field

If you have a huge XML file and you need to grab the first instance of a particular field in a fast and memory efficient manner, a simple one-liner with Amara's pushbind does the trick.

import amara
val = unicode(amara.pushbind("book.xml", "title").next())

Returns the text of the first title element in book.xml (which could be DocBook or any other format with a title element), loading hardly any of the file into memory. It also doesn't parse the file beyond the target element. It would be a shade slower to get such an element at the end of a file. For example, the following line gets the title of a DocBook index.

val = unicode(amara.pushbind("book.xml", "index/title").next())

Even when finding an element near the end of a file it's very fast. All my use cases were pretty much instantaneous working with a 150MB file (I'm working on convincing the client to avoid such huge files).

If the target element is not found, you'll get a StopIteration exception.
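
A defensive variant of the one-liner (same book.xml assumption as above) just catches that:

import amara

try:
    val = unicode(amara.pushbind("book.xml", u"title").next())
except StopIteration:
    val = None   # no title element anywhere in book.xml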

[Uche Ogbuji]

via Copia