Using Amara's pushtree for heavyweight XML processing in GRDDL and SPARQL querying

I've been using Amara to address my high-throughput needs for Extract Transform Load (ETL), querying, and processing of large amounts of RDF. In one particular part of the larger process, I needed to be able to stream very large XML documents in a particular dialect into RDF/XML. I sent an email to the Akara Google group describing the challenges and my reasons for wanting to use a streaming XML paradigm rather than XSLT.

I basically want to leverage Amara's pushtree and its use of coroutines as a minimal-overhead pipeline for dispatching events triggered by elements in the source XML, where the source XML is a GRDDL source document and the pushtree coroutine plays the role of the transformation. That task is still a work in progress; in the interest of expedience I went ahead and used XSLT, but I still need to try out some of what Uche suggested.

The other part, where I have made much more progress, is streaming the results of SPARQL queries (against a SPARQL service) into a CSV file from the command line, with minimal overhead (also using Amara, pushtree, and coroutines). A recent set of changes to layercake-python modified the sparqler command line to add an --endpoint option, which takes a SPARQL service URL. Other changes were made to the remote SPARQL service store to support this.

Also added was a new sparqlcsv script:

$ sparqlcsv --help
Usage: sparqlcsv [options] [SPARQLXMLFilePath]
Options:
 -h, --help            show this help message and exit
 -q QUOTECHAR, --quoteChar=QUOTECHAR
                       The quote character to use
 -c, --count           Just count the results, do not serialize to CSV
 -d DELIMITER, --delimiter=DELIMITER
                       The delimiter to use
 -o FILEPATH, --output=FILEPATH
                       The path where to write the resulting CSV file

This script takes SPARQL XML results either from the file given as the first argument or from STDIN if none is specified, and writes a CSV file to STDOUT or to a file. The general architectural idea is to build a bash pipeline from the SPARQL service into a CSV file (and eventually into a relational database for more sophisticated analysis), or to STDOUT for subsequent processing along the pipeline.

So, now I can run a query against Virtuoso and stream the CSV results into a file (with minimal processing overhead):

$ sparqler --owl=..OWL file.. --ns=..prefix..=..URL.. \
           --endpoint=..SPARQL service URL.. \
"SELECT ... { ... }" | sparqlcsv | .. subsequent processong ..

Where the namespaces in the OWL/RDF file (provided by the --owl option) and those given explicitly via the --ns option are added as namespace prefix definitions at the top of the SPARQL query, which is then dispatched to the remote SPARQL service located at the URL provided via the --endpoint option. Alternatively, the -o option can be used to specify a file to which the CSV content is written.
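
As an aside, the mechanics of that step are simple: the collected prefix bindings are rendered as PREFIX declarations ahead of the query text before dispatch. A rough sketch (not the actual sparqler code; the names here are illustrative only):

def with_prefixes(query, nsBindings):
    # nsBindings maps prefix -> namespace URI, collected from --owl and --ns
    decls = ''.join(['PREFIX %s: <%s>\n' % (prefix, uri)
                     for prefix, uri in nsBindings.items()])
    return decls + query

print with_prefixes("SELECT ?name WHERE { ?s foaf:name ?name }",
                    {'foaf': 'http://xmlns.com/foaf/0.1/'})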

The sparqlcsv script uses a pushtree coroutine to stream XML content into a CSV file in this way:

def produce_csv(doc, csvWriter, justCount):
    cnt = Counter()
    @coroutine
    def receive_nodes(cnt):
        while True:
            node = yield
            if justCount:
                # only tally the solutions, don't serialize them
                cnt.counter += 1
            else:
                rt = []
                badChars = False
                for binding in node.binding:
                    try:
                        rt.append(U(binding).encode('ascii'))
                    except UnicodeEncodeError:
                        # fall back to dropping non-ASCII characters
                        rt.append(U(binding).encode('ascii', 'ignore'))
                        badChars = True
                        print >> sys.stderr, "Skipping character", U(binding)
                if badChars:
                    cnt.skipCounter += 1
                csvWriter.writerow(rt)
    target = receive_nodes(cnt)
    pushtree(doc, u'result', target.send, entity_factory=entity_base)
    target.close()
    return cnt

Where doc is an XML document (as a string), csvWriter is a csv.writer instance, and the last parameter indicates whether only the size of the solution sequence should be counted rather than serializing the results to CSV.
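
The coroutine decorator and Counter object are defined elsewhere in the script. For reference, a minimal sketch of what they might look like (the usual generator-priming pattern; the actual definitions in layercake-python may differ):

def coroutine(func):
    # Prime the generator so it is immediately ready to receive .send() calls
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start

class Counter(object):
    # Hypothetical tally object: total results, and rows with skipped characters
    def __init__(self):
        self.counter = 0
        self.skipCounter = 0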

Amara 1.2rc1

4Suite has been bumped to 1.0.2 with some important bug fixes. I also pushed Amara a step closer to 1.2 with a 1.2rc1 release. I'll make it 1.2 final some time this week, and then it's on to some pretty big architectural changes for 2.0. All test reports are welcome, especially from Web server users. Jeremy might have figured out a workaround for the multiple-interpreter issue discussed in "multiple interpreters and extension modules". That should fix the remaining known problems with mod_python.

[Uche Ogbuji]

via Copia

Amara 1.2 goes alpha, and other developments

First of all, 4Suite went 1.0 rather quietly, because the day-job schedule has left room for very little besides quiet releases. It's probably just as well, because by common standards 4Suite has been 1.0 grade for years. Under any less conservative version numbering scheme it would be 4Suite 3.0 by now.

I'm pushing Amara to 1.2 (a more typical progression of version numbers in that case) and after a developers-only alpha, we've released alpha 1.2a2 publicly, but quietly. As I've hinted before I have a lot of ideas for Amara post 1.2. The next major branch will be a full rewrite, probably to be released as Amara 2.0. Anyway, see the draft for the 1.2 full release announcement.

I also put together a quick start recipe for Amara on Ubuntu, and Luis Miguel Morillas has one for Windows users in Spanish. He says he'll be translating it to English soon, and when he does, I'm sure he'll link it from his "Amara Installers for Windows Users" page.

[Uche Ogbuji]

via Copia

Quick grab of XHTML metadata

I recently needed some code to quickly scrape the metadata from XHTML Web pages, so I kicked up the following code:

import amara

XHTML1_NS = u'http://www.w3.org/1999/xhtml'
PREFIXES = { u'xh': XHTML1_NS }

def get_xhtml_metadata(source):
    md = {}
    # Iterate over the children of the XHTML head element
    for node in amara.pushbind(source, u'/xh:html/xh:head/*', prefixes=PREFIXES):
        if node.localName == u'title':
            md[u'title'] = unicode(node)
        elif node.localName == u'link':
            # Capture all attributes of each link element
            linkinfo = dict([ (attr.name, unicode(attr))
                              for attr in node.xml_xpath(u'@*') ])
            md.setdefault(u'links', []).append(linkinfo)
        elif node.xml_xpath(u'self::xh:meta[@name]'):
            # meta elements with a name attribute: record name -> content
            md[node.name] = unicode(node.content)
    return md

if __name__ == "__main__":
    import sys, pprint
    source = sys.argv[1]
    pprint.pprint(get_xhtml_metadata(source))

So, for example, scraping Planet XMLhack:

$ python xhtml-metadata.py http://planet.xmlhack.com/
{u'links': [{u'href': u'planet.css',
             u'media': u'screen',
             u'rel': u'stylesheet',
             u'title': u'Default',
             u'type': u'text/css'},
            {u'href': u'/index.rdf',
             u'rel': u'alternate',
             u'title': u'RSS',
             u'type': u'application/rss+xml'}],
 u'title': u'Planet XMLhack: Aggregated weblogs from XML hackers and commentators'}

[Uche Ogbuji]

via Copia

Amara trimxml: an XML reporting tool

For the past few months in my day job (consulting for Sun Microsystems) I've been working on what you can call a really big (and hairy) enterprise mashup. I'm in charge of the kit that actually does the mashing-up. It's an XML pipeline that drives merging, processing and correction of data streams. There are a lot of very intricately intersecting business rules and without the ability to make very quick ad-hoc reports from arbitrary data streams, there is no way we could get it all sorted out given our aggressive deadlines.

This project benefits greatly from a side task I had sitting on my hard drive, and that I've since polished and worked into the Amara 1.1.9 release. It's a command-line tool called trimxml, which is basically a reporting tool for XML. You just point it at some XML data source and give it an XSLT pattern for the bits of interest, and optionally some XPath to tune the report and the display. It's designed to read only as much of the file as needed, which helps with performance. In the project I discussed above, the XML files of interest range from 3 to 100MB.

Just to provide a taste, using Ovidiu Predescu's old DocBook example, you could get the title as follows:

trimxml http://xslt-process.sourceforge.net/docbook-example.xml book/bookinfo/title

Since you know there's just one title you care about, you can make sure trimxml stops looking after it finds it:

trimxml -c 1 http://xslt-process.sourceforge.net/docbook-example.xml book/bookinfo/title

-c is a count of results, and you can of course set it to a value other than 1.

You can get all titles in the document, regardless of location:

trimxml http://xslt-process.sourceforge.net/docbook-example.xml title

Or just the titles that contain the string "DocBook":

trimxml http://xslt-process.sourceforge.net/docbook-example.xml title "contains(., 'DocBook')"

The last argument is a filtering XPath expression. Only nodes that satisfy that condition are reported.

By default each entire matching node is serialized, tags and all. You can specify something different to display for each match using the -d flag. For example, to print just the first 10 characters of each title, and not the title tags themselves, use:

trimxml -d "substring(., 1, 10)" http://xslt-process.sourceforge.net/docbook-example.xml title

There are other options and features, and of course you can use the tool on local files as well as Web-based files.

In another useful development in the 4Suite/Amara world, we now have a Wiki.

With 4Suite, Amara, WSGI.xml, Bright Content and the day job I have no idea when I'll be able to get back to working on Akara, so I finally set up some Wikis for 4Suite.org. The main starting point is:

http://notes.4suite.org/

Some other useful starting points are:

http://notes.4suite.org/AmaraXmlToolkit
http://notes.4suite.org/WsgiXml

As a bit of an extra anti-vandalism measure I have set the above three entry pages for editing only by 4Suite developers. [...] Of course you can edit and add other pages in the usual Wiki fashion. You might want to start with http://notes.4suite.org/4SuiteFaq, which is a collaborative addendum to the official FAQ.

[Uche Ogbuji]

via Copia

Amara en Español

What open-source development time I've had has been focused on pushing 4Suite XML to 1.0 (and we're on the very final leg of that journey). I'm still putting a bit of time into Amara, but I should have even more time for it soon, and I have many ideas for what to do with that time.

Others have been up to fun stuff with Amara as well, and nowhere more so, it seems, than among Spanish speakers. Luis Miguel Morillas has been putting Amara through its paces in his LivingPyXML project. César Cárdenas Desales has contributed a nice intro, "Procesamiento fácil de XML con Python y Amara":

Although the Python standard library includes tools and modules for processing XML with SAX and DOM, many programmers have felt that there ought to be simpler ways of working with XML. Amara is a collection of tools that make processing XML with Python easier. This tutorial gives a brief introduction to using Amara for those tasks.

Yep. That was pretty much the entire idea.

Original link (not as up-to-date): "Procesamiento fácil de XML con Python y Amara"

[Uche Ogbuji]

via Copia

Schematron creeping on the come-up (again)

Schematron is the boss XML schema language your boss has never heard of. Unfortunately it's had some slow times of late, but it has surged back with a vengeance thanks to honcho Rick Jelliffe, with logistical support from Betty Harvey. There's now a working mailing list and a Wiki. Rick says that Schematron is slated to become an ISO Standard next month.

The text for the Final Draft International Standard for Schematron has now been approved by multi-national voting. It is copyright ISO, but it is basically identical to the draft at www.schematron.com.

The standard is 30 pages; 21 are normative, including schema listings and a characterization of Schematron semantics in predicate logic. Appendixes cover how to use other query-language bindings (besides XSLT1), how to use Schematron as a vocabulary, how to express multi-lingual diagnostics, and the design requirements for ISO Schematron.

Congrats to Rick. Here's to the most important schema language of them all (yes, I do mean that). I guess I'll have to check Scimitar, Amara's fast Schematron processor, for compliance with the updated standard.

[Uche Ogbuji]

via Copia

Adding feeds to Liferea on the command line

Despite the kind help of the Rojo people I still can't get the service to import my updated feed lists ('An error has occurred...Failed to import: null...We apologize for the inconvenience.'), so I'm still reading my Web feeds in Liferea for now. One nice bonus with Liferea is the ability to add feeds from the command line (or really, from any program), courtesy of GNOME's D-BUS. Thanks to Aristotle for the tip, pointing me to 'a key message on liferea-help'. I've never used D-BUS before, so I may be sketchy on some details, but I got it to work pretty easily.

I start with a simple script to report on added feed entries. It automatically handles feed lists in OPML or XBEL (I use the latter for managing my feed lists, and Liferea uses the former to manage its feed list).

import amara
import sets

old_feedlist = '/home/uogbuji/.liferea/feedlist.opml'
new_feedlist = '/home/uogbuji/devel/uogbuji/webfeeds.xbel'

def get_feeds(feedlist):
    doc = amara.parse(feedlist)
    #try OPML first
    feeds = [ unicode(f) for f in doc.xml_xpath(u'//outline/@xmlUrl') ]
    if not feeds:
        #then try XBEL
        feeds = [ unicode(f) for f in doc.xml_xpath(u'//bookmark/@href') ]
    return feeds

old_feeds = sets.Set(get_feeds(old_feedlist))
new_feeds = sets.Set(get_feeds(new_feedlist))

added = new_feeds.difference(old_feeds)
for a in added: print a

I then send a subscription request for each new item as follows:

$ dbus-send   --dest=org.gnome.feed.Reader /org/gnome/feed/Reader \
  org.gnome.feed.Reader.Subscribe \
  "string:http://feeds.feedburner.com/DrMacrosXmlRants"

The first time I got an error "Failed to open connection to session message bus: Unable to determine the address of the message bus". I did an apropos dbus and found dbus-launch. I added the suggested stanza to my .bash_profile:

if test -z "$DBUS_SESSION_BUS_ADDRESS" ; then
    ## if not found, launch a new one
    eval `dbus-launch --sh-syntax --exit-with-session`
    echo "D-BUS per-session daemon address is: $DBUS_SESSION_BUS_ADDRESS"
fi

After running dbus-launch, the dbus-send command worked and Liferea immediately popped up a properties dialog box for the added feed, sticking it into the feeds tree at whatever point I last happened to be browsing in Liferea (not sure I like that choice of location). A simple drag & drop puts it where I want. Neat.
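
To subscribe to all of the newly added feeds in one go, the diff script above could simply shell out to dbus-send with the same arguments as the manual command. A sketch (assuming dbus-send is on the PATH and the session bus is up):

import subprocess

def subscribe(feed_url):
    # Ask Liferea, via D-BUS, to subscribe to the given feed URL
    subprocess.call(['dbus-send', '--dest=org.gnome.feed.Reader',
                     '/org/gnome/feed/Reader',
                     'org.gnome.feed.Reader.Subscribe',
                     'string:%s' % feed_url])

for a in added:
    subscribe(a)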

[Uche Ogbuji]

via Copia

Recipe for freezing 4Suite or Amara apps (cross-platform)

Updated based on user experience.

Recently a user mentioned having trouble freezing an Amara app. This question comes up every six months or so, it seems, so I decided to make sure I have a recipe for easy reference. I also wanted to make sure that successful freezing would not require any changes in 4Suite before the next release. I started with the most recent success report I could find, by Roman Yakovenko. Actually, his recipe ran perfectly well as is. All I'm doing here is expanding on it.

Recipe: freezing 4Suite or Amara apps

Grab cxFreeze. I used the 3.0.1 release, which I built from source on Fedora Core 4 Linux with Python 2.4.1. Updated: I've updated freezehack.py to work with cxFreeze 3.0.2, thanks to Luis Miguel Morillas.

Grab freezehack.py, which was originally put together by Roman. Add it to your PYTHONPATH.

Add import freezehack to your main Python module for the executable to be created. Update: actually, based on Mike Powers' experience, you might have to add this import to every module that imports amara or Ft.

Freeze your program as usual. Run FreezePython.exe (or FreezePython on UNIX).

See the following sample session:

$ cat main.py
import freezehack
import amara
diggfeed = amara.parse("http://www.digg.com/rss/index.xml")
print diggfeed.rss.channel.item.title

$ FreezePython --install-dir dist --target-name testexe main.py
[SNIP]
Frozen binary dist/testexe created.

$ ./dist/testexe
Guess-the-Google - Simple but addictive game

In order to share the executable you have to copy the whole dist directory to the target machine, but that's all you should need to do. Python, 4Suite, Amara and any other such dependencies are bundled automatically.

Now back to the release.

[Uche Ogbuji]

via Copia

Merging Atom 1.0 feeds with Python

mergeatom.py

At the heart of Planet Atom is the mergeatom module. I've updated mergeatom a lot since I first released it. It's still a simple Python utility for merging multiple Atom 1.0 feeds into an aggregated feed. Some of the features:

  • Reads in a list of atom URLs, files or content strings to be merged into a given target document
  • Puts out a complete, merged Atom document (duplicates by atom:id are suppressed).
  • Collates the entries according to date, allowing you to limit the total. WARNING: entries from the original Atom feed may be dropped because of duplicate-ID removal or the entry count limit.
  • Allows you to set the sort order of resulting entries
  • Uses atom:source elements, according to the spec, to retain key metadata from the originating feeds
  • Normalizes XML namespace prefixes for output Atom elements (uses atom:*)
  • Allows you to limit contained entries to a date range
  • Handles base URI fixup intelligently (base URIs on feed elements are migrated onto copied entries so that contained relative links remain correct)

It requires atomixlib 0.3.0 or more recent, and Amara 1.1.6 or more recent.
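
Just to illustrate the heart of the merge logic (this is not mergeatom's actual API, only a sketch of deduplicating by atom:id, collating by date, and capping the total):

def merge_entries(feeds, limit=None, newest_first=True):
    # feeds: a list of feeds, each a list of (atom_id, updated, entry) tuples
    seen = {}
    for feed in feeds:
        for atom_id, updated, entry in feed:
            # duplicates by atom:id are suppressed; keep the newest version
            if atom_id not in seen or updated > seen[atom_id][0]:
                seen[atom_id] = (updated, entry)
    # collate according to date and optionally cap the total number of entries
    collated = sorted(seen.values(), key=lambda pair: pair[0],
                      reverse=newest_first)
    if limit is not None:
        collated = collated[:limit]
    return [entry for updated, entry in collated]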

[Uche Ogbuji]

via Copia