Amara trimxml: an XML reporting tool

For the past few months in my day job (consulting for Sun Microsystems) I've been working on what you could call a really big (and hairy) enterprise mashup. I'm in charge of the kit that actually does the mashing-up: an XML pipeline that drives merging, processing and correction of data streams. There are a lot of intricately intersecting business rules, and without the ability to make very quick ad-hoc reports from arbitrary data streams there is no way we could get it all sorted out given our aggressive deadlines.

This project benefits greatly from a side task I had sitting on my hard drive, and that I've since polished and worked into the Amara 1.1.9 release. It's a command-line tool called trimxml, which is basically a reporting tool for XML. You just point it at some XML data source and give it an XSLT pattern for the bits of interest, and optionally some XPath to tune the report and the display. It's designed to read only as much of the file as needed, which helps with performance. In the project I discussed above, the XML files of interest range from 3 to 100MB.

Just to provide a taste using Ovidiu Predescu's old Docbook example, you could get the title as follows:

trimxml http://xslt-process.sourceforge.net/docbook-example.xml book/bookinfo/title

Since you know there's just one title you care about, you can make sure trimxml stops looking after it finds it:

trimxml -c 1 http://xslt-process.sourceforge.net/docbook-example.xml book/bookinfo/title

-c is a count of results, and you can set it to a value other than 1, of course.
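
To give a feel for why a result count helps with big files, here is a minimal Python sketch of the general idea: parse incrementally and stop as soon as enough matches have turned up. It is not trimxml's actual implementation and does not use Amara; it relies only on the standard library's xml.etree.ElementTree.iterparse. The function name first_matches and the sample document are made up for illustration, and the sample is not the content of the DocBook example linked above.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample data, standing in for a much larger XML stream.
SAMPLE = b"""<book>
  <bookinfo><title>An Example Book</title></bookinfo>
  <chapter><title>Getting Started with DocBook</title></chapter>
  <chapter><title>Closing Notes</title></chapter>
</book>"""


def first_matches(source, tag, count=None):
    """Collect the text of up to `count` elements named `tag`.

    `source` is a file-like object that is parsed incrementally.
    With count=None every match is collected.
    """
    results = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            results.append(elem.text)
            if count is not None and len(results) >= count:
                break  # enough results: stop pulling parse events
    return results


print(first_matches(io.BytesIO(SAMPLE), "title", count=1))
```

On a document this small the whole input fits in the parser's first read, so nothing is saved. On a stream of many megabytes, breaking out of the loop means the remainder never has to be parsed.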

You can get all titles in the document, regardless of location:

trimxml http://xslt-process.sourceforge.net/docbook-example.xml title

Or just the titles that contain the string "DocBook":

trimxml http://xslt-process.sourceforge.net/docbook-example.xml title "contains(., 'DocBook')"

The second argument is a filtering XPath expression. Only nodes that satisfy that condition are reported.
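
For readers less familiar with XPath, a rough Python equivalent of that filter may help. In XPath, "." inside contains() stands for the string-value of the node, which for an element is all of its descendant text joined together, so a match still counts when the word sits inside nested inline markup. The sketch below assumes made-up sample data and hypothetical helper names (string_value, filtered_titles); it illustrates the semantics only and says nothing about how trimxml evaluates the expression.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; note the nested <emphasis> element in one title.
SAMPLE = """<book>
  <title>An Example Book</title>
  <chapter><title>Getting Started with <emphasis>DocBook</emphasis></title></chapter>
  <chapter><title>Closing Notes</title></chapter>
</book>"""


def string_value(elem):
    """Approximate the XPath string-value of an element:
    all descendant text, joined in document order."""
    return "".join(elem.itertext())


def filtered_titles(xml_text, needle):
    """Return the string-value of each <title> whose text contains `needle`
    (case-sensitive, like XPath's contains())."""
    root = ET.fromstring(xml_text)
    return [
        string_value(title)
        for title in root.iter("title")
        if needle in string_value(title)
    ]


print(filtered_titles(SAMPLE, "DocBook"))
```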

By default each matching node is reported in full, markup included, so a title comes out wrapped in its title tags. You can specify something different to display for each match using the -d flag. For example, to print just the first 10 characters of each title, without the title tags themselves, use:

trimxml -d "substring(., 1, 10)" http://xslt-process.sourceforge.net/docbook-example.xml title
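
One detail worth keeping in mind when writing display expressions: XPath numbers string positions from 1, not 0. The Python sketch below emulates the XPath 1.0 substring() rule for ordinary finite arguments (it ignores the NaN and infinity cases the specification also covers). The function name xpath_substring and the sample string are made up for illustration. A start of 1 with a length of 10 yields ten characters, while a start of 0 yields only nine, because position 0 does not exist.

```python
import math


def xpath_round(x):
    """XPath 1.0 round(): nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)


def xpath_substring(s, start, length):
    """Emulate XPath 1.0 substring(s, start, length) for finite numbers.

    Characters are numbered from 1. A character at position p is kept
    when round(start) <= p < round(start) + round(length).
    """
    first = xpath_round(start)
    end = first + xpath_round(length)
    return "".join(ch for pos, ch in enumerate(s, start=1) if first <= pos < end)


title = "Getting Started"  # made-up sample title
print(len(xpath_substring(title, 1, 10)))  # ten characters
print(len(xpath_substring(title, 0, 10)))  # nine characters
```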

There are other options and features, and of course you can use the tool on local files as well as Web-based files.

In another useful development in the 4Suite/Amara world, we now have a Wiki.

With 4Suite, Amara, WSGI.xml, Bright Content and the day job, I have no idea when I'll be able to get back to working on Akara, so I finally set up some Wikis for 4Suite.org. The main starting point is:

http://notes.4suite.org/

Some other useful starting points are:

http://notes.4suite.org/AmaraXmlToolkit
http://notes.4suite.org/WsgiXml

As a bit of an extra anti-vandalism measure, I have set the above three entry pages for editing only by 4Suite developers. [...] Of course you can edit and add other pages in the usual Wiki fashion. You might want to start with http://notes.4suite.org/4SuiteFaq, which is a collaborative addendum to the official FAQ.

[Uche Ogbuji]

via Copia