For the past few months in my day job (consulting for Sun Microsystems)
I've been working on what you can call a really big (and hairy)
enterprise mashup. I'm in charge of the kit that actually does the
mashing-up. It's an XML pipeline that drives merging, processing and
correction of data streams. There are a lot of very intricately
intersecting business rules and without the ability to make very quick
ad-hoc reports from arbitrary data streams, there is no way we could get
it all sorted out given our aggressive deadlines.
This project benefits greatly from a side task I had sitting on my hard
drive, and that I've since polished and worked into the Amara 1.1.9
release.
It's a command-line tool called trimxml which is basically a reporting
tool for XML. You just point it at some XML data source and give it an
XSLT pattern for the bits of interest and optionally some XPath to tune
the report and the display. It's designed to only read as much of the
file as needed, which helps with performance. In the project I
discussed above the XML files of interest range from 3-100MB.
Just to provide a taste using Ovidiu Predescu's old DocBook example, you
could get the title as follows:
trimxml http://xslt-process.sourceforge.net/docbook-example.xml book/bookinfo/title
Since you know there's just one title you care about, you can make sure
trimxml stops looking after it finds it:
trimxml -c 1 http://xslt-process.sourceforge.net/docbook-example.xml book/bookinfo/title
-c is a count of results, and you can set it to a value other than 1, of
course.
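
To make the "stop reading early" idea concrete, here is a minimal sketch using only Python's standard library. It is not how trimxml itself is implemented; the function name first_matches and the inline sample document are made up for illustration, and it matches on a bare element name rather than a full XSLT pattern.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DocBook file; not the real example document.
SAMPLE = b"""<book>
  <bookinfo><title>DocBook sample</title></bookinfo>
  <chapter><title>Introduction</title></chapter>
  <chapter><title>Using DocBook</title></chapter>
</book>"""

def first_matches(source, tag, count=None):
    """Yield serialized elements named `tag`, stopping after `count` matches."""
    found = 0
    # iterparse pulls the source in chunks and reports each element as it closes.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield ET.tostring(elem, encoding="unicode").strip()
            found += 1
            if count is not None and found >= count:
                return  # stop here; no further chunks are requested from the source

for xml in first_matches(io.BytesIO(SAMPLE), "title", count=1):
    print(xml)
```

With count=1 this prints only the first title element; with count left as None it walks the whole document, which is the difference the -c flag makes on a large file.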
You can get all titles in the document, regardless of location:
trimxml http://xslt-process.sourceforge.net/docbook-example.xml title
Or just the titles that contain the string "DocBook":
trimxml http://xslt-process.sourceforge.net/docbook-example.xml title "contains(., 'DocBook')"
The last argument is a filtering XPath expression. Only nodes that
satisfy that condition are reported.
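
A point worth spelling out: in XPath, contains(., 'DocBook') tests the string value of the node, meaning all of its descendant text joined together, so a title with inline markup still matches. The sketch below mimics that with the standard library; the helper names string_value and filtered and the sample data are assumptions for illustration, not part of trimxml or Amara.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the last title wraps the keyword in inline markup.
SAMPLE = """<book>
  <bookinfo><title>DocBook sample</title></bookinfo>
  <chapter><title>Introduction</title></chapter>
  <chapter><title>Using <emphasis>DocBook</emphasis></title></chapter>
</book>"""

def string_value(elem):
    """Approximate the XPath string value: all descendant text, concatenated."""
    return "".join(elem.itertext())

def filtered(root, tag, keep):
    """Return elements named `tag`, in document order, for which keep(elem) is true."""
    return [e for e in root.iter(tag) if keep(e)]

root = ET.fromstring(SAMPLE)
hits = filtered(root, "title", lambda e: "DocBook" in string_value(e))
for e in hits:
    print(string_value(e))
```

Two of the three titles are reported here, including the one where "DocBook" sits inside an emphasis element.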
By default each entire matching node is reported, so the output for each
match is the whole title element, tags included. You can specify
something different to display for each match using the -d flag. For
example, to just print the first 10 characters of each title, and not
the title tags themselves, use:
trimxml -d "substring(., 1, 10)" http://xslt-process.sourceforge.net/docbook-example.xml title
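
One thing to watch in display expressions like this: XPath's substring() counts character positions from 1, not 0, so a start of 0 yields one character fewer than the length suggests. Here is a small Python sketch of the XPath 1.0 rule, ignoring the NaN and infinity corner cases; the function names are mine, not from any library.

```python
import math

def xpath_round(x):
    """XPath 1.0 round(): nearest integer, halves rounded toward positive infinity."""
    return math.floor(x + 0.5)

def xpath_substring(s, start, length):
    """Keep characters whose 1-based position p satisfies
    round(start) <= p < round(start) + round(length)."""
    first = xpath_round(start)
    end = first + xpath_round(length)  # exclusive upper bound
    return "".join(ch for pos, ch in enumerate(s, start=1) if first <= pos < end)

title = "DocBook sample"  # hypothetical title text
print(repr(xpath_substring(title, 1, 10)))
print(repr(xpath_substring(title, 0, 10)))
```

Starting at 1 gives the first ten characters ('DocBook sa'), while starting at 0 gives only nine ('DocBook s'), because position 0 falls before the first character.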
There are other options and features, and of course you can use the tool
on local files as well as Web-based files.
In another useful development in the 4Suite/Amara world, we now have a
Wiki.
With 4Suite, Amara, WSGI.xml, Bright Content and the day job I have no
idea when I'll be able to get back to working on Akara, so I finally set
up some Wikis for 4Suite.org. The main starting point is:
http://notes.4suite.org/
Some other useful starting points are:
http://notes.4suite.org/AmaraXmlToolkit
http://notes.4suite.org/WsgiXml
As a bit of an extra anti-vandalism measure I have set the three entry
pages above to be editable only by 4Suite developers. [...] Of course you
can edit and add other pages in usual Wiki fashion. You might want to
start with http://notes.4suite.org/4SuiteFaq which is a collaborative
addendum to the official FAQ.
[Uche Ogbuji]