Another on Amara

"XML Parsing with Python"--Derek Willis

Let’s face it, relational database types don’t like XML files. They’re structured, sure, but not in quite the way we’re used to. So pulling them apart is a chore for which there are many tools but few that seem to fit easily into the CAR [Computer-Assisted Reporting] mindset. Enter Python and the Amara toolkit. Amara builds on 4Suite, which processes XML and RDF, and it works in a very Pythonic way by essentially turning XML data into Python objects. If I have to parse XML into a relational database, Amara is my tool of choice.

One thing that I've especially appreciated about feedback on Amara is the way users cite it as an example of the essential power of Python, and why it is a draw even from outside of Python. This has always been my aim, more conventionally with 4Suite, and more subversively with Amara. Compared with the weary nature of, say, Java XML data bindings, Amara is a nice advertisement for Python's dynamism.

Later on Willis concludes:

CAR folks can think of [XML as processed through Amara] as calling field names, and instead of printing out elements you can insert them into a database. Nice and easy - the way everybody says XML should be.

And just the way I intended. Nice.
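
To make the "field names into a database" idea concrete, here is a minimal sketch. It uses the standard library's ElementTree and sqlite3 as stand-ins rather than Amara itself, and the element names, table, and sample data are all made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input: these element names are invented for this example.
XML = """<contributions>
  <contribution><donor>A. Smith</donor><amount>250</amount></contribution>
  <contribution><donor>B. Jones</donor><amount>1000</amount></contribution>
</contributions>"""

def load_contributions(xml_text, conn):
    """Insert one row per <contribution> element and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contributions (donor TEXT, amount INTEGER)"
    )
    root = ET.fromstring(xml_text)
    # Treat child elements like field names: one tuple per record.
    rows = [
        (c.findtext("donor"), int(c.findtext("amount")))
        for c in root.findall("contribution")
    ]
    conn.executemany("INSERT INTO contributions VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
count = load_contributions(XML, conn)
total = conn.execute("SELECT SUM(amount) FROM contributions").fetchone()[0]
print(count, total)
```

With Amara the record access would read as attribute access on bound objects instead of findtext calls, but the shape of the loop is the same.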

[Uche Ogbuji]

via Copia

Amara Appreciations

Last week I mentioned a kind message Sanjay Velamparambil sent to the 4Suite list. As I said, "it's always pleasing to me as a developer to hear a voice ring out from the noise clearly appreciating the value of the work we've done." This week, Amara gets some love.

In a message to the 4Suite list Wednesday Tom Lazar said:

i just wanted to chime in that just yesterday I had an urgent, real-world problem in where I needed to manipulate an XML Document
programmatically - grep/sed/awk on the textfile would have been too
difficult ("now you've got two problems"[tm]) and an XSLT alone
wouldn't have done it either.

using Amara i hacked a python script that did the whole job in
(literally) ten minutes. as I started the script I was a bit
apprehensive: afterall our script would pick out certain nodes and
assign new values to them (or delete them, depending) and then write
it back to the file - but it all worked without a hitch.

and looking at the script, you'd never think it was handling XML at
all ;-)

so thanks for making Amara and keep up the good work!

You're very welcome, Tom, and thanks for all the help with improving the bindery mutation API.
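
Tom's script is not shown, so the following is only a sketch of the kind of job he describes: pick out certain nodes, assign new values or delete them, and serialize the result. It uses the standard library's ElementTree rather than Amara, and the catalog document, the status attribute, and the update_prices helper are hypothetical.

```python
import xml.etree.ElementTree as ET

def update_prices(root, new_price, drop_status="discontinued"):
    """Set <price> on every <item>; remove items whose status matches drop_status."""
    # Iterate over a copy of the list because we remove children while looping.
    for item in list(root.findall("item")):
        if item.get("status") == drop_status:
            root.remove(item)
        else:
            item.find("price").text = new_price
    return root

root = ET.fromstring(
    '<catalog><item status="active"><price>1</price></item>'
    '<item status="discontinued"><price>2</price></item></catalog>'
)
update_prices(root, "3")
result = ET.tostring(root, encoding="unicode")
print(result)
# To write back to a file instead:
# ET.ElementTree(root).write(path, encoding="utf-8")
```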

As if that wasn't enough, the same day I got a private message from another user. I haven't asked his permission so I won't identify him at the moment, but he actually put together a video clip of himself demonstrating Amara. In the clip he shows the eight or so custom Python modules for XML processing that were replaced with a one-liner using Amara. My only regret is that he and his team had to write all that other code in the first place, before finding Amara, but at least they don't have to maintain it any more.

Feedback like that makes all the long hours worthwhile.


A 4Suite Appreciation

Sanjay Velamparambil's message

The best part of building 4Suite has always been the community. That goes way back to when Mike Olson and I were chuffed to hear from folks who'd stumbled across our inchoate DOM implementation (about 6 generations of the code ago), or our initial stabs at XPath and XSLT (about 4 generations in that case). Now the 4Suite community is a loud, thriving bazaar (100 messages to the user's list in a slow month), with all timbres of voices and all sorts of agendas. It's always pleasing to me as a developer to hear a voice ring out from the noise clearly appreciating the value of the work we've done. Thanks, Sanjay, for a very nice note.


Deletion added to friendlier Amara mutation

As I've mentioned, I added a friendlier mutation API to Amara. Deletion didn't come up in the original discussion, but I just got around to addressing that as well. Now checked in are enhancements that support the following use cases:

Use case 10:

Source doc: <a><b>spam</b><b>eggs</b></a>
Code: del doc.a.b
Result: doc mutated to <a><b>eggs</b></a>

Use case 11:

Source doc: <a><b>spam</b><b>eggs</b></a>
Code: del doc.a.b[0]
Result: doc mutated to <a><b>eggs</b></a>

Use case 12:

Source doc: <a><b>spam</b><b>eggs</b></a>
Code: del doc.a.b[1]
Result: doc mutated to <a><b>spam</b></a>

Use case 13:

Source doc: <a><b>spam</b><b>eggs</b></a>
Code: del doc.a.b[2]
Result: IndexError

Use case 14:

Source doc: spam
Code: del doc.a.b
Result: doc mutated to spam
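
The deletion semantics of the element-only cases can be sketched with a toy wrapper over the standard library's ElementTree. This is an illustration only, not Amara's implementation: the Node class and bind helper are hypothetical names, and attribute handling is left out.

```python
import xml.etree.ElementTree as ET

class Node(object):
    """Toy binding over ElementTree; an illustration, not Amara's code."""

    def __init__(self, elem, parent=None):
        # Store internal state directly on the instance.
        object.__setattr__(self, "_elem", elem)
        object.__setattr__(self, "_parent", parent)

    def __getattr__(self, name):
        # Called only when normal lookup fails: resolve to the FIRST
        # child element named `name`.
        child = self._elem.find(name)
        if child is None:
            raise AttributeError(name)
        return Node(child, self._elem)

    def __delattr__(self, name):
        # `del node.b` removes only the first matching child.
        child = self._elem.find(name)
        if child is None:
            raise AttributeError(name)
        self._elem.remove(child)

    def __delitem__(self, index):
        # `del node.b[i]` removes the i-th sibling with the same tag;
        # an out-of-range index raises IndexError from the list lookup.
        siblings = self._parent.findall(self._elem.tag)
        self._parent.remove(siblings[index])

def bind(xml_text):
    """Return (doc, root) so that doc.<root tag> reaches the root element."""
    root = ET.fromstring(xml_text)
    holder = ET.Element("holder")
    holder.append(root)
    return Node(holder), root

doc, root = bind("<a><b>spam</b><b>eggs</b></a>")
del doc.a.b
print(ET.tostring(root, encoding="unicode"))
# doc.a.b still resolves afterwards: it now names the remaining <b>.
print(doc.a.b._elem.text)
```

Because attribute access always resolves to the first remaining match, deleting one of several same-named elements leaves the name itself usable.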

Of course there are oddities to go with the new convenience. Check out the following:

>>> from amara import binderytools
>>> doc = binderytools.bind_string("<a><b>spam</b><b>spam</b></a>")
>>> unicode(doc.a.b)
u'spam'
>>> doc.a.b
<amara.bindery.b object at 0x685b2c>
>>> del doc.a.b
>>> unicode(doc.a.b)
u'spam'
>>> #Eh?  Still there, are ye?
...
>>> doc.a.b
<amara.bindery.b object at 0x685b8c>

Perfectly consistent with what the users seem to be saying, I think, but I'll be amazed if this doesn't trip up the odd fellow.


Elements versus attributes in Amara

In the previous entry I discussed changes to Amara's mutation API. In the original discussion one of the things that came up was the old element/attribute conundrum. Take the following document:

Users like to be able to access both elements and attributes using friendly Python idiom, but here we have a name clash on the resulting a object. Right now Amara exposes the attribute as a.b and the element as a.b_, using name mangling to disambiguate.
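
For comparison, here is how the same kind of clash looks in the standard library's ElementTree, which keeps attributes and child elements behind separate calls so no mangling is needed. The sample document is hypothetical: an element a carrying both an attribute named b and a child element named b.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a same-named attribute and child element.
a = ET.fromstring('<a b="x"><b>y</b></a>')

attribute_value = a.get("b")      # attribute lookup
element_text = a.find("b").text   # child element lookup
print(attribute_value, element_text)
```

The cost of keeping the two apart is that neither lookup reads like plain attribute access, which is the convenience Amara's a.b and a.b_ are trying to preserve.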

The important thing to remember, however, is that such clashes are quite rare in practice, even when you throw in namespaces, so such mangling is rarely necessary, and I personally think Amara's current behavior makes sense. But I may just have a blind spot, so I've been paying attention to suggestions from others.

Jeremy Kloth suggested just always using different idioms: a[u"b"] for the attribute and a.b for the element. This is not a bad idea, but I feel that, given that clashes are rare, it complicates the common case just to aid the rare case.

Luis Miguel Morillas had an idea I consider almost the opposite. Rather than completely separate element/attribute idioms, Luis suggests embracing how Amara has unified them. Right now Amara rolls up multiple elements of the same name in a convenient way:

<a><b>y</b><b>z</b></a>

This works such that a.b or a.b[0] yields the element with y and a.b[1] yields the element with z. Luis thinks that the following case should just be an extension of this:

<a b="x"><b>y</b><b>z</b></a>

And then a.b or a.b[0] would yield the attribute value (u"x"), a.b[1] would yield the element with y, and a.b[2] would yield the element with z. I kinda think of this idea as "so crazy it almost makes perfect sense", but it's way too big a change to introduce before Amara 1.0. I'd be curious to hear what others think of it. Luis actually brings it up in the context of mutation--see his original post (scroll to the bottom)--but I figure that the mutation API will follow naturally from the access API, so I'm focusing my thoughts a bit.
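
The indexing Luis proposes can be sketched as a small helper over the standard library's ElementTree. This only illustrates the proposal as described above; it is not something Amara implements, and the unified function is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def unified(elem, name):
    """Return the attribute value (if any) followed by child elements named `name`."""
    items = []
    if name in elem.attrib:
        items.append(elem.attrib[name])   # index 0: the attribute value
    items.extend(elem.findall(name))      # then the elements, in document order
    return items

a = ET.fromstring('<a b="x"><b>y</b><b>z</b></a>')
b = unified(a, "b")
print(b[0], b[1].text, b[2].text)
```

One consequence visible even in this sketch is that the sequence mixes types (a string followed by elements), and that the index of an element shifts depending on whether the attribute is present.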


Amara gets friendlier mutation

Tom Lazar asked for a friendlier idiom for mutating elements in Amara. I was reluctant at first because the simpler-on-the-surface idioms he wanted would require rather untidy idioms in the code. I relented to the argument that user convenience comes even before clean code. I finally got around to making and committing the changes today. I'd planned to release Amara 1.0b2 as soon as I'd made these changes, and the timing seems perfect since we've just released 4Suite 1.0b1, but the changes are intrusive enough that I think I'll give folks a chance to try things out from CVS and first see whether it craters for anyone. Please give it a go and give me feedback here or on the mailing list. Thanks.

Here are use cases illustrating the new idioms for Amara. I have added them to the test file mutation.py:

Use case 1:

Source doc: <a><b>spam</b></a>
Code: doc.a.b = u"eggs"
Result: doc mutated to <a><b>eggs</b></a>

Use case 2:

Source doc: <a><b>spam</b></a>
Code: doc.a.b[0] = u"eggs"
Result: doc mutated to <a><b>eggs</b></a>

Use case 3:

Source doc:
Code: doc.a.b = u"eggs"
Result: doc mutated to

Use case 4:

Source doc: <a><b>spam</b><b>spam</b></a>
Code: doc.a.b = u"eggs"
Result: doc mutated to <a><b>eggs</b><b>spam</b></a>

Use case 5:

Source doc: <a><b>spam</b><b>spam</b></a>
Code: doc.a.b[0] = u"eggs"
Result: doc mutated to <a><b>eggs</b><b>spam</b></a>

Use case 5:

Source doc: <a><b>spam</b><b>spam</b></a>
Code: doc.a.b[1] = u"eggs"
Result: doc mutated to <a><b>spam</b><b>eggs</b></a>

Use case 6:

Source doc: <a><b>spam</b><b>spam</b></a>
Code: doc.a.b[2] = u"eggs"
Result: IndexError

Use case 7:

Source doc: spam
Code: doc.a.b = u"eggs"
Result: doc mutated to spam

Note: attributes take precedence over same name elements in binding. See next use case.

Use case 8:

Source doc: spam
Code: doc.a.b_ = u"eggs"
Result: doc mutated to eggs
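
The assignment idioms above can be sketched with a toy wrapper over the standard library's ElementTree. This is not Amara's implementation: the Node class and bind helper are hypothetical, the attribute value "x" in the sample is made up, and the toy only covers assigning to names that already exist.

```python
import xml.etree.ElementTree as ET

class Node(object):
    """Toy binding over ElementTree; an illustration, not Amara's code."""

    def __init__(self, elem, parent=None):
        object.__setattr__(self, "_elem", elem)
        object.__setattr__(self, "_parent", parent)

    def _child(self, name):
        child = self._elem.find(name)
        if child is None:
            raise AttributeError(name)
        return child

    def __getattr__(self, name):
        # Resolve to the first child element with this name.
        return Node(self._child(name), self._elem)

    def __setattr__(self, name, value):
        # An XML attribute wins over a same-named child element;
        # a trailing underscore (b_) explicitly selects the element.
        if name in self._elem.attrib:
            self._elem.set(name, value)
        else:
            self._child(name.rstrip("_")).text = value

    def __setitem__(self, index, value):
        # node.b[i] = value targets the i-th sibling with the same tag;
        # an out-of-range index raises IndexError.
        self._parent.findall(self._elem.tag)[index].text = value

def bind(xml_text):
    """Return (doc, root) so that doc.<root tag> reaches the root element."""
    root = ET.fromstring(xml_text)
    holder = ET.Element("holder")
    holder.append(root)
    return Node(holder), root

doc, root = bind("<a><b>spam</b><b>spam</b></a>")
doc.a.b[1] = "eggs"
print(ET.tostring(root, encoding="unicode"))

doc2, root2 = bind('<a b="x"><b>spam</b></a>')  # attribute value is made up
doc2.a.b = "eggs"   # the attribute takes precedence
doc2.a.b_ = "ham"   # the mangled name reaches the element
print(ET.tostring(root2, encoding="unicode"))
```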

In a follow-up entry I'll talk about some other suggestions I've received on this matter.


Installing 4Suite 1.0b1 as non-root

Update: How could I have forgotten --enable-unicode=ucs4 in the Python build instructions?

Just gathering up some details on how to install 4Suite as non-root (i.e. in a user's home directory). This is based on experience installing on Red Hat and Fedora Core, but should work for most POSIX environments.

If you don't have Python installed (or want your own copy):

Grab Python-2.3.x.tgz or Python-2.4.x.tgz and unpack:

tar zxvf ~/dl/Python-2.3.5.tgz
cd Python-2.3.5/
./configure --prefix=$HOME/lib --enable-unicode=ucs4

Pick whatever prefix works for you. --enable-unicode=ucs4 is essential IMO if you're doing XML processing.

make && make install
ln -s $HOME/lib/bin/python $HOME/bin

The last step is to put the Python exe you just built into your $PATH, presumably before any other Python exe in the system.
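
A quick way to confirm which interpreter is first on the $PATH and whether it is a wide-Unicode build is to check sys.maxunicode. On Python 2.x this is 0xFFFF for a narrow (UCS2) build and 0x10FFFF for a UCS4 build; Python 3.3 and later always report 0x10FFFF. The helper name below is just for illustration.

```python
import sys

def is_wide_unicode_build():
    """True when the interpreter represents code points beyond U+FFFF natively."""
    return sys.maxunicode > 0xFFFF

print(sys.executable)
print(hex(sys.maxunicode), is_wide_unicode_build())
```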

Now for 4Suite

Grab 4Suite 1.0b1

cd $DOWNLOADS
tar zxvf 4Suite-1.0b1.tar.gz
cd 4Suite-1.0b1
python setup.py config --prefix=$HOME/lib
python setup.py install

Notice the extra "setup.py config" step. This is the key to the whole thing. The "setup.py config" sets the location for all the files installed by 4Suite except for the Python library files, which are installed to the location determined by the Python executable used to invoke the setup script. For more on where 4Suite puts things, see Mike Brown's excellent document "4Suite Installation Locations".
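
Since the Python library files follow the interpreter rather than the configured prefix, it can help to ask the interpreter where it installs packages. The sketch below uses the sysconfig module from more recent Python versions, which is an assumption relative to the Python 2.3/2.4 discussed here; on those versions distutils.sysconfig.get_python_lib() gave the equivalent answer. The install_locations helper is a hypothetical name.

```python
import sys
import sysconfig

def install_locations():
    """Report which interpreter is running and where it installs pure-Python libraries."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "purelib": sysconfig.get_paths()["purelib"],
    }

for key, value in sorted(install_locations().items()):
    print(key, "=", value)
```

If the reported prefix is not under your home directory, the setup script is being run by a system Python rather than the one you built.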

There is also a --home option to setup.py config, but do not use this unless you really know what you're doing. Stick to --prefix.

Finally, you may want to link all the 4Suite commands into your home's bin directory:

ln -s $HOME/lib/bin/4* $HOME/bin

Now you can run the tests.

cd $HOME/lib/lib/4Suite

Remember that this is beta software, and some test failures are to be expected (heck, I'd be amazed if there weren't some test failures with the full 1.0 release).


4Suite 1.0b1

The announcement

Yaaaaay! This is nominally the feature freeze for the way, way overdue 4Suite 1.0. Let's hope this freeze accelerates progress towards full 1.0. Thanks to all our patient users. The focus of this release is probably performance. Jeremy Kloth, one of the best programming minds I've encountered, threw himself into the challenge of squeezing waste out of Domlette, without losing its great functional benefits. Some of the resulting gains are amazing. There are a lot of other fixes and enhancements, and I think it's a very solid release.

My next step is to release Amara 1.0b2 and kick off a branch to make better use of some of Jeremy's enhancements, including a super-efficient mini-SAX for Domlette.

4Suite home page


4Suite for RDF

RDF hacking for fun and profit -- Bill de hÓra

"I find 4Suite to be stable software (tho' I'm not sure the RDF stuff is active anymore"

The main limitation with RDF in 4Suite is that it has not been tracking the latest specs. This sucks, but it reflects the reality of "it works, and grand updates don't scratch anyone's itch". 4Suite's RDF library is actually very stable, and has been accumulating bug fixes, performance fixes and new drivers.

I'd say that 4RDF is fine if you don't need all the nuances of the new specs (which are modest enough). It is heavily used, which is one nice test of its suitability. We do have grand post-1.0 plans, but they are not yet set in stone. My guess is the following:

  • We'll abandon our own parser for rdflib. That parser is SAX-based, has been tracking the latest specs, and is very well tested. This is actually something we and the rdflib folks have been discussing near forever. We just haven't got around to the actual work (itch scratching need and all that).
  • We'll make the low-level API more Pythonesque. Developments such as iterators and generators have come since the original 4RDF effort, and we want to put them to good use.
  • We'll work in a Versa 2.0 (RDF query language). SPARQL is not doing it for me, and for a lot of my colleagues and correspondents. OK. I'll be blunt. I think SPARQL sucks, and I'm likely to support W3C XML Schema before I support it (hint: earthworms will fly of their own locomotion before either event).
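
As a rough picture of what a more Pythonesque, generator-based low-level API could look like, here is a minimal sketch. Everything in it is hypothetical (the TripleStore class, its method names, and the example URIs); it is not 4RDF's API or a committed design.

```python
class TripleStore(object):
    """Hypothetical in-memory store; names are illustrative, not 4RDF's API."""

    def __init__(self):
        self._triples = []

    def add(self, subject, predicate, obj):
        self._triples.append((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Lazily yield matching triples; None is a wildcard for that position."""
        pattern = (subject, predicate, obj)
        for triple in self._triples:
            if all(p is None or p == t for p, t in zip(pattern, triple)):
                yield triple

store = TripleStore()
store.add("urn:x-example:spam", "urn:x-example:likes", "urn:x-example:eggs")
store.add("urn:x-example:spam", "urn:x-example:name", "Spam")
store.add("urn:x-example:eggs", "urn:x-example:name", "Eggs")

# The generator is consumed on demand, so callers can stop early.
names = [o for (_, _, o) in store.match(predicate="urn:x-example:name")]
print(names)
```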

"Uche et al have been working on anobind most recently)..."

Well, that's just me, no et alii so far. And Anobind is no more. It has been absorbed into Amara XML Toolkit. I'm developing Amara in order to complement 4Suite, not to supplant it in any way. It's an add-on to 4Suite that gives Pythoneers the super-friendly idioms they like. I still put into 4Suite about as much effort as I do Amara.

One shouldn't make any assumptions on 4Suite development based on Amara.
