Use Amara to parse/process (almost) any HTML

It's always nice when a client obligation indirectly feeds a FOSS project. I recently needed to do some Web scraping while doing the day job thingie. As with most people who do what I do these days, it's a common task, and I'd already done some exploring of the Python tool base for this in "Wrestling HTML". In that article I touched on tidy and its best known Python wrapper, uTidyLib. One can use these to turn zany HTML into fairly clean XHTML. In the most recent task, however, I had a lot of complex processing to do with the resulting pages, and I really wanted the flexibility of Amara Bindery, so I cooked up some code (much simpler than I'd expected) to use the command-line tidy program to turn arbitrary Web pages into XHTML in the form of Amara bindery objects.

I just checked this code in as an Amara demo, tidy.py. As an example of its usage, here is a Python script I wrote to list all the mp3 links from a given Web page (for easy download with wget):

from tidy import tidy_bind_url #needs tidy.py Amara demo
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
doc = tidy_bind_url(url)
#Display all links to mp3s (by file extension check)
for link in doc.xml_xpath(u'//html:a[@href]'):
    if link.href.endswith(u'.mp3'):
        print link.href

The handy thing about Amara, even in this simple example, is how I was able to take advantage of the full power of XPath for the basic query, and then shunt in Python where XPath falls short (there's a starts-with function in XPath 1.0 but for some reason no ends-with). See tidy.py for more sample code.
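For readers without Amara to hand, the same division of labor (a path expression for the basic query, Python for the suffix test) can be roughly sketched with only the standard library. This is not the tidy.py demo: it is written for Python 3, the function name mp3_links and the sample document are made up for illustration, and xml.etree.ElementTree supports only a small subset of XPath.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def mp3_links(xhtml_text):
    """Return the href of every XHTML <a> element whose href ends in .mp3."""
    root = ET.fromstring(xhtml_text)
    # The path expression does the structural query: all a elements with an href
    links = root.findall(".//{%s}a[@href]" % XHTML_NS)
    # Python does the suffix test that XPath 1.0 has no function for
    return [a.get("href") for a in links if a.get("href").endswith(".mp3")]

# Made-up sample document, standing in for tidy's XHTML output
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="http://example.org/one.mp3">one</a>
    <a href="http://example.org/page.html">page</a>
    <a name="anchor-only">no href</a>
    <a href="two.mp3">two</a>
  </body>
</html>"""

for href in mp3_links(SAMPLE):
    print(href)
```

In pure XPath 1.0 the suffix test can be emulated with substring() and string-length(), for example substring(@href, string-length(@href) - 3) = '.mp3', but the Python check is easier to read.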

Tidy does choke on some abjectly broken HTML pages, but it has done the trick for me 90% of the time.
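As a rough illustration of the shell-out approach (again, not the actual tidy.py code), here is a minimal Python 3 sketch that pipes HTML through the command-line tidy program. It assumes a tidy executable is on the PATH; the function names are made up for this example. The exit status check reflects tidy's convention of 0 for clean input, 1 for warnings and 2 for errors.

```python
import subprocess

def tidy_command(tidy_path="tidy"):
    """Build the tidy command line (function name is illustrative only)."""
    return [
        tidy_path,
        "-asxhtml",        # convert HTML to well-formed XHTML
        "-numeric",        # numeric rather than named entities, friendlier to XML parsers
        "-quiet",          # suppress nonessential output
        "-utf8",           # treat input and output as UTF-8
        "--force-output", "yes",  # still emit output when tidy reports errors
    ]

def tidy_to_xhtml(html_bytes, tidy_path="tidy"):
    """Pipe HTML bytes through tidy and return the XHTML bytes it writes."""
    proc = subprocess.run(
        tidy_command(tidy_path),
        input=html_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Exit status 2 means errors; give up only if nothing usable came out
    if proc.returncode > 1 and not proc.stdout:
        raise ValueError("tidy could not repair the page: %r" % proc.stderr[:200])
    return proc.stdout
```

The returned bytes can then be handed to whatever XML tool you prefer. Pages that tidy cannot repair at all surface as a ValueError rather than as silently empty output.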

Meanwhile, I've been meaning to release Amara 1.0. I haven't needed to make many changes since the most recent beta, and it's pretty much ready (and I need to get on to some cool new stuff in a 2.0 branch). A heavy workload has held me up, but perhaps this weekend.

[Uche Ogbuji]

via Copia

Thinking XML #32: Schema annotation for bottom-up semantic transparency:

"Thinking XML: Schema annotation for bottom-up semantic transparency"

Subtitle: Pushing schemata beyond syntax into semantics
Synopsis: Learn more about the different approaches to semantic transparency as Uche Ogbuji discusses what they mean to developers using XML. Whether or not you reuse schemata, you might find it valuable to use formal annotations (as opposed to the informal annotations covered earlier). You gain benefits on several levels by doing so. On the most immediately practical level, you can generate better documentation. A more far-sighted benefit is that it gives you an important measure of semantic transparency. This installment discusses semantic anchors, and gives examples. The author also takes a moment to discuss The XTech Conference 2005.

This is the third part of a mini-series within the column. Previous articles are "State of the art in XML modeling" and "Schema standardization for top-down semantic transparency". In this article I discuss formal schema annotations, the most important tool available for semantic transparency. I started off my exploration of the technique in "Use data dictionary links for XML and Web services schemata". I mentioned why I think schema annotations are so important even in rough and ready use of XML in my discussion of XOXO.

See other articles in the column. Comments here on Copia or on the column's official discussion forum. Next up in Thinking XML, back to Python + WordNet.

[Uche Ogbuji]


Python community: Transolution, py lib, encutils, pyxsldoc, PDIS and Picket

In a comment on "We need more solid guidelines for i18n in OSS projects", Fredrik Corneliusson mentioned Transolution, "[a translation] suite project I'm working on [with] a XLIFF editor and a Translation Memory. It's written in Python and we need all the help and testers we can get." I browsed the project site, and it seems to me quite comprehensive and well thought-out. It's heavy on XLIFF, which is pretty heavy stuff in itself, but it does link to projects that allow exchange between .po and XLIFF files. It's certainly great to see Python at the vanguard of XML-based i18n.

I found a couple of new tools from the py lib via Grig Gheorghiu's Weblog entry "py lib gems: greenlets and py.xml", which is good reading for those interested in Python and XML. The py lib is just a bundle of useful Python library add-ons. Grig mentioned a sample for one of the modules, Armin Rigo's greenlets. Greenlets are a "spin-off" from Stackless Python, and thus provide some very interesting code that manipulates Python flow of control in order to support lightweight concurrency (microthreads), and also proper coroutines. I've already been pecking about the edges of what's possible with semi-coroutines, and it has always been clear to me that what Python needs in order to really bring streaming XML processing to life is full coroutine support (which seems to be on the way for Python 2.5). While we wait for full coroutines in Python, Armin gets us started with a greenlets demo that turns PyExpat's callback model into a generator that returns elements as they're parsed. Grig posts this snippet as "iterxml.py" in his entry.
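To make the callback-to-generator idea concrete without greenlets, here is a rough Python 3 sketch that uses a plainly different technique: it feeds expat one chunk at a time and yields whatever events the callbacks buffered during that chunk. It is not Armin's iterxml.py, and the name iter_start_tags is made up for this example. Unlike a true coroutine, it only hands control back between chunks, not in the middle of a callback.

```python
import io
from collections import deque
from xml.parsers import expat

def iter_start_tags(stream, chunk_size=4096):
    """Yield (name, attrs) for each start tag as the binary stream is parsed."""
    pending = deque()
    parser = expat.ParserCreate()
    # The callback only buffers the event; the generator hands it out later
    parser.StartElementHandler = lambda name, attrs: pending.append((name, attrs))
    done = False
    while not done:
        chunk = stream.read(chunk_size)
        done = not chunk
        # An empty final chunk tells expat that the document is complete
        parser.Parse(chunk, done)
        while pending:
            yield pending.popleft()

# Tiny chunk size so that events arrive across several Parse() calls
doc = io.BytesIO(b"<a><b x='1'/><c/></a>")
for name, attrs in iter_start_tags(doc, chunk_size=4):
    print(name, attrs)
```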

Grig also touches on Holger Krekel's py.xml, a module for generating XML (and HTML). py.xml is not unlike JAXML, which I covered in "Three More For XML Output". These all seem to project back to Greg Stein's early proposal for an XML generation tool that is as ingrown as possible into Python's syntax, and are certainly worth a look.

Sylvain Hellegouarch updated Picket, a simple CherryPy filter for processing XSLT as a template language. It uses 4Suite to do the job. This update accommodates changes in what has recently been announced as CherryPy 2.1 beta. A CherryPy "filter is an object that has a chance to work on a request as it goes through the usual CherryPy processing chain."

Christof Hoeke has been busy lately. He has developed encutils for Python 0.2, which is a library for dealing with the encodings of files obtained over HTTP, including XML files. He does not yet implement an algorithm for sniffing an XML encoding from its declaration, but I expect he should be able to add this easily enough using the well-known algorithms for this task (notably the one described by John Cowan), which are the basis for this older Python cookbook recipe by Paul Prescod and this newer recipe by Lars Tiede. Christof also released pyxsldoc 0.69, "an application to produce documentation for XSLT files in XHTML format, similar to what javadoc does for Java files." See the announcements for encutils and pyxsldoc.
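For the curious, a deliberately simplified Python 3 sketch of declaration sniffing follows. It is not encutils code and not a full implementation of the detection procedure in the XML specification: it checks for a byte order mark, then looks for an encoding pseudo-attribute in the XML declaration, and otherwise assumes UTF-8. It only handles declarations written in an ASCII-compatible encoding, and the function name is invented for this example.

```python
import codecs
import re

# UTF-32 BOMs are checked first because the UTF-32-LE BOM begins with the UTF-16-LE BOM
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the very start
_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(data):
    """Guess the encoding of XML bytes (simplified, illustrative only)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declared encoding: assume UTF-8
    return "utf-8"

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```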

I recently discovered Ken Rimey's Personal Distributed Information Store (PDIS), which includes some XML tools for Nokia's Series 60 phones, which offer Python support. This includes an XML parser based on PyExpat and an XPath implementation based on elementtree.

[Uche Ogbuji]


Quotīdiē

Old pirates, yes, they rob I,
Sold I to the merchant ships,
Minutes after they took I
From the bottomless pit.
But my hand was made strong
By the hand of the almighty.
We forward in this generation
Triumphantly.
Won't you help to sing
These songs of freedom?
'Cause all I ever have—
Redemption songs.

Ms. Dynamite—from her Live8 cover of Bob Marley's "Redemption Song"

I've been hearing a lot about Ms. Dynamite's performance at Live8. Numerous attendees have rhapsodized over the power of her "Redemption Song" cover. Even commentators who had already blasted her for being the token black Live8 performer seemed to soften their tone when talking about her actual contribution. And BTW, yes, although I have plenty of beef with Live8, as I had to express to a friend lately, that does not mean that I've ever felt it necessary as a result to denigrate everyone who supported Live8. I'll leave the indiscriminate spray of spleen to others. Anyway did people really expect anything pedestrian from the wicked brilliant Ms. Dynamite? From the woman who can chat in rapid syncopated fire like a semi-automatic gun, and then sing as engagingly as a Savannah weaver bird? Once I heard that she covered Bob Marley's wonderful song, I knew I had to hear for myself.

The first time I heard Ms. Dynamite was when she set fire 'pon Sticky's UK Garage club anthem, "Booo!", which soon became an Ogbuji household anthem. Next I heard her shred the So Solid track "Envy (They don't know)" (which couldn't become an anthem at our house because Lori unfortunately hates The So Solid Crew). So we were mad ready when she dropped A Little Deeper ("It takes more" and "Dynamite" from the single had already taken their turn as household anthems). But never mind my family's endorsement, let's hear from Ali G:

Next up is MC Dynamite, who is me favorite Garage MC with his or her track called "Dynamite". That is a wicked name for the track and me swear this track is just like Dynamite, because it's going to explode like a massive bit of dynamite. And like this kind of record, dynamite can make a lot of mess and proper mash things up, just like Dynamite can. Oh yeah, this track can also blow up like dynamite. Sure this track ain't red, and don't come in boxes with the name "dynamite" on them, but this tune is also on fire, just like Dynamite, innit? This is also a banging tune, and dynamite goes "bang" when it come out of the box, doesn't it?

OK. Enough with the Sacha Cohen. I hunted down the Live8 performance, first finding an AOL/Netscape widget site that offered Live8 videos but refused to work with Firefox. I did eventually find a collection of Live8 mp3s, including this "Redemption Song" Live8 clip. I also got the concert version of "Dy-na-mi-tee", another favorite, a really sweet old-school romp (old school beat, old school sentiment, etc.) through her airy brand of nostalgia. I must say it sounded a bit muddled and rushed at Live8, which I can understand from what I heard of the logistical difficulties of cramming so many acts together in such an unforgiving schedule. She did add bongos to the background, which I think is a nice touch. Sounds as if it would have made a nice studio remix, but she's on to her next project, I understand. Hells yeah. I'm all about a new Ms. Dynamite album (can't find any solid links yet, just the rumors of a new album).

One note of interest, some cat I don't think I've heard before performed a rap at the end of "Redemption Song". The lyrics are fairly insightful, with just a couple of WTF bits.

What's going on, nothing's changed, we're still exploiting the poor
Slavery never ends, yo it just changed wars
AIDS and free trade decimating the young
Famine everywhere but why never a shortage of guns?
Conflict, duel all over the globe instigated by our leaders
War in the Motherland but no African arms dealers
The West robbed the third world of every single cent
Now there's Third World debt. How does that make sense?

The last two lines do smack it all home, on the real, although I think we need to get past all that. Africans will get theirs back from the West, over time. Demographic power and all that. The more immediate concern is Africa's independent economic development.

I do still say: Live8 in London, eh? No Roots Manuva, eh? No Ty? No Klashnekoff? No Est'elle? No Blak Twang? Heck, not even Dizzee Rascal? Somebody didn't do their Supreme Mathematics, son.

But at least they got some Dynamite, and we got a reminder that Bob Marley's song is a superlative testament to the emotive and universal power of music.

And hey. Yay! I scrounged out a few minutes for a Quotīdiē. Chicken noodle soup for the overworked soul.

[Uche Ogbuji]


RDF IRC Agent - Emeka

I've recently been working on an IRC bot (based on Sean Palmer's phenny) called Emeka which is meant as a tool for demonstrating Versa and other related RDF facilities. Currently, it supports two functions:

  • .query <arbitrary URI> " .. Versa Query .. "
  • .query-deep <arbitrary URI> steps " .. Versa Query .. "

The first causes Emeka to parse the RDF document (associated with the given URI) as RDF/XML and then as N3 (if the first attempt fails). He then evaluates the submitted Versa Query against the Graph and responds with a result. The second function does the same with the exception that it recursively loads RDF graphs (following rdfs:seeAlso statements) N times, where N is the second argument. This is useful for extracting FOAF communities from a starting document (which was my original motivation for this).
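The recursive loading behind .query-deep can be sketched roughly as follows. This is not Emeka's code: the function name load_deep and the load callback (which stands in for the parse-as-RDF/XML-then-N3 step) are hypothetical, triples are plain Python tuples, and it assumes that N counts the number of rdfs:seeAlso hops followed from the starting document.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

def load_deep(start_uri, steps, load):
    """Collect triples from start_uri, following rdfs:seeAlso up to `steps` hops.

    `load` is a hypothetical callback mapping a URI to an iterable of
    (subject, predicate, object) tuples.
    """
    graph = set()
    seen = set()
    frontier = [start_uri]
    for _ in range(steps + 1):
        next_frontier = []
        for uri in frontier:
            if uri in seen:
                continue  # never fetch the same document twice
            seen.add(uri)
            triples = list(load(uri))
            graph.update(triples)
            # Objects of rdfs:seeAlso statements are the next documents to load
            next_frontier.extend(o for (s, p, o) in triples if p == RDFS_SEE_ALSO)
        frontier = next_frontier
    return graph

# Made-up in-memory "Web" of three documents chained by rdfs:seeAlso
DOCS = {
    "a": [("a", RDFS_SEE_ALSO, "b"), ("a", "name", "A")],
    "b": [("b", RDFS_SEE_ALSO, "c"), ("b", "name", "B")],
    "c": [("c", "name", "C")],
}

def fake_load(uri):
    return DOCS.get(uri, [])

print(len(load_deep("a", 1, fake_load)))
```

Tracking the URIs already seen keeps circular seeAlso links, which are common in FOAF data, from causing repeated fetches.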

By default Emeka has the following namespace prefixes bound:

daml,rdf,rdfs,owl,xsd,log (n3's log), dc,rss,foaf

Emeka is a work in progress and is currently idling in #swhack and #4suite as we speak (and in #foaf and #swig eventually). Some ideas for other services available from this bot:

  • augmenting its default namespace mapping
  • stored queries (for example: query for retrieving the latest rss:item in a feed)
  • Rule invocation (through FuXi's prospective-query function)
  • Interactive question and example demonstration of Versa function(s)
  • More sophisticated interaction with Del.icio.us RSS feeds (for web page cataloging)

Other suggestions are welcome.

See #swhack logs for an example.

Chimezie Ogbuji


Transhumanism

I just recently read a very insightful article on the philosophy of Transhumanism. Kuro5hin had never been on my blog radar, but it will be from now on, as I find the posts there insightful and well developed. The article is an interesting take on the role of humans in civilized society that criticizes our complete dependency on governing bodies to fulfill our basic needs in life. I would even extend that criticism to include the various social customs, rules, and regulations (emphasized in some cultures and not in others) that redirect us from discovering our basic, collective needs, most of which are almost always completely removed from the "superficial demands created by modern civilization," and which would make much simpler metrics of our progress.

I recognize the motivation behind such a philosophy, but I disagree with the suggested solution: whole-hearted embrace of technological advancement. I think that for humans in a society that seems increasingly more like organized chaos than ever before, the proper remedy (from an anthropological perspective) is simplifying our goals, needs, perceptions, motivations, and interactions. When you get down to the essence of the human condition, our needs have never really advanced beyond Maslow's hierarchy of human needs (and probably never will), and the frantic, anxious nature of how the average person (especially those in metropolitan societies) lives life is evidence of a disconnect from a simpler, more fundamental lifestyle.

[Uche Ogbuji]


Quotīdiē

I think [events such as Live8] are effective at mustering attention and getting people thinking about things. What I find frustrating as someone who has written about Africa now for nearly 20 years is that the message becomes so simplified, and it's distorted in the process. I find horrible in G8, Africa Commission, the Live8 this sort of patronizing sense that "we can deliver recovery to Africa. It's in our hands. It's in our control. We the generous well-meaning West are going to deliver recovery to Africa." Things are never that simple. There's the whole issue of governance, leadership, corruption, the whole issue of countries that want to go to war. In Eritrea and Ethiopia, we have two countries for example that are still re-arming in preparation for a future war. Where does what we decide in the G8 affect that? This is not all in our remit. In my own guts, in my heart I believe that Africa's recovery will come from Africa. It will come from the young Africans I meet when I go there, who are educated, who are motivated, who know exactly what they want to do. They want to run their small businesses, they all have three mobile phones each and are extremely clear in their thinking. They don't want charity, they don't want help, they just want to be allowed to run their own businesses. I think those people are going to build their future, I don't think it's going to come from the West. I think there are things we have to do out of sheer human decency, and the trade issues come in here, but I don't think we can deliver salvation. We are not the cavalry.

Michela Wrong on NPR's Fresh Air

I heard this story last week, but it's been a hectic couple of weeks, and I've only now had a chance to comment on it. The 35 minute segment is very interesting overall, focusing on Eritrea and the fascinating, sad story of that country's abuse by colonialism and Cold War neo-colonialism. Near the end (minute 26 or so) she had the above absolute gem to offer on the general issue of today's hype over aid to Africa.

Hostess Terry Gross's question was:

Do you think mega-concerts like Live8 and its predecessor LiveAid are useful in calling attention to the issues in Africa?

And as you can read, Michela completely nailed what I and some other colleagues have to say about these matters.

She follows up with another interesting statement:

I think that debt relief comes into this, but I'm not one of those people who think you just deliver unconditional debt relief. There are countries whose dictators, for example Mobutu, whom I've written a lot about, just racked up these unspeakable debts, and it was outrageous that people ever lent money to people like Mobutu, what were they thinking of? This man was so manifestly corrupt and everybody knew what he was spending his money on. There is the issue of odious debts, but I think we have to be a little realistic and critical. I worked for a magazine that was talking about debt relief in Angola, and I felt, if you have manifestly corrupt government in places such as Angola that are brimming in diamonds and oil, is it for us to write off their debt? This is a government that has repeatedly shown that it don't give a damn about the population and is quite happy to let poverty levels, AIDS levels, education health go through the floor. Is it really for us to save Angola? I think it's time to get a little more realistic and tough talking with some of these horrible regimes that still exist in Africa. One of my main criticisms of the African Commission is that it keeps talking about this new leadership that's emerging in Africa, and I'd like to know which leaders they're talking about? Which ones in particular, because I don't see those leaders.

I think this is interesting. I think that to some extent "odious" describes most of the debt owed to the West by Africa, whether or not it was lent to corrupt governments. As such, I do think that there is an element of moral obligation in debt relief, but it's clear that it is a dangerous distraction from the real engine of development, the professionals Michela mentions.

And this is as good a time as any to mention that even though I sometimes lump my fellow native African professionals in diaspora with our colleagues based on the continent, this is a false parity. The latter group is so much more important in the grand scheme of African development, and I get the sense, which Michela also puts across nicely in her quote, that they will soon be impossible to ignore, much as their Indian and Chinese counterparts before them.

It seems I'll be having a go at Michela's books.

[Uche Ogbuji]


Just when you thought the American press couldn't get any more stupid...

"Newspapers warn of threat to America from 'Londonistan'"

Someone forgot to re-read the US constitution this past July 4th.

Someone developed a lacuna where the words of another Founding Father should have been:

Those willing to give up a little liberty for a little security deserve neither security nor liberty.

—Benjamin Franklin.

Someone decided that because right wing reagent media is poisonous, intimidating and loud, it must be worthy of emulation.

Someone needs to be told in no uncertain terms that London was brave, sensible, dignified and just in her acceptance of a cosmopolitan society before the bombings, brave, sensible, dignified and just in her conduct during the bombings, and has shown very little dimming of that bravery, good sense, dignity and justice after the bombings. There are a lot of Americans who can learn lessons from this fact, if they can hold their ridicule long enough to engage their own brain cells.

London did not throw out all its Irish residents during the IRA bombing campaign. They survived that campaign, and emerged a stronger city. Never forget that.

P.S. Also worth a read: "Bugged by the Brits"

[Uche Ogbuji]


FuXi: FOAFed and DOAPed

I just upgraded and repackaged FuXi (v0.7): added some extra prints in the Versa extension function, added a 'test' directory to the source tree with an appropriate example of how FuXi could be used to make a prospective query against OWL rules and a source graph, created a DOAP instance for FuXi, a FOAF instance for myself, created a permanent home for FuXi, and added FuXi to the SemanticWebDOAPBulletinBoard Wiki. This was primarily motivated by Libby's post on Semantic Web Applications and Demos. I thought it would be the perfect forum for FuXi. I probably need to move it into a CVS repository when I can find time.

Below is the output of running the test case:

loaded model with owl/rdfs minimal rules and initial test facts
executing Versa query:
prospective-query('urn:uuid:MinimalRules',
                               'urn:uuid:InitialFacts',
                               'distribute(list(test:Lion,test:Don_Giovanni),\'. - rdf:type -> *\')',
                               'urn:uuid:InitialFacts')
extracted 35 rules from urn:uuid:MinimalRules
extracted 16 facts from source model (scoped by urn:uuid:InitialFacts) into interpreter. Executing...
inferred 15 statements in 0.0526609420776 seconds
time to add inferred statements (to scope: urn:uuid:InitialFacts): 0.000159025192261
compiled prospective query, executing (over scope: urn:uuid:InitialFacts)
time to execute prospective query:  0.0024938583374
time to remove inferred statements:  0.0132219791412
[[[u'http://copia.ogbuji.net/files/FuXi/test/Animal',
   u'http://copia.ogbuji.net/files/FuXi/test/LivingBeing']],
 [[u'http://copia.ogbuji.net/files/FuXi/test/DaPonteOperaOfMozart']]]

This particular test extracts inferred statements from an initial graph using a functional subset of the original owl-rules.n3 and executes the Versa query (which essentially asks: what classes do the resources test:Lion and test:Don_Giovanni belong to?) to demonstrate OWL reasoning (class membership extension by owl:oneOf and owl:unionOf, specifically).
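The shape of a prospective query, as the log above suggests (infer, add the inferred statements to the scope, run the query, remove them again), can be illustrated with a toy Python 3 sketch. This is not FuXi's implementation: the function names are invented, triples are plain tuples in a set, and a single rdfs:subClassOf rule stands in for the OWL rules (owl:oneOf, owl:unionOf) that the real test exercises.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def infer_types(facts):
    """Toy rule: (x type C) and (C subClassOf D) imply (x type D), to a fixpoint."""
    inferred = set()
    changed = True
    while changed:
        changed = False
        known = facts | inferred
        for (s, p, c) in known:
            if p != RDF_TYPE:
                continue
            for (c2, p2, d) in known:
                new = (s, RDF_TYPE, d)
                if p2 == SUBCLASS and c2 == c and new not in known and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred

def prospective_query(facts, infer, query):
    """Run `query` over facts plus inferences, then restore the original facts."""
    inferred = infer(facts)
    facts.update(inferred)  # temporarily add the inferred statements
    try:
        return query(facts)
    finally:
        facts.difference_update(inferred)  # remove them again, even on error

def types_of_lion(graph):
    return sorted(o for (s, p, o) in graph if s == "test:Lion" and p == RDF_TYPE)

facts = {
    ("test:Lion", RDF_TYPE, "test:Animal"),
    ("test:Animal", SUBCLASS, "test:LivingBeing"),
}
print(prospective_query(facts, infer_types, types_of_lion))
```

The try/finally block mirrors the "time to remove inferred statements" step in the log: the source graph ends up exactly as it started.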

see previous posts on FuXi/N3/4Suite:

Chimezie Ogbuji
