Amara 1.1.6

I released Amara 1.1.6 last week (see the announcement). This version requires 4Suite XML 1.0b2. As usual, though, I have prepared an "allinone" package so that you do not need to install 4Suite separately to use Amara.

The biggest improvements in this release are to performance and to the API. Amara takes advantage of a lot of the great performance work that has gone into 4Suite (e.g. Saxlette). There is also a much easier API on-ramp that I expect most users will appreciate. Rather than having to parse using:

from amara import binderytools as bt
doc = bt.bind_string(XML) #or bt.bind_uri or bt.bind_file or bt.bind_stream

you can use:

import amara
amara.parse(XML) #Whether XML is string, file-like object, URI or local file path
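The result is a regular Amara bindery document, so element structure maps straight to Python objects. From memory of the bindery conventions (so treat this as a rough sketch rather than gospel), access looks like:

import amara

doc = amara.parse('<monty><python spam="eggs">What do you mean "bleh"</python></monty>')
print doc.monty.python.spam      #the attribute value: u'eggs'
print unicode(doc.monty.python)  #the text content of the python element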

There are several other such simplifications. There is also the xml_append_template facility, which is very handy for generating XML (see how Sylvain uses it to simplify atomixlib).

Thanks to all the folks who helped with suggestions, patches, review, etc.

[Uche Ogbuji]

via Copia

Cop it while it's hot: 4Suite XML 1.0b2

Updated with working link for the manual

We've announced 4Suite XML 1.0b2. It's a big step towards a 1.0 release, even bigger than most of our releases, because what we've done with this one is trim the overall 4Suite package to a sensible size and scope. This release contains only the XML processing core and some support libraries. It does not contain the RDF libraries and the repository. This does not mean those components are stranded (see, for example, the rdflib/4RDF merger effort for a sense of the new juice being fed into 4Suite/RDF). It's just that the core XML libraries are so much more mature than the other components, and so much more widely used, that it made no sense not to set it free and let it quickly march to 1.0 on its own. This release features some serious performance improvements, some simplified APIs for a gentler user learning curve, and a lot of fixes and other improvements (see the announcement for the full, long list).
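As a taste of the gentler learning curve (a quick sketch from memory; the announcement and the manual have the definitive details), parsing a document with the Domlette readers is about this simple:

from Ft.Xml.Domlette import NonvalidatingReader

doc = NonvalidatingReader.parseUri('http://example.com/doc.xml')
#or, for markup already in hand (the second argument is a base URI):
doc = NonvalidatingReader.parseString('<spam>eggs</spam>', 'urn:bogus:dummy')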

In fact, the code we released is just about 1.0 in itself, as far as the XML component goes. A code freeze is in place, and we'll focus on fixing bugs and wrapping up the user manual effort. (BTW, if you'd like to help chip in on the manual, please say so on the 4Suite-dev mailing list; there is a lot of material in place, and what we need is mostly in the way of editing and improving details.) Our plan is to get the XML core to 1.0 more quickly than we would have been able to before breaking 4Suite into components, and then we can focus on RDF and the repository. 4Suite/RDF will probably disappear into the new rdflib, and the repository will probably go through heavy refactoring and simplification.

Today, after some day-job tasks, my priority will be getting Amara 1.1.5 out. It's been largely ready and waiting for the 4Suite release. Some really sweet improvements and additions in this Amara release (though I do say so myself). More on that later.

[Uche Ogbuji]

via Copia

Processing "Web 2.0" using XSLT document() variants? No thanks.

Mark Nottingham has written an intriguing piece "XSLT for the Rest of the Web". It's drummed up some interest, some of which has even leaked into the 4Suite mailing list thanks to the energetic Sylvain Hellegouarch. Mark says:

I’ve raved before about how useful the XSLT document() function is, once you get used to it. However, the stars have to be aligned just so to use it; the Web site can’t use cookies for anything important, and the content you’re interested in has to be available in well-formed XML.

He goes on to present a set of extension functions he's created for libxslt. They are basically smarter document() functions that can do fancy Web things, including HTTP POST and running tag soup HTML through HTML Tidy to retrieve it as XHTML.

As I read through it, I must say my strong impression was "been there, done that, probably never looking back". Certainly no diss of Mark intended there. He's one of the sharper hackers I know. I guess we're just at different points in our thinking of where XSLT fits into the Web-savvy apps toolkit.

First of all, I think the Web has more dragons than you could easily tame with even the mightiest XSLT extension hackery. I think you need a general-purpose programming language to wrangle "Web 2.0" without drowning in tears.

More importantly, if I ever needed XSLT's document() function to process anything more than it's spec'ed to, I would consider that a pretty strong indicator that it's time to rethink part of my application architecture.

You see, I used to be a devotee of XSLT all over the place, and XSLT extensions for just about every limitation of the language. Heck, I wrote a whole framework of such things into 4Suite Repository. I've since reformed. These days I take the pipeline approach to such processing, and I keep XSLT firmly in the narrow niche for which it was designed. I have more on this evolution of thinking in "Lifting XSLT into application domain with extension functions?".

But back to Mark's idea. I actually implemented 4Suite XSLT extensions to use HTTP POST and to tidy tag soup HTML into XHTML, but I wouldn't dream of using these extensions any more. Nowadays, I use Python to gather and prepare data into a model representation that I then hand over to XSLT for pure presentation processing. Complex logical tasks such as accessing Web data beyond trivially fetched XML are matters for the model layer, not the presentation logic. For example, if I need to tidy something, I tidy it at the Python level and put what I need of the resulting XHTML into the model XML before passing it to XSLT. I use Amara XML Toolkit with John Cowan's TagSoup for my tidying needs. I prefer TagSoup to Tidy because I find it faster and more robust.
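To make that concrete, here is a rough sketch of the shape of such a pipeline with 4Suite's XSLT processor (the URL, stylesheet path and model document are invented for illustration, and the tidying step is just a placeholder for whatever TagSoup invocation you use):

import urllib
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor

#Model layer: gather the Web data and reduce it to a simple XML model document
raw_html = urllib.urlopen('http://example.com/page.html').read()
#...run raw_html through TagSoup (or similar) here and extract what you need...
model_xml = '<model><page-title>Example</page-title></model>'

#Presentation layer: hand the model document to XSLT purely for rendering
processor = Processor.Processor()
processor.appendStylesheet(
    InputSource.DefaultFactory.fromUri('file:///path/to/render.xslt'))
print processor.run(
    InputSource.DefaultFactory.fromString(model_xml, 'urn:bogus:model'))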

Even if you use the libxml2 family of tools, I still think it's better to use libxml2 (and perhaps its HTML parser) to do the model processing, and to hand the resulting XML over to libxslt in a separate step.

XSLT is pretty cool, but these days rather than reproduce all of Python's dozens of Web processing libraries therein, I plump for Python itself.

[Uche Ogbuji]

via Copia

4Suite and Rdflib - Advancing the State of The Art in Python / RDF

A few days ago I checked in a 4RDF driver that wraps the rdflib persistence API. 4RDF has a standard API for abstracting the persistence of RDF that sits directly below the Model interface and allows an author to implement a mechanism for persisting an RDF graph in whatever database, filesystem, etc. they choose. rdflib has a similar separation, but the actual interfaces differ. Daniel Krech and I have been working rather diligently on formalizing a universal persistence API that allows an implementation to support a graduated set of features:

  • Core RDF Store
  • Context-aware RDF Store
  • Notation 3 RDF Store (Formula-aware RDF Store)

It's still a work in progress (with regards to Notation 3 persistence, mostly) but at least those interfaces/method signatures needed by a context-aware RDF store are well spec'd out.
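To give a sense of the shape of the thing, here is a hypothetical sketch of a context-aware store interface; the method names and signatures below are illustrative only, and the actual formalized API lives in the checked-in code:

class ContextAwareStore:
    """Illustrative sketch of a graduated RDF store interface (not the real API)."""
    def add(self, (subject, predicate, object_), context=None):
        """Persist one statement, optionally within a named context."""
        raise NotImplementedError
    def remove(self, (subject, predicate, object_), context=None):
        """Remove statements matching the pattern; None acts as a wildcard."""
        raise NotImplementedError
    def triples(self, (subject, predicate, object_), context=None):
        """Yield statements matching the pattern, scoped to a context if given."""
        raise NotImplementedError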

I won't bore you with the details (and there are plenty, as we've covered a lot of ground), but you can dive in here. This driver, which allows 4Suite to use rdflib for persistence of its RDF data, is the first step in an agreed migration that will eventually phase out 4RDF and replace it with rdflib. At the very least, this module allows for the dispatching of Versa queries on an rdflib Graph, is the first step toward letting a 4Suite repository persist its RDF graphs in an rdflib store (I think there is a lot of synthesis worth exploring with redfoot), provides rdflib with access to a rather voluminous 4RDF test suite, and demonstrates how existing applications that use 4RDF could be ported to use rdflib instead.

Outstanding / Possible Issues:

1) The current rdflib Graph interfaces do not account for RDF reification. This is somewhat covered by the support for Notation 3 quoted/hypothetical contexts. The only visible difference is in the test cases that match by statementUri.

2) This driver has only been tested against the MySQL implementation of an rdflib store. This is mainly because it's currently the only rdflib store implementation that supports matching arbitrary triple/statement terms by REGEX patterns and/or producing quads instead of triples (i.e. returning the name of each resulting statement's context in addition to the triple). This is only an issue for RDF stores that are at least context-aware, and even then it is at most an interface mismatch.

I plan to do some more experimentation on the possibilities that this synthesis provides (surprise, surprise). The timing is rather appropriate given the ongoing development of the next-generation Versa query specification, the concurrent effort to graduate the 4Suite code base to 1.0, and my recent pleasant surprise regarding Versa, Sparta, and rdflib.

If you are interested in helping or learning more about the roadmap, you can pay #redfoot (on irc.freenode.net) a visit. That's where Daniel Krech and the other rdflib folks have been burning brain cells as of late.

Chimezie Ogbuji

via Copia

More on the PyBlosxom del.icio.us plug-in, and introducing task_control.py, a pseudo-cron plug-in for PyBlosxom

Micah put my del.icio.us daily links tool to immediate use on his blog. He uncovered a bug in the character handling, which is now fixed in the posted amara_delicious.py file.

I usually invoke the script from cron, but Micah asked if there was an alternative. I've been meaning to hack up a poor man's cron for PyBlosxom and this gave me an additional push. The result is task_control.py.

A sort of poor man's cron for PyBlosxom, this plug-in allows you to specify tasks (as Python scripts) to be run only at certain intervals. Each time the plug-in is invoked, it checks the set of tasks and the last time each was run, and runs only those that have not been run within the specified interval.

To run the Amara del.icio.us daily links script once a day, you would add the following to your config file:

py["tasks"] = {"/usr/local/bin/amara_delicious.py": 24*60*60}
py["task_control_file"] = py['datadir'] + "/task_control.dat"

You could of course have multiple script/interval mappings in the "tasks" dict. The scripts are run with the variables request and config set, so, for example, when running from task_control.py, you could change the BASEDIR line of amara_delicious.py from

BASEDIR = '/srv/www/ogbuji.net/copia/pyblosxom/datadir'

to

BASEDIR = config['datadir']
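For the curious, the bookkeeping involved is roughly the following. This is a simplified sketch, not the actual plug-in code; the real task_control.py differs in details such as how it hooks into PyBlosxom and how it handles errors.

import time
import pickle

def run_due_tasks(request, config):
    #Load the last-run timestamps recorded in the task control file
    control_file = config['task_control_file']
    try:
        last_run = pickle.load(open(control_file, 'rb'))
    except (IOError, EOFError):
        last_run = {}
    now = time.time()
    for script, interval in config['tasks'].items():
        if now - last_run.get(script, 0) >= interval:
            #Run the task script with request and config in its namespace
            execfile(script, {'request': request, 'config': config})
            last_run[script] = now
    pickle.dump(last_run, open(control_file, 'wb'))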

[Uche Ogbuji]

via Copia

del.icio.us daily links, using Amara

I added a new feature on Copia: every day there will be an automated posting with my and Chime's del.icio.us links from the previous day. You can see an example of the results in the Copia entry just before this one.

What I think is most cool is how easy it was to write, and how easy the resulting code is to understand. It's just 35 lines (including 7 lines of imports), and in that it packs some useful features I haven't found in other such scripts, including:

  • Full Unicode safety (naturally, I wouldn't have it any other way)
  • support for multiple del.icio.us feeds, with tag by author
  • tagging the PyBlosxom entry with the aggregated/unique tags from the del.icio.us entries

Here's the code. The only external requirement is Amara:

import os
import sets
import time
import codecs
import itertools
from datetime import date, timedelta

from amara import binderytools

TAGBASE = 'http://del.icio.us/tag/'

#Change BASEDIR and FEEDS to customize
BASEDIR = '/srv/www/ogbuji.net/copia/pyblosxom/datadir'
FEEDS = ['http://del.icio.us/rss/uche', 'http://del.icio.us/rss/chimezie']

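#Compute yesterday's date (GMT) and a timestamp for the entry's #post_time header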
now = time.gmtime()
timestamp = unicode(time.strftime('%Y-%m-%dT%H:%M:%SZ', now))
targetdate = (date(*now[:3]) - timedelta(1)).isoformat()

#Using Amara.  Easy to just grab the RSS feed
docs = map(binderytools.bind_uri, FEEDS)
items = itertools.chain(*[ doc.RDF.item for doc in docs ])
current_items = [ item for item in items
                       if unicode(item.date).startswith(targetdate) ]
if current_items:
    # Create a Markdown page with the daily bookmarks.
    dir = '%s/%s' % (BASEDIR, targetdate)
    if not os.path.isdir(dir):
        os.makedirs(dir)
    f = codecs.open('%s/%s/del.icio.us.links.txt' % (BASEDIR, targetdate), 'w', 'utf-8')

    # Pyblosxom Title
    f.write(u'del.icio.us bookmarks for %s\n' % targetdate)

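    # Gather the unique del.icio.us tags across the day's items for the keywords line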
    tags = sets.Set()
    for item in current_items:
        tags.update([ li.resource[len(TAGBASE):] for li in item.topics.Bag.li ])
    f.write(u'#post_time %s\n'%(timestamp))
    f.write(u'<!--keywords: del.icio.us,%s -->\n'%(u','.join(tags)))

    for item in current_items:
        # List of links in Markdown.
        title = getattr(item, 'title', u'')
        href = getattr(item, 'link', u'')
        desc = getattr(item, 'description', u'')
        creator = getattr(item, 'creator', u'')
        f.write(u'* "[%s](%s)": %s *(from %s)*\n' % (title, href, desc, creator))

    f.close()

Or download amara_delicious.py.
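I run it daily from cron; a crontab entry along these lines does the trick (the path is just an example):

0 1 * * * python /usr/local/bin/amara_delicious.py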

You can see how easily you can process RSS 1.0 in Amara. I don't think actual RDF parsing/processing is necessary at all. That extra layer is the first thing that decided me against Matt Biddulph's module, in addition to his use of libxml for XML processing, which Roberto De Almeida's module also uses.

[Uche Ogbuji]

via Copia

Hooking up an IRC Agent to a Query Interface

Uche gave an excellent suggestion to augment Emeka to work with Triclops. After finishing Triclops, I had realized that most of the functionality Emeka provided was now redundant, since it could be performed using Triclops (with the added advantage of being able to diagram/navigate RDF graphs). Triclops URIs are unfortunately very long for queries submitted through HTTP GET. This is mostly unavoidable because the parameters to the query service are Versa queries (which have their own syntax, including characters that may not be very friendly to URIs) and URIs, so the prospect of submitting URIs to a service identified by a URI will probably always be problematic, and might be as fundamental a problem as submitting XML content through an XML-based messaging format. I'm digressing, but the solution that worked for me (which Uche suggested) was to make use of a URI 'shortening' service (such as MakeAShorterLink, which I used) to generate a 'hashed' or 'compacted' form of the URI from the parts submitted from, say, IRC.

For example, let's say I was interested in Daniel Krech's redfoot framework and wanted to navigate the ontology (available from the URL http://redfoot.net/2005/redfoot) as an RDF diagram, in order to get an idea of the components and how they relate. In particular, let's say I knew to start with the redfoot:Code class; I could interactively ask Emeka to execute the query:

redfoot:Code

against the URL which returns the RDF/XML serialization of the redfoot ontology. I could also ask Emeka to have the result rendered as a JPEG diagram with an associated image map that allows me to navigate to other aspects of the ontology to learn more about redfoot. Below is how I would do so with the new .triclops service:

<chimezie> .triclops "redfoot:Code" http://redfoot.net/2005/redfoot jpeg
<Emeka> Shortened link to Triclops request: http://makeashorterlink.com/?I1F7262EB

Below is the help associated with the triclops service.

<chimezie> .help triclops
<Emeka> .triclops "..Versa Query.." <source graph> <triples|jpeg|svg|versa>
<Emeka> The first argument is quoted Versa query to execute.  The second is a URL which points to the RDF graph (N3 or RDF/XML) to query against
<Emeka> The third argument is one of "triples", "jpeg", or "svg" and specifies how to return the query result
<Emeka> "triples" - raw triples in a tabled-grid, "jpeg" or "svg" - as navigable RDF graphs, and "versa" - raw Versa datatypes (rendered as html)
<Emeka> The result is a uri (courtesy of http://makeashorterlink.com) which redirects to the appropriate Triclops request

Chimezie Ogbuji

via Copia

BNode Drama for your Mama

You know you are a geek when it's 5am and you are wrestling with existential quantification and its value in querying. This was triggered originally by the ongoing effort to extend an already expressive pattern-based RDF querying language to cover more use cases. The motivation is that such patterns should be expressive beyond just the level of triple-matching, since the core RDF model has a level of granularity below statements (you have literals, resources, and BNodes, etc.). I asked myself whether there was a justifiable reason why Versa at its core does not include BNodes:

Blank nodes are treated as simply indicating the existence of a thing, without using, or saying anything about, the name of that thing. (This is not the same as assuming that the blank node indicates an 'unknown' URI reference; for example, it does not assume that there is any URI reference which refers to the thing. The discussion of Skolemization in appendix A is relevant to this point.)

I don't remember the original motivation for leaving BNodes out of the core query data types, but in retrospect I think it was a good decision, and not only because the SPARQL specification does something similar (in interpreting BNodes as an open-ended variable). But it's worth noting that the section on blank nodes appearing in a query, as opposed to appearing in a query result (or existing in the underlying knowledge base), is quite short:

A blank node can appear in a query pattern. It behaves as a variable; a blank node in a query pattern may match any RDF term.

Anyway, at the time I noticed this lack of BNodes in query languages, I had a misconception about BNodes. I thought they represented individual things we want to make statements about but whose identity we don't know, or don't want to have to worry about assigning (this is probably how 90% of BNodes are used in reality). This confusion came from the practical way BNodes are almost always handled by RDF data stores (Skolemization):

Skolemization is a syntactic transformation routinely used in automatic inference systems in which existential variables are replaced by 'new' functions - function names not used elsewhere - applied to any enclosing universal variables. In RDF, Skolemization amounts to replacing every blank node in a graph by a 'new' name, i.e. a URI reference which is guaranteed to not occur anywhere else. In effect, it gives 'arbitrary' names to the anonymous entities whose existence was asserted by the use of blank nodes: the arbitrariness of the names ensures that nothing can be inferred that would not follow from the bare assertion of existence represented by the blank node.

This misconception was clarified when Bijan Parsia ({scope(PyChinko)} => {scope(FuXi)}) expressed that he took issue with my assertion(s) that there are some compromising redundancies with BNodes, Literals, and simple entailment with regard to building programmatic APIs for them.

Then the light bulb went off: the semantics of BNodes are (as he put it) much stronger than the way they are most often used. Most people who use BNodes don't mean to state that there is a class of things which have the asserted set of statements made about them. Consider the difference between:

  1. Who are all the people Chime knows?
  2. There is someone Chime knows, but I just don't know his/her name right now
  3. Chime knows someone! (dudn't madder who)

The first scenario is the basic use case for variable resolution in an RDF query and is asking for the resolution of variable ?knownByChime in:

<http://metacognition.info/profile/webwho.xrdf#chime> foaf:knows ?knownByChime.

Which can be expressed in Versa (currently) as:

resource('http://metacognition.info/profile/webwho.xrdf#chime')-foaf:knows->*

Or eventually (hopefully) as:

foaf:knows(<http://metacognition.info/profile/webwho.xrdf#chime>)

And in SPARQL as:

select 
  ?knownByChime 
where  
{
  <http://metacognition.info/profile/webwho.xrdf#chime> foaf:knows ?knownByChime
}

The second case is the most common way people use BNodes. You want to say Chime knows someone, but you don't know a permanent identifier for this person, or don't care to assign one at the time you make the assertion:

<http://metacognition.info/profile/webwho.xrdf#chime> foaf:knows _:knownByChime .

But RDF-MT specifically states that BNodes are not meant to be interpreted only in this way. Their semantics are much stronger. In fact, as Bijan pointed out to me, the proper use for BNodes is as scoped existentials within ontological assertions, for example owl:Restrictions, which allow you to say things like: the named class KnowsChime consists of everybody who knows Chime:

@prefix mc: <http://metacognition.info/profile/webwho.xrdf#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

mc:KnowsChime a owl:Class;
      rdfs:subClassOf
      [
        a owl:Restriction;
        owl:onProperty foaf:knows;
        owl:hasValue mc:chime
      ];
      rdfs:label "KnowsChime";
      rdfs:comment "Everybody who knows Chime" .

The fact that BNodes aren't meant to be used in the way they often are leads to some suggested modifications to allow BNodes to be used as 'temporary identifiers' in order to simplify query resolution. But as clarified in the same thread, BNodes in a query don't make much sense - which is the conclusion I'm coming around to: there is no use case for asserting an existential quantification while querying for information against a knowledge base. Using a variable (in the way SPARQL does) should be sufficient. In fact, all RDF querying use cases (and languages) seem to be reducible to variable resolution.

This last part is worth noting because it suggests that if you have a library that handles variable resolution (such as rdflib's most recent addition), you can map any query language (Versa/SPARQL/RDFQueryLanguage_X) to it by reducing the query to a set of triple patterns with the variables you wish to resolve.
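As a rough illustration of that reduction (written against current rdflib conventions purely for illustration, since the module layout differs from the 2005-era API, and the parsed URL is just the example from above), the "who does Chime know?" question becomes a single triple pattern with one unresolved term:

from rdflib import Graph, Namespace, URIRef

FOAF = Namespace('http://xmlns.com/foaf/0.1/')
chime = URIRef('http://metacognition.info/profile/webwho.xrdf#chime')

g = Graph()
g.parse('http://metacognition.info/profile/webwho.xrdf')

#None plays the role of the variable ?knownByChime in the pattern
for subj, pred, known in g.triples((chime, FOAF.knows, None)):
    print(known)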

So my conclusions?:

  • Blank Nodes are a necessary component in the model (and any persistence API) that unfortunately have much stronger semantics (existential quantification) than their most common use (as temporary identifiers)
  • The distinction between the way BNodes are most often used (as a syntactic shorthand for a single resource for which there is no known identity - at the time) and the formal definition of BNodes is very important to note - especially for those who are very much wed to their BNodes, as Shelley Powers has shown herself to be :).
  • Finally, BNodes emphatically do not make sense in the context of a query, since they become infinitely resolvable variables, which is not very useful. This confusion is further proof that, once again, for the sake of minimizing said confusion and the misinterpretation of some very complicated axioms, there is plenty of value in parenthetically (if not logically) divorcing (pun intended) RDF model theoretics from the nuts and bolts of the underlying model.

Chimezie Ogbuji

via Copia

RSS feeds for 4Suite (etc.) mailing lists

Jeremy Kloth set up RSS content feeds for Fourthought-hosted mailing lists, including the 4Suite and EXSLT lists (all on Mailman). The list information page for all the lists has an RSS link in the header, so it should be picked up by most news readers. For convenience, though, here are the main lists and the corresponding feeds:

[Uche Ogbuji]

via Copia