Follow-up Copia housecleaning

Of course chores lead to more chores. After last week's round of tweaks to Copia I got a suggestion from Aristotle to rearrange the entry-specific titles, and I've done so. I also got a bit more info from Tom Passin about possible encoding problems, which has only deepened my bafflement.

I also noticed there has been some confusion over last week's birth announcement. It came from Chimezie, not me (congrats, brother!). On Copia the author is specified for each entry, but previously no such useful distinction was being made in the Atom 0.3 (I'll be working on an Atom 1.0 flavor for PyBlosxom soon) or RSS 1.0 feeds. I've fixed that, but I've done so in a way I'm not sure all feed sinks will process correctly. In the Atom feed there is a top-level author:

<author>
  <name>Uche and Chimezie Ogbuji</name>
  <url>http://copia.ogbuji.net/blog/</url>
  <email>uche@ogbuji.net</email>
</author>

And then each entry has a more specific author, for example:

<title>Chikaora Zion Credell Ogbuji</title>
...
<author>
  <name>chimezie</name>
</author>
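
For feed sinks, the expected behavior is that the entry-level author overrides the feed-level one. As a quick sanity check, here is a minimal sketch using the third-party Universal Feed Parser (feedparser); the feed URL here is a placeholder, not necessarily Copia's actual feed address:

import feedparser  # third-party Universal Feed Parser

d = feedparser.parse("http://copia.ogbuji.net/blog/atom.xml")  # placeholder feed URL
feed_author = d.feed.get("author", "unknown")
for entry in d.entries:
    # Prefer the entry-level author; fall back to the feed-level author
    print(entry.get("author", feed_author), "-", entry.get("title"))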

I hope that helps. I made some other tweaks to the feeds, and this does seem to have had the unfortunate side-effect of pushing everything back onto the front page of Planet XML. My apologies to Planet XML readers (including me: I'd hoped to catch up after the holidays and found only Copia entries).

Copia already tells you the author of each entry, in the info line at the end of the entry.

[Uche Ogbuji]

via Copia

Copia housecleaning

I've finally had some time today, as I prepare for the holidays, to fix some things on Copia that have been broken for too long. Some of the highlights, especially concerning issues mentioned by readers (thanks, guys), are:

RSS 1.0 feed body fix. Added an rss:description field to the RSS 1.0 feed, which fixes missing post bodies in readers such as Bloglines that don't support content:encoded. I do truncate the field to 500 characters, per the recommendation in the spec.

Single entry view title fix. Added entry titles for single entry pages. Before today, if you viewed this entry through the perma-link, the title would just say "Copia"; now it says "Copia ✏Copia housecleaning". I've wanted to do this for a while, but I was having the devil of a time figuring out how to do it with PyBlosxom. A scolding from Dan Connolly forced me to chase down a fix. For other PyBlosxom users, the trick is to use the comments plug-in, copy the head.* flavor file to comment-head.*, and then update it to use the $title variable, which is the title of the entry itself ($blog_title is the title of the entire blog). In my case the updated HTML header template looks like:

<title>$blog_title &#x270F;$title</title>

I did get a report that Copia is incorrectly sending a Content-Type header of text/html;charset=ISO-8859-1, but when I check using the LiveHTTPHeaders extension for Firefox on Linux it reports the correct charset=UTF-8 from the server. If anyone else can corroborate this issue, please leave a comment with the specific URL from which you noticed the error, your platform and browser, and the HTTP sniffing tool you were using. Thanks.
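
If you'd like to corroborate without a browser extension, a quick check from Python works just as well (a minimal sketch; any HTTP client will do):

from urllib.request import urlopen

# Print the Content-Type header the server actually sends
with urlopen("http://copia.ogbuji.net/blog/") as response:
    print(response.headers.get("Content-Type"))
# A correct response should include charset=UTF-8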

[Uche Ogbuji]

via Copia

Don't give me that monkey-ass Web 1.0, either

Musing about whether XML and RDF are too hard (viz. Mike Champion's summary of Bosworth), and whether XQuery and OWL are really the right food for better XML tools (viz. Mike Champion's summary of Florescu), my first reaction was to the latter idea, especially with respect to XQuery. I argued that declarative programming is the key, but that it is quite possible to take advantage of declarative programming outside of XQuery. Nothing new there: I've been arguing for the marriage of XML and declarative techniques within "agile" languages for years. I don't think that declarative techniques inevitably require bondage-and-discipline type systems (thanks to Amyzing (1), (2) for that killer epithet).

Since then, I've also been pondering the XML-too-hard angle. I think folks such as Adam Bosworth are discounting the fact that as organizations increasingly build business plans around aggregation and integration of Web material, there comes an inevitable backlash against the slovenliness of HTML legacy and RSS Babel. Sloppy might be good enough for Google, but who makes money off that? Yeah. Just Google. Others such as Yahoo and Microsoft have started to see the importance of manageable text formats and at least modest depth of metadata. The IE7 team's "well-formed-Web-feeds-only" pledge is just one recent indicator that there will be a shake-up. No one will outlaw tag soup overnight, but as publishers find that they have to produce clean data, and some minimally clean metadata, to participate in large parts of the Web marketplace beyond Google, they will fall in line. Of course this does not mean that there won't be people gaming the system, and all this Fancy Web agitation may well be just a big, speculative bubble that will burst soon and take all these centralizing forces with it. But at least in the medium term, I think that pressure on publishers will lead to a healthy market for good non-sloppy tools, which is the key to non-sloppy data.

Past success is no predictor of future performance, and that goes for the Web as well. I believe that folks whose scorn of "Web 2.0" takes them all the way back to what they call "Web 1.0" are buying airline stock in August of 2001.

[Uche Ogbuji]

via Copia

Addressing the RDF Scalability Bottleneck

[by Chimezie Ogbuji]

I've been building RDF persistence stores for some time (it's gone from something of a hobby to the primary responsibility in my current work capacity) and have come to the conclusion that RDF stores will almost always be susceptible to the physical limitations of database scalability.

I recall that when I was at the Semantic Technology Conference this spring, I asked one of the presenters what he thought about this problem that all RDF databases face, and why most don't function effectively beyond 5-10 million triples. I liked his answer:

It's an engineering problem

Consider the amount of information an adult has stored (by whatever mechanism the human brain uses to persist information) in his or her noggin. We often take it for granted - as we do all other aspects of biology we know very little about - but it's worth considering when thinking about why scalability is a ubiquitous hurdle for RDF databases.

Some basic Graph theory is relevant to this point:

The size of a graph is the number of edges, and the order of a graph is the number of nodes within the graph. RDF is a Resource Description Framework (where what we know about resources is key - not so much the resources themselves), so it's not surprising that RDF graphs will almost always have a much larger size than order. It's also not surprising that most performance analyses made across RDF implementations (such as LargeTripleStores, for instance) focus mostly on triple size.

I've been working on a SQL-based persistence schema for RDF content (for rdflib) that is a bit different from the standard approaches taken by most RDBMS implementations of RDF stores I'm familiar with (including those I've written). Each of the tables is prefixed with a SHA1-hashed digest of the identifier associated with the 'localized universe' (AKA, the boundary for closed-world assumptions). The schema is below:

CREATE TABLE %s_asserted_statements (
     subject       text not NULL,
     predicate     text not NULL,
     object        text,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,    
     objLanguage   varchar(3),
     objDatatype   text,
     INDEX termComb_index (termComb),    
     INDEX spoc_index (subject(100),predicate(100),object(50),context(50)),
     INDEX poc_index (predicate(100),object(50),context(50)),
     INDEX csp_index (context(50),subject(100),predicate(100)),
     INDEX cp_index (context(50),predicate(100))) TYPE=InnoDB

CREATE TABLE %s_type_statements (
     member        text not NULL,
     klass         text not NULL,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,
     INDEX termComb_index (termComb),
     INDEX memberC_index (member(100),klass(100),context(50)),
     INDEX klassC_index (klass(100),context(50)),
     INDEX c_index (context(10))) TYPE=InnoDB

CREATE TABLE %s_quoted_statements (
     subject       text not NULL,
     predicate     text not NULL,
     object        text,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,
     objLanguage   varchar(3),
     objDatatype   text,
     INDEX termComb_index (termComb),
     INDEX spoc_index (subject(100),predicate(100),object(50),context(50)),
     INDEX poc_index (predicate(100),object(50),context(50)),
     INDEX csp_index (context(50),subject(100),predicate(100)),
     INDEX cp_index (context(50),predicate(100))) TYPE=InnoDB
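
For illustration, here is one way the hashed table prefix might be computed (a minimal sketch; the helper name is mine, not the actual rdflib driver code):

import hashlib

def table_name(universe_identifier, suffix):
    # Hash the 'localized universe' identifier to get a stable,
    # SQL-safe prefix for the partition tables above
    prefix = hashlib.sha1(universe_identifier.encode("utf-8")).hexdigest()
    return "%s_%s" % (prefix, suffix)

print(table_name("http://example.org/universe", "asserted_statements"))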

The first thing to note is that statements are partitioned into logical groupings:

  • Asserted non-rdf:type statements: where all asserted RDF statements whose predicate isn't rdf:type are stored
  • Asserted rdf:type statements: where all asserted rdf:type statements are stored
  • Quoted statements: where all quoted / hypothetical statements are stored

Statement quoting is a Notation 3 concept and an extension of the RDF model for this purpose. The most significant partition is the rdf:type grouping. The idea is to have class membership modeled at the store level instead of at a level above it. RDF graphs are as different as the applications that use them, but the primary motivating factor for making this separation was the assumption that in most RDF graphs a majority of the statements (or a significant portion) would consist of rdf:type statements (statements of class membership).

Class membership can be considered an unstated RDF modelling best practice, since it allows an author to say a lot about a resource simply by associating it with a class that has its semantics completely spelled out in a separate, supporting ontology.

The rdf:type table models class membership explicitly with two columns: klass and member. This results in a savings of 43 characters per rdf:type statement. The implementation takes note of the predicate submitted in a triple-matching pattern and determines which tables to search. Consider the following triple pattern:

http://metacognition.info ?predicate ?object

The persistence layer would know it needs to check against the table that persists non-rdf:type statements as well as the class membership table. However, patterns that match against a specific predicate (other than rdf:type), or class membership queries, only need to check within one partition (or table), as in the dispatch sketch below:

http://metacognition.info rdf:type ?klass
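
To make the dispatch concrete, here is a minimal sketch (my own illustration, not the actual driver code) of how a persistence layer might route a triple pattern to the right partition tables:

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def tables_for_pattern(predicate, prefix):
    # Quoted statements live in a third partition, ignored here for brevity
    if predicate == RDF_TYPE:
        # Class membership queries hit only the rdf:type partition
        return ["%s_type_statements" % prefix]
    elif predicate is None:
        # An unbound predicate could match either partition
        return ["%s_asserted_statements" % prefix,
                "%s_type_statements" % prefix]
    else:
        # Any other specific predicate lives only in the asserted table
        return ["%s_asserted_statements" % prefix]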

In general, I've noticed that being able to partition your SQL search space (searching within a named graph / context, or searching within a single table) goes a long way toward improving query response.

The other thing worth noting is the termComb column, which is an integer value representing the 40 unique ways the following RDF terms could appear in a triple (a sketch of one possible encoding follows the list):

  • URI Ref
  • Blank Node
  • Formula
  • Literal
  • Variable
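
One way to arrive at 40 combinations - my own illustration, not necessarily the exact encoding the schema uses - is to note that a subject can be a URI ref, blank node, formula, or variable (4 choices), a predicate a URI ref or variable (2 choices), and an object any of the five term types (5 choices), and 4 * 2 * 5 = 40:

SUBJECT_TERMS = ["U", "B", "F", "V"]       # URI ref, BNode, Formula, Variable
PREDICATE_TERMS = ["U", "V"]               # URI ref, Variable
OBJECT_TERMS = ["U", "B", "F", "V", "L"]   # ...plus Literal

COMBINATIONS = [s + p + o
                for s in SUBJECT_TERMS
                for p in PREDICATE_TERMS
                for o in OBJECT_TERMS]
assert len(COMBINATIONS) == 40

def term_comb(s, p, o):
    # Map a term-type triple like ("U", "U", "L") to a small integer
    # suitable for the termComb tinyint column
    return COMBINATIONS.index(s + p + o)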

I'm certain there are many other possible optimizations that can be made in a SQL schema for RDF triple persistence (there isn't much precedent in this regard - and Oracle has only recently joined the fray), but partitioning rdf:type statements separately is one such thought I've recently had.

[Uche Ogbuji]

via Copia

Cop it while it's hot: 4Suite XML 1.0b2

Updated with working link for the manual

We've announced 4Suite XML 1.0b2. It's a big step towards a 1.0 release, even bigger than most of our releases, because what we've done with this one is trim the overall 4Suite package to a sensible size and scope. This release contains only the XML processing core and some support libraries. It does not contain the RDF libraries or the repository. This does not mean those components are stranded (see, for example, the rdflib/4RDF merger effort for a sense of the new juice being fed into 4Suite/RDF). It's just that the core XML libraries are so much more mature than the other components, and so much more widely used, that it made no sense not to set them free and let them quickly march to 1.0 on their own. This release features some serious performance improvements, some simplified APIs for a gentler user learning curve, and a lot of fixes and other improvements (see the announcement for the full, long list).

In fact, the code we released is just about 1.0 in itself, as far as the XML component goes. A code freeze is in place, and we'll focus on fixing bugs and wrapping up the user manual effort. (BTW, if you'd like to help chip in on the manual, please say so on the 4Suite-dev mailing list; there is a lot of material in place, and what we need is mostly in the way of editing and improving details.) Our plan is to get the XML core to 1.0 more quickly than we would have been able to before breaking 4Suite into components, and then we can focus on RDF and the repository. 4Suite/RDF will probably disappear into the new rdflib, and the repository will probably go through heavy refactoring and simplification.

Today, after some day-job tasks, my priority will be getting Amara 1.1.5 out. It's been largely ready and waiting for the 4Suite release. Some really sweet improvements and additions in this Amara release (though I do say so myself). More on that later.

[Uche Ogbuji]

via Copia

4Suite and Rdflib - Advancing the State of The Art in Python / RDF

A few days ago I checked in a 4RDF driver that wraps the rdflib persistence API. 4RDF has a standard API for abstracting the persistence of RDF that sits directly below the Model interface and allows an author to implement a mechanism for persisting an RDF graph in whatever database, filesystem, etc. they choose. rdflib has a similar separation, but the actual interfaces differ. Daniel Krech and I have been working rather diligently on formalizing a universal persistence API that allows an implementation to support a graduated set of features:

  • Core RDF Store
  • Context-aware RDF Store
  • Notation 3 RDF Store (Formula-aware RDF Store)

It's still a work in progress (mostly with regard to Notation 3 persistence), but at least the interfaces and method signatures needed by a context-aware RDF store are well spec'd out.
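
For a sense of what "graduated" means here, the sketch below shows one hypothetical shape for such an API; the class and method names are illustrative, not the agreed signatures:

class CoreStore:
    # Level 1: minimal triple persistence
    def add(self, triple):
        raise NotImplementedError
    def remove(self, triple):
        raise NotImplementedError
    def triples(self, pattern):
        # Yield stored triples matching an (s, p, o) pattern,
        # where None acts as a wildcard
        raise NotImplementedError

class ContextAwareStore(CoreStore):
    # Level 2: every statement is additionally scoped to a named
    # context, so matches can produce quads rather than triples
    def contexts(self, triple=None):
        # Yield the contexts a triple appears in (or all contexts)
        raise NotImplementedError

class FormulaAwareStore(ContextAwareStore):
    # Level 3: Notation 3 support; quoted (hypothetical) statements
    # are stored without being asserted
    def add_quoted(self, triple, formula):
        raise NotImplementedError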

I won't bore you with the details (and there are plenty - we've covered a lot of ground), but you can dive in here. This driver, which allows 4Suite to use rdflib for persistence of its RDF data, is the first step in an agreed migration that will eventually phase out 4RDF and replace it with rdflib. This module, at the very least:

  • allows for the dispatching of Versa queries on an rdflib Graph
  • is the first step in allowing a 4Suite repository to keep its RDF graphs in an rdflib store (I think there is a lot of synthesis worth exploring with redfoot)
  • provides rdflib with access to a rather voluminous 4RDF test suite
  • demonstrates how existing applications that use 4RDF could be ported to use rdflib instead

Outstanding / Possible Issues:

1) The current rdflib Graph interfaces do not account for RDF reification. This is somewhat covered by support for Notation 3 quoted/hypothetical contexts. The only visible difference is in the test cases that match by statementUri.

2) This driver has only been tested against the MySQL implementation of an rdflib store. This is mainly because it's currently the only rdflib store implementation that supports matching arbitrary triple / statement terms by REGEX patterns and/or producing quads instead of triples (i.e., including the name of each resulting statement's context). This is only an issue for RDF stores that are at least context-aware, and an interface mismatch at most.

I plan to do some more experimentation on the possibilities that this synthesis provides (surprise, surprise). The timing is rather appropriate given the ongoing development of the next generation Versa query specification, the concurrent effort to graduate the 4Suite code base to 1.0, and my recent pleasant surprise regarding Versa, Sparta, and rdflib.

If you are interested in helping or learning more about the roadmap, you can pay #redfoot (on irc.freenode.net) a visit. That's where Daniel Krech and the other rdflib folks have been burning braincells as of late.

Chimezie Ogbuji

via Copia

Redfoot: Updated Documentation

Daniel Krech recently updated the Redfoot homepage with some additional documentation on what Redfoot is. It's a very interesting concept for leveraging Python (or any other scriptable language) and RDF as a distributed framework for applications.

Beyond the known advantages of modelling distributed components on an RDF graph with well defined semantics for how you retrieve programs and execute them, it also relies on a hybrid of XML and Python called Kid to facilitate templating of HTML.

The advantages of using a flexible programming language (such as Python) for manipulating XML are well written about (sift through the Copia archives; you'll find plenty). Couple that with a well modelled framework for including and executing remote modules, as well as programmatic access (using a similar idiom) to an underlying RDF graph, and you have yourself a very flexible foundation.

For example, below is the Kid template used to render the contributors page on rdflib.net:

<div xmlns="http://www.w3.org/1999/xhtml"
 xmlns:kid="http://purl.org/kid/ns#">
    <?python
    FOAF = redfoot.namespace("http://xmlns.com/foaf/0.1/")
    DOAP = redfoot.namespace("http://usefulinc.com/ns/doap#")
    project = URIRef("%s#" % request.host)
    people = []
    seen = set()
    for property in [DOAP.maintainer, DOAP.developer, DOAP.documenter, DOAP.translator, DOAP.tester, DOAP.helper]:
        for person in redfoot.objects(project, property):
            if person not in seen:
                seen.add(person)
                label = redfoot.label(person) or person
                relationships = set()
                for relationship in redfoot.predicates(project, person):
                    relationships.add(redfoot.label(relationship))
                people.append((label, person, relationships))

    people.sort()
    ?>

    <ul>
      <li kid:for="label, person, relationships in people">
        ${label},
        ${redfoot.value(person, FOAF.nick)},
        (${", ".join(relationships)})
      </li>
    </ul>
</div>

Redfoot feels like a hybrid of Narval and the 4Suite repository and represents what is common between the tangential goals of those two projects.

rdflib.net and redfoot.net (as well as some other sites) are examples of applications that run on a Redfoot instance.

[Uche Ogbuji]

via Copia

Corrections to RDF Query Language Comparison

I recently came upon this dated comparison of features in RDF querying languages. It predates SPARQL, and as a result prompted Dan Connolly to attempt to demonstrate how SPARQL fares in this matrix. In reading it, I realized some of the features marked as 'No' under Versa are incorrect. So here is my attempt to demonstrate how the current Versa specification would implement these requirements:

5 Quantification: Return the persons who are authors of all publications

The section name is misleading, as it suggests the use of FOPL semantics to resolve a pattern that can actually be solved without them:

filter(
        'all()',
        'eq(length(difference(all()-dc:creator->*. - dc:creator->*)),0)'
    )

10 Namespace: Return all resources whose namespace starts with "http://www.aifb.uni-karlsruhe.de/".

filter(
        'all()',
        'starts-with("http://www.aifb.uni-karlsruhe.de/")'
    )

14 Entailment: Return all instances that are members of the class Publication.

The current specification says (about rdf:type entailment):

Returns a list of all resources of a specified type, as defined by RDFS and optionally DAML schema specifications ...

So, the implementation has the option to account for entailment rules (as the 4Suite implementation does):

type(resource('#Publication'))

Chimezie Ogbuji

via Copia