Agile Web #1: "Google Sitemaps"

"Google Sitemaps"

Uche Ogbuji's new XML.com column, "Agile Web," explores the intersection of agile programming languages and Web 2.0. In this first installment he examines Google's Sitemaps schema, as well as Python and XSLT code to generate site maps. [Oct. 26, 2005]

With this article, the "Python and XML" column has been replaced by a new one titled "Agile Web".

I wrote the Python-XML column for three years, discussing the combination of an agile programming language with an agile data format. It's time to pull the lens back a bit to take in other such technologies. This new column, "Agile Web," will cover the intersection of dynamic programming languages and Web technologies, particularly the sorts of dynamic developments on the Web for which some use the moniker "Web 2.0". The primary language focus will still be Python, with some ECMAScript. Occasionally there will be some coverage of other dynamic languages as well.

In this first article I introduce the Google Sitemaps program, its XML format, and Python tools for working with it.
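
As a taste of the sort of code this column traffics in, here is a minimal sketch (mine, not the article's) of generating a Sitemaps document with ElementTree. The 0.84 namespace is the one Google's program used at launch; the page data is made up.

# Minimal sketch: emit a Google Sitemaps 0.84 document from
# (URL, last-modified) pairs. Illustrative only; not the article's code.
from xml.etree import ElementTree as ET

SITEMAP_NS = 'http://www.google.com/schemas/sitemap/0.84'

def make_sitemap(pages):
    urlset = ET.Element('{%s}urlset' % SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, '{%s}url' % SITEMAP_NS)
        ET.SubElement(url, '{%s}loc' % SITEMAP_NS).text = loc
        ET.SubElement(url, '{%s}lastmod' % SITEMAP_NS).text = lastmod
    return ET.tostring(urlset, encoding='utf-8')

print(make_sitemap([('http://www.xml.com/', '2005-10-26')]))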

[Uche Ogbuji]

via Copia

Addressing the RDF Scalability Bottleneck

[by Chimezie Ogbuji]

I've been building RDF persistence stores for some time (it's gone from something of a hobby to the primary responsibility in my current work capacity) and have come to the conclusion that RDF stores will almost always be susceptible to the physical limitations of database scalability.

I recall asking one of the presenters at the Semantic Technology Conference this spring what he thought about this problem that all RDF databases face, and why most don't function effectively beyond 5-10 million triples. I liked his answer:

It's an engineering problem

Consider the amount of information an adult has stored (by whatever mechanism the human brain uses to persist information) in his or her noggin. We often take it for granted - as we do all other aspects of biology we know very little about - but it's worth considering when thinking about why scalability is a ubiquitous hurdle for RDF databases.

Some basic Graph theory is relevant to this point:

The size of a graph is the number of edges, and the order of a graph is the number of nodes. RDF is a Resource Description Framework (where what we know about resources is key - not so much the resources themselves), so it's not surprising that RDF graphs will almost always have a much larger size than order. It's also not surprising that most performance analyses across RDF implementations (such as LargeTripleStores, for instance) focus mostly on triple count.

I've been working on a SQL-based persistence schema for RDF content (for rdflib) that is a bit different from the standard approaches taken by most RDBMS implementations of RDF stores I'm familiar with (including those I've written). Each of the tables is prefixed with a SHA1 digest of the identifier associated with the 'localized universe' (a.k.a. the boundary for a closed-world assumption). The schema is below:

CREATE TABLE %s_asserted_statements (
     subject       text not NULL,
     predicate     text not NULL,
     object        text,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,    
     objLanguage   varchar(3),
     objDatatype   text,
     INDEX termComb_index (termComb),    
     INDEX spoc_index (subject(100),predicate(100),object(50),context(50)),
     INDEX poc_index (predicate(100),object(50),context(50)),
     INDEX csp_index (context(50),subject(100),predicate(100)),
     INDEX cp_index (context(50),predicate(100))) TYPE=InnoDB

CREATE TABLE %s_type_statements (
     member        text not NULL,
     klass         text not NULL,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,
     INDEX termComb_index (termComb),
     INDEX memberC_index (member(100),klass(100),context(50)),
     INDEX klassC_index (klass(100),context(50)),
     INDEX c_index (context(10))) TYPE=InnoDB

CREATE TABLE %s_quoted_statements (
     subject       text not NULL,
     predicate     text not NULL,
     object        text,
     context       text not NULL,
     termComb      tinyint unsigned not NULL,
     objLanguage   varchar(3),
     objDatatype   text,
     INDEX termComb_index (termComb),
     INDEX spoc_index (subject(100),predicate(100),object(50),context(50)),
     INDEX poc_index (predicate(100),object(50),context(50)),
     INDEX csp_index (context(50),subject(100),predicate(100)),
     INDEX cp_index (context(50),predicate(100))) TYPE=InnoDB
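
The %s placeholders take the per-universe prefix. A hedged sketch of that table naming might look like the following (hypothetical helper; at the time this would have used the older sha module rather than hashlib).

# Sketch: prefix each store's tables with the SHA1 digest of the
# identifier of its 'localized universe'. Hypothetical helper only.
import hashlib

def table_name(universe_id, suffix):
    prefix = hashlib.sha1(universe_id.encode('utf-8')).hexdigest()
    return '%s_%s' % (prefix, suffix)

# e.g. the asserted-statements table for one store:
print(table_name('http://metacognition.info', 'asserted_statements'))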

The first thing to note is that statements are partitioned into logical groupings:

  • Asserted non-rdf:type statements: where all asserted RDF statements whose predicate isn't rdf:type are stored
  • Asserted rdf:type statements: where all asserted rdf:type statements are stored
  • Quoted statements: where all quoted / hypothetical statements are stored

Statement quoting is a Notation 3 concept and an extension of the RDF model for this purpose. The most significant partition is the rdf:type grouping. The idea is to have class membership modeled at the store level instead of at a level above it. RDF graphs are as different as the applications that use them, but the primary motivating factor for making this separation was the assumption that in most RDF graphs a majority of the statements (or at least a significant portion) would consist of rdf:type statements (statements of class membership).

Class membership can be considered an unstated RDF modeling best practice, since it allows an author to say a lot about a resource simply by associating it with a class whose semantics are completely spelled out in a separate, supporting ontology.

The rdf:type table models class membership explicitly with two columns: klass and member. This results in a savings of 43 characters per rdf:type statement. The implementation takes note of the predicate submitted in a triple-matching pattern and determines which tables to search.
Consider the following triple pattern:

http://metacognition.info ?predicate ?object

The persistence layer would know it needs to check both the table that persists non-rdf:type statements and the class membership table. However, patterns that match against a specific predicate (other than rdf:type), or class-membership queries, only need to check within one partition (or table):

http://metacognition.info rdf:type ?klass
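
A rough sketch of that dispatch decision (hypothetical function, not rdflib's actual code; the quoted-statements partition is left out for simplicity):

# Pick which partition(s) a triple pattern must consult, based on
# the predicate. Sketch only; the real dispatch differs in detail.
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

def tables_for_pattern(predicate, prefix):
    if predicate == RDF_TYPE:
        # Class-membership query: only the rdf:type partition.
        return ['%s_type_statements' % prefix]
    elif predicate is None:
        # Unbound predicate: either partition could match.
        return ['%s_asserted_statements' % prefix,
                '%s_type_statements' % prefix]
    else:
        # Bound, non-rdf:type predicate: only the asserted partition.
        return ['%s_asserted_statements' % prefix]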

In general, I've noticed that being able to partition your SQL search space (searching within a named graph / context, or searching within a single table) goes a long way toward improving query response.

The other thing worth noting is the termComb column, an integer value representing the 40 unique ways the following RDF terms can appear in a triple (one plausible derivation is sketched after the list):

  • URI Ref
  • Blank Node
  • Formula
  • Literal
  • Variable
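
For what it's worth, here is one plausible reconstruction of where the 40 comes from (my reading, not necessarily rdflib's exact scheme): subjects can be anything but a literal, predicates only URI refs or variables, and objects any of the five term types.

# Plausible reconstruction of the 40 term-type combinations; not
# necessarily rdflib's actual termComb encoding.
SUBJECT_TYPES = ['URIRef', 'BNode', 'Formula', 'Variable']
PREDICATE_TYPES = ['URIRef', 'Variable']
OBJECT_TYPES = ['URIRef', 'BNode', 'Formula', 'Literal', 'Variable']

COMBINATIONS = [(s, p, o) for s in SUBJECT_TYPES
                          for p in PREDICATE_TYPES
                          for o in OBJECT_TYPES]
assert len(COMBINATIONS) == 40  # 4 * 2 * 5

def term_comb(s_type, p_type, o_type):
    # Index 0-39, small enough for the tinyint termComb column.
    return COMBINATIONS.index((s_type, p_type, o_type))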

I'm certain there are many other possible optimizations to be made in a SQL schema for RDF triple persistence (there isn't much precedent in this regard - and Oracle has only recently joined the fray), but partitioning rdf:type statements separately is one such thought I've recently had.

[Uche Ogbuji]

via Copia

Peace, Rosa

So anyone vaguely aware of the world around them would have heard that Civil Rights hero Rosa Parks died Monday. Enough superlatives have been lavished on her courage and conviction in providing the spark for MLK's campaign of non-violent civil disobedience.

That's all well, but I can't help it that the main thing that comes to my mind when I contemplate Rosa Parks is that she is also a symbol of how some remaining members of that movement insist on fawning appreciation of their legacy. Getting lectured by the Jesse Jacksons and Bill Cosbys of the world is annoying enough even for those who, like me, freely admit their gratitude towards the Civil Rights movement, but a particularly galling stroke was Mrs. Parks's lawsuit against OutKast for their hit song "Rosa Parks". Watch out for the hook:

Ah ha, hush that fuss
Everybody move to the back of the bus
Do you wanna bump and slump with us
We the type of people make the club get crunk

So the hook was a little irreverent, but besides the title, there was very little in the song connecting to the Parks story. Maybe that was the crime: any song referencing a Civil Rights hero must be a somber appreciation of their struggle (see "A Song for Assata"). Does it really serve the memory of the Civil Rights movement to be so incredibly petty?

But wait. There's a twist, of course. It seems likely that Mrs. Parks wasn't really behind the lawsuit, but rather attorneys and hangers-on who saw the supposed affront to her name as an opportunity to cash in. Some of her family have made comments distancing themselves from the lawsuit. I prefer to believe that Mrs. Parks indeed had nothing to do with the suit, but that still leaves the interesting phenomenon of the disconnect between the generations on this matter.

I think one group of young Black Americans today is happy to enjoy what their forebears fought so hard for in the 60s and get on with their lives and careers. Another group faces many of the hardships caused by inequities in education and other social services, and finds it hard to dwell on the achievements of the 60s considering their present-day realities. Popular Black culture reflects both attitudes. When Cedric the Entertainer's character in Barbershop said "All Rosa Parks did was sit her ass down!", like it or not, he was voicing the irreverent attitude of many young Black Americans. It's not so much that anyone really resents Rosa Parks or any of the other figures skewered in the same scene, but that people tend to laugh perversely when they witness the goring of sacred cows, and in the communities targeted by that movie, there are no sacred cows quite like Civil Rights heroes. And of course the hierophant class reacted exactly on cue, with Jesse Jackson and Al Sharpton promptly launching a noisy boycott of the movie.

Rosa Parks did much more than just sit down where she was not supposed to, and I don't expect this fact will ever truly be forgotten. Her status as a hero is established rather than undermined by the fact that the youth enjoy spraying graffiti on her pedestal, especially when those who are most ostentatiously serious about her person carry the smell of monetary and political self-interest. For my part, the only thing I've got to say with my Krylon is "Peace, Rosa, and thanks".

[Uche Ogbuji]

via Copia

Cop it while it's hot: 4Suite XML 1.0b2

Updated with working link for the manual

We've announced 4Suite XML 1.0b2. It's a big step towards a 1.0 release, even bigger than most of our releases, because what we've done with this one is trim the overall 4Suite package to a sensible size and scope. This release contains only the XML processing core and some support libraries. It does not contain the RDF libraries and the repository. This does not mean those components are stranded (see, for example, the rdflib/4RDF merger effort for a sense of the new juice being fed into 4Suite/RDF). It's just that the core XML libraries are so much more mature than the other components, and so much more widely used, that it made no sense not to set them free and let them quickly march to 1.0 on their own. This release features some serious performance improvements, some simplified APIs for a gentler user learning curve, and a lot of fixes and other improvements (see the announcement for the full, long list).

In fact, the code we released is just about 1.0 in itself, as far as the XML component goes. A code freeze is in place, and we'll focus on fixing bugs and wrapping up the user manual effort. (BTW, if you'd like to chip in on the manual, please say so on the 4Suite-dev mailing list; there is a lot of material in place, and what we need is mostly in the way of editing and improving details.) Our plan is to get the XML core to 1.0 more quickly than we would have been able to before breaking 4Suite into components, and then we can focus on RDF and the repository. 4Suite/RDF will probably disappear into the new rdflib, and the repository will probably go through heavy refactoring and simplification.

Today, after some day-job tasks, my priority will be getting Amara 1.1.5 out. It's been largely ready and waiting for the 4Suite release. Some really sweet improvements and additions in this Amara release (though I do say so myself). More on that later.

[Uche Ogbuji]

via Copia

Fancy Web

OK, so in the past week Dare has flipped the bozo bit on the "Web 2.0" term. Chime has put the term in front of the firing squad. I get the point, already. I've always thought that it's a silly term, but beyond the hype I do think it's useful to have a term for all those emerging aspects of the Web that we did not see a lot of, say, three years ago. Let's be real, a lot about the Web is changing, not necessarily all for the better, but the phenomenon does still need to be described. So for my part, I'll ditch the "Web 2.0" moniker and adopt something more appropriately tongue-in-cheek—"Fancy Web". (I have to stop myself from extrapolating to "Antsy Web", "Chancy Web", "Dancy Web" or the not-very-P.C. "Nancy Web").

BTW, the first person to suggest that it should instead be "FancyWeb" gets the gas-face (n/m).

[Uche Ogbuji]

via Copia

The Buzzword Firing Squad

If buzzword proliferation were a punishable crime, the penitentiaries would be full of software developers and blog authors.

Below is my list of buzzwords, catch-phrases, and technologies that need to be summarily executed without a lengthy trial:

  • Web 2.0 (do I even need a reason?)
  • AJAX (ahem.. The idea of asynchronous HTTP requests for XML content is core to XForms and better architected in that context)
  • SOA / Web Services (90% of the time people use this term they are referring specifically to SOAP-based remote procedure invocation)
  • RDF/XML Syntax (This one has done more damage to RDF advocacy than any other)
  • Semantic (This term is so thoroughly abused that it would be par for the course to read of a cron job referred to as a 'semantic' process.)
  • "Yes, that's very nice and all, but does it scale?" !*&%#@@$!!!
  • Ontology: I've found it easiest to think of an Ontology as a Taxonomy (or polyhierarchy) with a minimal number of logical constraints. Without logical constraints, there is nothing 'ontological' about a model that could easily be represented in an E-R diagram.

I'm certain this is only about 2% of the full list; I'll be sure to add to it as more come to mind.

[Uche Ogbuji]

via Copia

500 Web feed readers, and none dead on

I started out using Straw for reading Web feeds. I found it a bit cumbersome in terms of death by keystroke, having to tap through each entry in each feed. I figured there had to be a better way. Narval (whatever happened to Narval, anyway?) was combining news sources into a virtual newspaper five years ago. I found something of the sort in Lektora. I quickly found an assortment of annoyances, though, and I now use a combo of Straw and Lektora. I see that Lektora still barely supports Linux (they have an October 13th release for Windows, June 7th for Mac, and the same old March 18th for Linux that I downloaded earlier this year), so it's time to move on.

Parand Tony Darugar suggested Bloglines, and I had a look. In UI, it's just like Lektora, but implemented over the Web rather than as a browser extension. I think it would be perfect except that the over-the-Web functionality makes it rather slow. I've read of a lot of folks who started on Bloglines and then jumped to Firefox's Sage extension. I'd tried it before and found it to be just Straw in the Web browser, and when I checked again, it's still just that. I'd rather not go back to death by keystroke or mouse click. OK, to be fair, Sage is much less clicky than Straw. It's actually quite clever in how it lays out all the entries for each feed. I just wish it could do that for groups of feeds rather than individual ones.

So it looks as if I'm headed for Bloglines, but first of all I'd like to throw out a lazy-Web check for any other suggestions. I'm up for any browser-based or Linux tool. I'm willing to pay (within reason) for a really solid tool. I prefer the newspaper-like format (if you couldn't tell), where I can just group my Web feeds and then open each group together and mostly just have to scroll down to read all updated entries in that group. I have scrolled through the bewildering array in the RSS Compendium and tried some of them, but the ones I tried didn't really impress me.

Any ideas, friends? TIA.

[Uche Ogbuji]

via Copia

Processing "Web 2.0" using XSLT document() variants? No thanks.

Mark Nottingham has written an intriguing piece "XSLT for the Rest of the Web". It's drummed up some interest, some of which has even leaked into the 4Suite mailing list thanks to the energetic Sylvain Hellegouarch. Mark says:

I’ve raved before about how useful the XSLT document() function is, once you get used to it. However, the stars have to be aligned just so to use it; the Web site can’t use cookies for anything important, and the content you’re interested in has to be available in well-formed XML.

He goes on to present a set of extension functions he's created for libxslt. They are basically smarter document() functions that can do fancy Web things, including HTTP POST, and using HTML Tidy to grab tag soup HTML as XHTML.

As I read through it, I must say my strong impression was "been there, done that, probably never looking back". Certainly no diss of Mark intended there. He's one of the sharper hackers I know. I guess we're just at different points in our thinking of where XSLT fits into the Web-savvy apps toolkit.

First of all, I think the Web has more dragons than you could easily tame with even the mightiest XSLT extension hackery. I think you need a general-purpose programming language to wrangle "Web 2.0" without drowning in tears.

More importantly, if I ever needed XSLT's document() function to process anything more than it's spec'ed to, I would consider that a pretty strong indicator that it's time to rethink part of my application architecture.

You see, I used to be a devotee of XSLT all over the place, and XSLT extensions for just about every limitation of the language. Heck, I wrote a whole framework of such things into 4Suite Repository. I've since reformed. These days I take the pipeline approach to such processing, and I keep XSLT firmly in the narrow niche for which it was designed. I have more on this evolution of thinking in "Lifting XSLT into application domain with extension functions?".

But back to Mark's idea. I actually implemented 4Suite XSLT extensions to use HTTP POST and to tidy tag soup HTML into XHTML, but I wouldn't dream of using these extensions any more. Nowadays, I use Python to gather and prepare data into a model representation that I then hand over to XSLT for pure presentation processing. Complex logical tasks, such as accessing Web data beyond trivially fetched XML, are matters for the model layer, not the presentation logic. For example, if I need to tidy something, I tidy it at the Python level and put what I need of the resulting XHTML into the model XML before passing it to XSLT. I use Amara XML Toolkit with John Cowan's TagSoup for my tidying needs. I prefer TagSoup to tidy because I find it faster and more robust.
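
To make the split concrete, here is a rough sketch of the shape such code takes (hypothetical URLs and stylesheet; 4Suite API from memory, so treat it as illustrative rather than definitive):

# Sketch of the pipeline split: Python gathers and prepares the model;
# XSLT sees only the finished model document. Python 2 era code.
import urllib2
from xml.sax.saxutils import escape
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor

# 1. Model layer: fetch and prepare data in Python. Tidying, HTTP POST
#    and other Web wrangling belong here, not in the stylesheet.
raw = urllib2.urlopen('http://example.org/data.txt').read()
model_xml = '<model><datum>%s</datum></model>' % escape(raw.strip())

# 2. Presentation layer: hand the model document to XSLT.
processor = Processor.Processor()
processor.appendStylesheet(
    InputSource.DefaultFactory.fromUri('file:present.xslt'))
print(processor.run(
    InputSource.DefaultFactory.fromString(model_xml, 'urn:hypothetical:model')))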

Even if you use the libxml2 family of tools, I still think it's better to use libxml2 itself, and perhaps its HTML parser, to do the model processing, and then hand the resulting XML to libxslt in a separate step.

XSLT is pretty cool, but these days rather than reproduce all of Python's dozens of Web processing libraries therein, I plump for Python itself.

[Uche Ogbuji]

via Copia