“Mix and match Web components with Python WSGI”

Subtitle: Learn about the Python standard for building Web applications with maximum flexibility
Synopsis: Learn to create and reuse components in your Web server using Python. The Python community created the Web Server Gateway Interface (WSGI), a standard for creating Python Web components that work across servers and frameworks. It provides a way to develop Web applications that take advantage of the many strengths of different Web tools. This article introduces WSGI and shows how to develop components that contribute to well-designed Web applications.

Despite the ripples in the Python community over Guido's endorsement of Django (more on this in a later posting), I'm not the least bit interested in any one Python Web framework any more. WSGI has set me free. WSGI is brilliant. It's certainly flawed, largely because of legacy requirements, but the fact that it's so good despite those flaws is amazing.

I wrote this article because I think too many introductions to WSGI, and especially to middleware, are either too simple or too complicated. In line with my usual writing philosophy, asking what I could have read when I started out that would have made the topic clearer to me, I've tried to provide a sharp illustration of the WSGI model and a few clear, practical examples. The articles that were too simple glossed over nuances that I think should really be grasped from the beginning (and that are not that intimidating). In the too-complicated corner is primarily PEP 333 itself, which is fairly well written but too rigorous for an intro. In addition, I think the example of WSGI middleware in the PEP is very poor. I'm quite proud of the example I crafted for this article, and I hope it encourages more people to create middleware.
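
By way of illustration, here is a minimal sketch of the WSGI model (this is not the middleware example from the article itself; the names and the upper-casing behavior are made up purely for illustration, and wsgiref only ships with the standard library from Python 2.5 on): a trivial application, plus one piece of middleware that wraps it.

def application(environ, start_response):
    # The simplest possible WSGI application
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Hello, WSGI\n']

class UppercaseMiddleware(object):
    # Middleware is itself a WSGI application that wraps another one
    def __init__(self, wrapped):
        self.wrapped = wrapped
    def __call__(self, environ, start_response):
        # Delegate to the wrapped application and post-process its output
        return [chunk.upper() for chunk in self.wrapped(environ, start_response)]

if __name__ == '__main__':
    from wsgiref.simple_server import make_server  # standard library from Python 2.5
    make_server('', 8080, UppercaseMiddleware(application)).serve_forever()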

I do want to put in a good word for Ian Bicking and Paste. He has put in tireless effort to evangelize WSGI (it was his patient discussion that won me over to WSGI). In his Paste toolkit, he's turned WSGI's theoretical strengths into readily-available code. On the first project I undertook using a Paste-based framework (Pylons), I was amazed at my productivity, even considering that I'm used to productive programming in Python. The experience certainly left me wondering why, BDFL or no BDFL, I would choose a huge mega-framework over a loosely-coupled system of rich components.

[Uche Ogbuji]

via Copia

XML Universal names (namespaces): To fuse or not to fuse

I ran into Ken MacLeod on the Atom IRC channel today. Actually, I think I've chatted with him before, but I didn't know the nick I was responding to was Ken's. Certainly a fortuitous discovery, but more importantly, Ken drew my attention to an old Weblog posting I'd somehow missed. In it he makes two separate points that I think run together in a perhaps confusing way. First, he advocates that XML APIs strictly treat a node's universal (i.e. namespace-qualified) name as a tightly bound unit. I (arbitrarily) call this fusing the universal name. An example is APIs such as ElementTree that use James Clark notation for namespaces.
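
To make the fused-name style concrete, here is a small, throwaway snippet (the namespace URI is made up, and the import path depends on how ElementTree is installed): ElementTree spells a universal name as a single string in James Clark notation, so you match and read it as one unit.

from elementtree import ElementTree  # xml.etree.ElementTree in later Pythons

doc = ElementTree.fromstring(
    '<doc xmlns:eg="http://example.com/ns"><eg:item>spam</eg:item></doc>')
# The universal name is one fused string in James Clark notation
item = doc.find('{http://example.com/ns}item')
print item.tag   # {http://example.com/ns}item
print item.text  # spam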

The second point is that some XML APIs have really ungainly syntax for handling namespaces, with SAX and DOM being the worst offenders (we both agree that the Java/IDL heritage of these APIs is the worst problem).

To take the first point first, I disagree that APIs based on fused names are superior. Yes, an XML universal name should conceptually be a unit, but in practice it is not, and people often have a real need to work separately with either piece of the tuple. I used the analogy of complex numbers in the IRC discussion. The complex number 3 + 2j is a single number, but there is nothing wrong with an API's making it easy for a user to manipulate its real and imaginary parts. It's up to the developer not to somehow abuse this flexibility.
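
In Python terms, the analogy is as simple as this (a throwaway illustration):

c = 3 + 2j
print c.real, c.imag  # 3.0 2.0 -- one value, but the parts remain accessible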

The second point is well taken, but I strongly believe that it is not the lack of fused names that makes an API poor. SAX (SAX 2, to be precise) is poor because of the redundancy and odd structural conventions in reporting information such as prefixes. DOM (DOM Level 2, to be precise) is poor because of the bewildering decision to maintain namespace declarations as separate attribute objects in addition to the redundant information offered as node object properties. Both owe much of their weakness to concerns for backwards compatibility with non-namespace-aware APIs.

For the most part APIs that were born and bred in the era of namespaces are much less tortured, regardless of how tightly or loosely they expose the parts of a universal name. If anything, I believe that it's better for the API to make it easy for the developer to separate namespace name, local name and prefix. Yes, even prefix, because the golden world in which prefixes are irrelevant does not exist. It was ruined by the advent of QNames in content, or "hidden namespaces". Yes, this happens to be one negative side-effect of the success of XSLT and XPath, which were great specs for the most part, but also represented the first real triumph of the hidden namespaces idea that has left such a mess in its wake.

Ken ended his Weblog post with a proposed notation for fused names, which is just like one I had mulled over and discarded for Amara. He also showed me a derived convention for his Orchard software, which I didn't know was still in development (it looks very interesting). That convention looks just like the mapping accessors I somewhat grudgingly added to Amara in February. I'm still open to changing this until Amara 1.2, so I'll give the whole matter some thought (including the Orchard approach). Feedback is welcome. The fact that Ken and I independently came up with such a series of similar ideas makes me think we're on the right track.

I do want to make sure it's clear that giving the user a better API for namespaces is not bound to insistence on fused names.

[Uche Ogbuji]

via Copia

Why Ruby doesn't interest me

Kudos to Ruby boosters for building awareness of the language. It's now truly inescapable. That's undoubtedly a good thing: the more languages-that-don't-suck people are aware of, the better. For my part, though, I've never been the slightest bit interested. I've also not really had any occasion to express my non-interest. I have had colleagues push me hard to look at Ruby, and once or twice I've done so. I've found that it's definitely a nice language, but not a big enough improvement over Python to be worth the effort, and just annoying enough in some ways to discourage me from making any extra effort to soak it in.

Today I ran across a Weblog posting with many points that echoed my attitude and I figured, what the hell, that's good for a "co-sign".

In "Why I Like Python more than Ruby" Mark Ramm writes:

Don’t get me wrong, I like Ruby. And it’s not particularly difficult to read. But the philosophy of the language designers led to design choices that emphasize writability over readability. And in that department I think the advantage has to go to Python. Python lists are easy to use, but more importantly I understood all of the list methods and how to use them in the matter of a few minutes. Perhaps Ruby’s arrays are more powerful than Python lists, but so far I’ve yet to find something that can be done in Ruby that can’t be done easily in Python.

As I think about it, even the things people complain about in Python, like the explicit self or significant whitespace, are designed with readability in mind. [emphasis mine]

I think possibly the only Python complaint that really resonates with me is the raggedness of its scoping. Nested scopes and closures help, but there are still patchy spots (advantage Ruby, I hear, thanks to blocks). I'm sure I'm forgetting other Python annoyances, but oft-cited complaints such as the GIL, Unicode API, explicit self and significant whitespace don't bother me one bit.
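
A quick illustration of the kind of patchiness I mean (a toy example, nothing more): a nested function can read a name from the enclosing scope, but it can't rebind it, so you end up reaching for a mutable container.

def make_counter():
    count = [0]                # a one-element list stands in for a rebindable enclosing name
    def incr():
        # 'count = count + 1' here would raise UnboundLocalError;
        # closures can read enclosing names but not rebind them
        count[0] += 1
        return count[0]
    return incr

counter = make_counter()
print counter(), counter()     # 1 2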

In comments Karl Guertin says:

In order to get me to switch languages, I have to get some sort of big advantage to make up for the loss of expertise. As a long time Python programmer, I’ve never felt compelled to use Ruby. I’ve read through why’s guide and poked at Rails, but in the end I don’t come up with enough to merit a switch. I’m sure a number of Ruby programmers are in the same boat for Python, and I think that’s fine.

The languages I’m currently investigating (Concurrent Haskell, Erlang, Clean) provide strong concurrency support, which I believe is the next frontier of programming due to the upcoming multi-core machines. In the Python community, Stackless provides concurrency primitives though I don’t think it can take advantage of multiple cores. Is there a concurrency effort in Ruby? All the Ruby news I hear gets drowned out by Rails.

Yeah. Rails seems great as far as it goes, but I'm not even much for Web frameworks in Python, let alone deciding to switch language just to use a framework. I'd be just as likely to start loving Java because Eclipse is neat.

I'm by no means wedded to Python, but if I do make the switch away, it would be to a truly fresh language, not just an incremental change of view.

[Uche Ogbuji]

via Copia

Updating Metacognition Software Stack

Metacognition was down for some time as I updated the software stack that it runs on (namely 4Suite and RDFLib). There were some core changes:

  • The 4Suite repository was reinitialized using the more recent persistence drivers.
  • I added a separate section for archived publications (I use Google Reader's label service for Copia archives).
  • I switched to using RDFLib as the primary RDF store (using a recently written mapping between N3/FOL and a highly efficient SQL schema) and the filesystem for everything else.
  • I added the SIOC ontology to the core index of ontologies.
  • I updated my FOAF graph and Emeka's DOAP graph.

Earlier, I wrote up the mapping that RDFLib's FOPLRelationalModel is based on in a formal notation (using MathML, so it will only be viewable in a browser that supports it, such as Firefox).

The site is now much more responsive and better organized. Generally, I hope for it to serve two purposes: 1) to organize my thoughts on, and software related to, applying Semantic Web technologies to 'closed' content management systems; and 2) to serve as a (Markdown-powered) whiteboard and playground for tools and demos advocating best practices in problem solving with these technologies.

Below is a marked-up diagram of some of these ideas.

Metacognition-Roadmap

The publications are stored in a single XML file and are rendered (at run time, on the server) using a pre-compiled XSLT stylesheet against a cached Domlette. Internally, the document is mapped into RDF persistence using an XSLT document definition associated with the document, so all modifications are synced into an RDF/XML equivalent.

Mostly as an academic exercise - since the 4Suite repository (currently) doesn't support document definitions that output N3 and the content-type of XSLT transform responses is limited to HTML/XML/Text - I wrote an equivalent publications-to-n3.xslt. The output is here.

Chimezie Ogbuji

via Copia

SPARQL BisonGen Parser Checked in to RDFLib

[by Chimezie Ogbuji]

This is basically an echo of my recent post to the rdflib mailing list (yes, we have one now).

I just checked in the most recent version of what had been an experimental BisonGen SPARQL parser for RDFLib. It parses a SPARQL query into a set of Python objects representing the components of the grammar.

The parser itself is a Python/C extension (but the BisonGen grammar could be extended to incorporate Java callbacks instead), so the setup.py had to be modified in order to compile it into a Python module. The BisonGen files themselves are:

  • SPARQL.bgen (the main file that includes the others)
  • SPARQLTurtleSuperSet.bgen.frag (the second part of the grammar which focuses on the variant of Turtle that SPARQL uses)
  • SPARQLTokens.bgen.frag (Token definitions)
  • SPARQLLiteralLexerPatterns.bgen.frag (Keywords and 'literal' tokens)
  • SPARQLLexerPatterns.bgen.frag (lexer definition)
  • SPARQLLexerDefines.bgen.frag (the lexer patterns themselves)
  • SPARQLParser.c (the generated parser)
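
Regarding the setup.py modification mentioned above, the addition is roughly of the following shape. This is a hypothetical sketch only; the actual extension name and source paths in rdflib may differ.

# Hypothetical sketch: compiling the generated parser as a C extension.
# The extension name and source path here are illustrative, not rdflib's actual ones.
from distutils.core import setup, Extension

setup(
    name='rdflib',
    ext_modules=[
        Extension('rdflib.sparql.bison.SPARQLParserc',
                  sources=['rdflib/sparql/bison/SPARQLParser.c']),
    ],
)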

Theoretically, the second part of the grammar, dedicated to the Turtle syntax, could be broken out into separate Turtle/N3 parsers that could be built into RDFLib, removing the current dependency on n3p.

I also checked in a test harness that's meant to work with the DAWG test cases.

I'm currently stuck on this particular test case, but I'm working through it. The majority of the grammar is supported, except for mathematical expressions and certain case-insensitive variations on the SPARQL operators.

The test harness only checks for parsing; it doesn't evaluate the parsed query against the corresponding set of test data, but it can easily be extended to do so. I'm not sure about the state of those test cases: some have been 'accepted' and some haven't. In addition, I came across a couple that were illegal according to the most recent SPARQL grammar (the bad tests are noted in the test harness). Currently the parser is stand-alone; it doesn't invoke sparql-p, for a few reasons:

  • I wanted to get it through parsing the queries in the test cases
  • Our integrated version of sparql-p is outdated; Ivan has been working on a more recent version with improvements that should probably be considered for integration
  • Some of the more complex combinations of Graph Patterns don't seem solvable without reworking / extending the expansion tree solver. I have some ideas about how this could be done (to handle things like nested UNIONs and OPTIONALs) but wanted to get a working parser in first

Using the parser is simple:

from rdflib.sparql.bison import Parse

query = "SELECT ?s WHERE { ?s ?p ?o }"  # any SPARQL query string (example only)
p = Parse(query, False)                 # second argument is the DEBUG flag
print p

p is an instance of rdflib.sparql.bison.Query.Query

Most of the parsed objects implement a __repr__ function which prints a 'meaningful' representation recursively down the hierarchy to the lower level objects, so tracing how each __repr__ method is implemented is a good way to determine how to deconstruct the parsed SPARQL query object.
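
As a rough sketch of the pattern being described (these are hypothetical classes for illustration, not the actual rdflib.sparql.bison objects), each node's __repr__ folds in its children's, so printing the top-level object traces the whole parse:

# Hypothetical sketch of recursive __repr__ objects (not rdflib's actual classes)
class TriplePattern(object):
    def __init__(self, s, p, o):
        self.s, self.p, self.o = s, p, o
    def __repr__(self):
        return "TriplePattern(%r, %r, %r)" % (self.s, self.p, self.o)

class GroupGraphPattern(object):
    def __init__(self, patterns):
        self.patterns = patterns
    def __repr__(self):
        # Recursion: each child's __repr__ is folded into the parent's
        return "GroupGraphPattern(%r)" % (self.patterns,)

print GroupGraphPattern([TriplePattern('?s', '?p', '?o')])
# GroupGraphPattern([TriplePattern('?s', '?p', '?o')])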

The __repr__ methods could probably be rewritten to echo the SPARQL query right back, as a way to:

  • Test round-tripping of SPARQL queries
  • Create SPARQL queries by instantiating the rdflib.sparql.bison.* objects and converting them to strings

It's still a work in progress, but I think it's far enough through the test cases that it can handle most of the more common syntax.

Working with BisonGen was a good experience for me, as I hadn't done any real work with parser generators since my days at the University of Illinois (class of '99). There are plenty of good references online for the Flex pattern format as well as for Bison itself. I also got some good pointers from AndyS and EricP on #swig.

It was also an excellent way to get familiar with the SPARQL syntax from top to bottom, since every nuance of the grammar that may not be evident from the specification had to be addressed. The work also generated some comments on inconsistencies in the specification grammar, which I've since redirected to public-rdf-dawg-comments.

Chimezie Ogbuji

via Copia

Cleveland Clinic Job Posting for Data Warehouse Specialist

The Cleveland Clinic Foundation has recently posted an opening for a mid-level database developer with experience in semi-structured data binding, transformation, modeling, and querying. I happen to know that the following are big pluses:

  • Python
  • XML data binding
  • XML / RDF querying (SPARQL,XPath,XQuery,etc..)
  • XML / RDF modelling
  • Programming database connections
  • *NIX System administration
  • Web application frameworks

You can follow the above link to the posting to apply.

Chimezie Ogbuji

via Copia

"Tip: Remove sensitive content from your XML samples with XSLT"

"Tip: Remove sensitive content from your XML samples with XSLT"

Do you need to share samples of your XML code, but can't disclose the data? For example, you might need to post a sample of your XML code with a question to get some advice with a problem. In this tip, Uche Ogbuji shows how to use XSLT to remove sensitive content and retain the basic XML structure.

I limited this article to erasing rather than obfuscating sensitive content, which can be done with XSLT 1.0 alone. With EXSLT (or even XSLT 2.0) you can do some degree of obfuscation, possibly allowing you to preserve elements of character data that are important to the problem under discussion. Honestly, though, I prefer to solve this problem with even more flexible tools. As a bonus, the following is a bit of 4Suite/SAX code that uses a SAX filter to obfuscate character data by adding a random shift to the ordinal of each character in the Unicode alphanumeric class. That way, if exotic characters were part of the problem you're demonstrating, they'd be left alone. It's easy to use the code as a template; usually all you have to change is the obfuscate function or the obfuscate_filter class in order to fine-tune the workings.

import re
import random
from xml.sax import make_parser, saxutils
from Ft.Xml import CreateInputSource, Sax

# Maximum random shift (in either direction) applied to each character's ordinal
RANDOM_AMP = 15
# Matches any Unicode alphanumeric ("word") character
ALPHANUM_PAT = re.compile(r'\w', re.UNICODE)

def obfuscate(old):
    """Shift each alphanumeric character by a random offset."""
    def mutate(c):
        return unichr(ord(c.group())+random.randint(-RANDOM_AMP,RANDOM_AMP))
    return ALPHANUM_PAT.subn(mutate, old)[0]

class obfuscate_filter(saxutils.XMLFilterBase):
    # SAX filter that obfuscates character data as it passes through
    def characters(self, content):
        saxutils.XMLFilterBase.characters(self, obfuscate(content))
        return

if __name__ == "__main__":
    XML = "http://cvs.4suite.org/viewcvs/*checkout*/Amara/demo/labels1.xml"
    parser = make_parser(['Ft.Xml.Sax'])        # use 4Suite's SAX driver
    filtered_parser = obfuscate_filter(parser)  # wrap it in the obfuscating filter
    handler = Sax.SaxPrinter()                  # echo the filtered events back out as XML
    filtered_parser.setContentHandler(handler)
    filtered_parser.parse(CreateInputSource(XML))

This code uses recent fixes and capabilities I checked into 4Suite CVS last week. I think all the needed details to understand the code are in the SAX section of the updated 4Suite docs, which John Clark has already posted.

[Uche Ogbuji]

via Copia

Four Mozilla/XML bugs to vote on (or to help with)

In a recent conversation with colleagues, some of the limitations of XML processing in Mozilla came up. I think some of these are really holding Mozilla and Firefox back from being a great platform for XML processing, so I wanted to highlight them here. Remember that the key to bringing attention to an important bug or request is to vote for it in the tracker, so please consider doing so. I already have.

18333: "XML Content Sink should be incremental". The description says it all:

Large XML documents, such as the W3C's XSLT spec, take an incredibly long time to load into view source. The browser freezes/blocks (is "not responding" according to Windows) while it processes, and finally unlocks after the entire source of the document is ready for display.

Firefox will never really be a friendly platform for XML processing until this is addressed. There is not really a problem addressing this using Mozilla's underlying parser, Expat. In the worst case, one could use that parser's suspend/resume facility (we recently took advantage of this to allow Python-generator-based access to 4Suite Saxlette parsing). The real issue is the amount of work that would need to be done across the Mozilla code base. Unfortunately, Mozilla insiders have been predicting a fix for this problem for a while, and unless there's a sudden boost in votes or, better yet, resources to help fix the problem, I'm not feeling very optimistic.

69799: "External entities are not included in XML document". Using Betty Harvey's example,

<!DOCTYPE myXML[
<!ENTITY extFile SYSTEM "extFile.xml">
]>
<myXML>&extFile;</myXML>

is rendered as if Mozilla had read

<myXML></myXML>

Of course you have to watch out for XSS-type attacks, but I imagine Mozilla could handle this the same way it does loaded stylesheets: by restricting external entities to the same host domain as the document entity.

193678: "support exslt:common". The node-set extension function is pretty much required for complex XSLT processing, so support from Mozilla would really help open up the landscape of what you can do with XSLT in the browser.

98413: "Implement XML Catalogs". A request to implement OASIS Open XML Catalogs. This could do a lot to encourage support for external entities because some performance problems could be reduced by using a catalog to load a local version of the resource.

A few on my personal would-be-nice-to-fix-but-not-essential list are:

See also:

[Uche Ogbuji]

via Copia

Schematron creeping on the come-up (again)

Schematron is the boss XML schema language your boss has never heard of. Unfortunately it's had some slow times recently, but it has surged back with a vengeance thanks to honcho Rick Jelliffe, with logistical support from Betty Harvey. There's now a working mailing list and a Wiki. Rick says that Schematron is slated to become an ISO Standard next month.

The text for the Final Draft International Standard for Schematron has now been approved by multi-national voting. It is copyright ISO, but it is basically identical to the draft at www.schematron.com.

The standard is 30 pages. 21 are normative, including schema listings and a characterization of Schematron semantics in predicate logic. Appendixes show how to use other query language bindings (than XSLT1), how to use Schematron as a vocabulary, how to express multi-lingual diagnostics, and a simple description of the design requirements for ISO Schematron.

Congrats to Rick. Here's to the most important schema language of them all (yes, I do mean that). I guess I'll have to check Scimitar, Amara's fast Schematron processor, for compliance with the updated draft standard.

[Uche Ogbuji]

via Copia

Mystery of Google index drop solved?

Update. Corrected Christian's surname. Sorry, man.

A while ago I complained that uche.ogbuji.net disappeared from Google search results soon after I went to a CherryPy-based server. I'm here to say that I'm a goof, but I hope that admitting my silly error might save someone else some head-scratching (maybe even this gentleman).

I'm at least not alone in my error. The clue came from this message by the very smart Christian Wyglendowski. In my case I was getting 404s for most things, but I did have a bug that was causing a 500 error on requests to robots.txt. Apparently the Google bot shuns sites with that problem. I can understand that, but it's interesting that Yahoo doesn't seem to do the same thing, since my ranking didn't drop much there. I fixed the bug and then submitted a reinclusion request to Google following the suggestions in this article (I guess SEO advice isn't a completely parasitic endeavor). The body of my message was as follows:

I had a bug causing 500 error on robots.txt request, and I think that's why I got dropped from your index. I've fixed that bug, and would like to request reinclusion to your index. Thanks.

We'll see if that does the trick.

[Uche Ogbuji]

via Copia