Copia

Triclops gets a facelift, new query management capabilities, and new APIs

by Chimezie Ogbuji

I recently had a need to manage a set of queries against an OWL2 EL biomedical ontology: the Foundational Model of Anatomy. I have an open source SPARQL service implementation that I had some thoughts about extending with support for managing queries. It’s called Triclops and is part of a collection of RDF libraries and tools I have been accumulating. The name is a reference to an initial attempt to build an RDF querying and navigation interface as part of the 4Suite repository back in the day (circa 2002).

This later evolved to a very rudimentary web interface that sat in front of the Oracle 11g and MySQL/SPARQL patient dataset that Cyc’s SKSI interacted with. This was part of an interface tailored to the task of identifying patient cohorts, known as the Semantic Research Assistant (SRA). A user could dispatch handwritten SPARQL queries, browse clickable results, or return them as CSV files. This capability was only used by informaticians familiar with the structure of the RDF dataset and most investigators used the SRA.

It also implemented a RESTful protocol for ticket-based querying that was used for stopping long-running SPARQL/MySQL queries. This is not currently documented. Around the time this was committed as an Apache-licensed Google Code library, layercake-python added core support for APIs that treated remote SPARQL services as local Graph objects, as well as general support for connecting to SPARQL services. This was based on Ivan Herman’s excellent SPARQL Endpoint interface to Python.

Triclops (as described in the wiki) can now be configured as a “Proxy SPARQL Endpoint”. It can be deployed as a light-weight query dispatch, management, and mediation kiosk for remote and local RDF datasets. The former capability (dispatching) was already in place; the latter (mediation) can be performed using FuXi’s recent capabilities in this regard.

Specifically, FuXi includes an rdflib Store that uses its sideways-information passing (sip) strategies and the in-memory SPARQL algebra implementation as a general-purpose framework for semweb SPARQL (OWL2-RL/RIF/N3) entailment regimes. Queries are mediated over the SPARQL protocol using global schemas captured as various kinds of semweb ontology artifacts (expressed in a simple Horn form). These artifacts describe their predicates and distinguish those instantiated in a database (or factbase) from those derived via the semantic properties of the artifacts.

So the primary capability that remained was query management, and this recent itch was scratched over the holidays. I discovered that CodeMirror, a JavaScript library that can be used to create a relatively powerful editor interface for code, had excellent support for SPARQL. I integrated it into Triclops as an interface for managing SPARQL queries and their results. I have a running version of this at http://metacognition.info/sparql/queryMgr. Note that the service is liable to break at any point, as Webfaction kills off processes that use up a lot of CPU and I have yet to figure out how to configure it to restart the service when it dies in such a fashion.

The dataset this interface manages queries for is a semantic web of content comprising 3 of the primary ancient Chinese classical texts (the Analects, Doctrine of the Mean, and the Tao Te Ching). I record the information in RDF because it is an intuitive knowledge representation to use in capturing provenance, exposition, and other editorial metadata. Below is a screen shot of the main page listing a handful of queries with their name, date of last modification, date of last run, and number of solutions in the most recent result.

Main SPARQL service page

Above the list is a syntax-highlighted text area for dispatching ad hoc SPARQL queries. This is where CodeMirror is integrated. If I click on the name of the query titled “Query for Analects and the Doctrine of the Mean english chapter text (Confucius)”, I go to a similar screen with another text area whose content corresponds to the text of the query (see the screen shot below).

Main SPARQL service page

From here queries can be updated (by submitting updated CodeMirror content) or cloned (using the name field for the new copy). Alternatively, the results of previous queries can be rendered. This sends back a result document with an XSLT processing instruction that causes the browser to trigger a request for a stylesheet and render an XHTML document from content in the result document on the client side.
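
As a rough sketch of that mechanism (not the actual Triclops code), the server only has to place an xml-stylesheet processing instruction ahead of the root element of the SPARQL XML result. The function name and the stylesheet path below are made up for illustration.

```python
def with_stylesheet(result_xml, stylesheet_href):
    """Return result_xml with an xml-stylesheet processing instruction added.

    Hypothetical helper: the name and the href used below are illustrative.
    """
    pi = '<?xml-stylesheet type="text/xsl" href="%s"?>\n' % stylesheet_href
    if result_xml.startswith("<?xml "):
        # Keep the XML declaration first; the instruction must come after it
        decl_end = result_xml.index("?>") + 2
        return result_xml[:decl_end] + "\n" + pi + result_xml[decl_end:].lstrip()
    return pi + result_xml


result = ('<?xml version="1.0"?>\n'
          '<sparql xmlns="http://www.w3.org/2005/sparql-results#">'
          '<head/><results/></sparql>')
print(with_stylesheet(result, "/xslt/results-to-xhtml.xsl"))
```

A browser that receives such a document fetches the referenced stylesheet and renders the transformed output, so the server never has to build the XHTML itself.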

Finally, a query can be re-executed against a dataset, saving the results and causing the information in the first screen to show different values for the last execution run (date and number of solutions). Results can also be saved or viewed as CSV using a different stylesheet against the result document.

The last capability added is a rudimentary template system where any variable in the query or text string of the form ‘$ …. $’ is replaced with a provided string or a URI. So, I can change the pick list value on the second row of the form controls to $searchExpression$ and type “water”. This produces a SPARQL query (visible with syntax highlighting via CodeMirror) that can be used as a template to dispatch queries against the dataset.
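
A minimal sketch of this kind of substitution, assuming that placeholders are replaced verbatim and that URIs are wrapped in angle brackets. The function and its parameters are hypothetical and are not the Triclops implementation.

```python
import re


def fill_template(query, target, value, value_type="string"):
    """Replace a ?variable or a $name$ placeholder in a SPARQL query.

    Hypothetical helper for illustration; value_type is "string" or "uri".
    """
    replacement = "<%s>" % value if value_type == "uri" else value
    pattern = re.escape(target)
    if target.startswith("?"):
        # Do not touch longer variable names that merely share a prefix
        pattern += r"\b"
    # A function is used as the replacement so backslashes in value stay literal
    return re.sub(pattern, lambda match: replacement, query)


template = ('SELECT ?s ?o WHERE { GRAPH ?graphIRI { ?s ?p ?o . '
            'FILTER regex(?o, "$searchExpression$") } }')
query = fill_template(template, "$searchExpression$", "[Ww]ater")
query = fill_template(query, "?graphIRI", "http://example.org/graph", "uri")
print(query)
```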

In addition, solutions for a particular variable can be used for links, providing a small framework for configurable navigation workflows. For example, I can enter “[Ww]ater” in the field next to $searchExpression$, select classic from the pick list at the top of the Result navigation template area, pick “Assertions in a (named) RDF graph” from the next pick list, and enter the graphIRI variable in the subsequent text input.

Triggering this form submission will produce the result screen pictured below. As specified in the form, clicking any of the dbpedia links for the Doctrine of the Mean will initiate the invocation of the query titled “Assertions in a (named) RDF graph”, shown below (with the graphIRI variable pre-populated with the corresponding URI):

SELECT DISTINCT ?s ?p ?o where {
    GRAPH ?graphIRI {
      ?s ?p ?o
    }
}

Main SPARQL service page

The result of such an action is shown in the screen shot. Alternatively, a different subsequent query can be used: “Statements about a resource”. The relationship between the schema of a dataset and the factbase can be navigated in a similar way. Pick the query titled “Classes in dataset” and make the following modifications: select “Instances of a class and graph that the statements are asserted in” from the middle pick list of the Result navigation template section, and enter ?class in the text field to the right of this. Selecting ‘Execute..’ and executing this query results in a clickable result set comprised of classes of resources, and clicking any such link shows the instances of that class.

Main SPARQL service page

This latter form of navigation seems well suited for exploring datasets for which either there is no schema information in the service or it is not well known by the investigator writing the queries.

In developing this interface, at least 2 architectural principles were re-used from my SemanticDB development days: the use of XSLT on the client side to build rich, offloaded (X)HTML applications and the use of the filesystem for managing XML documents rather than a relational database. The latter (use of a filesystem) is especially relevant where querying across the documents is not a major requirement, or not a requirement at all. The former is achieved by setting the processing instruction of a result document to refer to a dynamically generated XSLT document on the server.

The XSLT creates a tabular, row-distinguishing interface where the links in certain columns trigger queries via a web API that takes various inputs, including: the variable in the current query whose solutions are ‘streamed’, a (subsequent) query specified by some function of the MD5 hash of its title, a variable in that query that is pre-populated with the corresponding solution, etc.:

../query=...&action=update&innerAction=execute,templateValue=...,&valueType=uri&variable=..
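
To make the shape of such a link concrete, here is a sketch that assembles one. The parameter names are taken from the fragment above. Everything else is an assumption for illustration: the base path, the function name, and the use of the plain hexadecimal MD5 digest of the title as the query identifier (the exact function of the hash is not spelled out here).

```python
import hashlib
from urllib.parse import urlencode


def navigation_link(base, query_title, variable, solution, value_type="uri"):
    # Assumption: the query is identified by the hex MD5 digest of its title
    query_id = hashlib.md5(query_title.encode("utf-8")).hexdigest()
    params = [
        ("query", query_id),
        ("action", "update"),
        ("innerAction", "execute"),
        ("templateValue", solution),  # the solution the user clicked on
        ("valueType", value_type),
        ("variable", variable),       # variable to pre-populate in that query
    ]
    return base + "?" + urlencode(params)


link = navigation_link("/sparql/queryMgr",
                       "Assertions in a (named) RDF graph",
                       "graphIRI",
                       "http://dbpedia.org/resource/Doctrine_of_the_Mean")
print(link)
```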

Eventually, the API should probably be made more RESTful and target the query, possibly leveraging some caching mechanism in the process. Perhaps it can even work in concert with the SPARQL 1.1 Graph Store HTTP Protocol.

Using Amara's pushtree for heavyweight XML processing in GRDDL and SPARQL querying

I’ve been using Amara to address my high throughput needs for Extract Transform Load (ETL), querying, and processing of large amounts of RDF. In one particular part of the larger process, I needed to be able to stream very large XML documents in a particular dialect into RDF/XML. I sent an email to the akara google group describing the challenges and my thoughts behind wanting to use a streaming XML paradigm rather than XSLT.

I basically want to leverage Amara’s pushtree and its use of coroutines as a minimal-overhead pipeline for dispatching events triggered by elements in the source XML, where the source XML is a GRDDL source document and the pushtree coroutine is the transformation property. That task is still a work in progress. In the interest of expedience I went forward and used XSLT, but in the end I need to try out some of what Uche suggested.

The other part where I have made much more progress is in streaming results to SPARQL queries (against a SPARQL service) into a CSV file via command-line and with minimal overhead (also using Amara, pushtree, and coroutines). A recent set of changes to layercake-python modified the sparqler command-line to add an --endpoint option which takes a SPARQL service URL. Other changes were made to the remote SPARQL service store to support this.

Also added was a new sparqlcsv script:

$ sparqlcsv --help
Usage: sparqlcsv [options] [SPARQLXMLFilePath]
Options:
 -h, --help            show this help message and exit
 -q QUOTECHAR, --quoteChar=QUOTECHAR
                       The quote character to use
 -c, --count           Just count the results, do not serialize to CSV
 -d DELIMITER, --delimiter=DELIMITER
                       The delimiter to use
 -o FILEPATH, --output=FILEPATH
                       The path where to write the resulting CSV file

This script takes a SPARQL XML file either from the file indicated as the first argument or from STDIN if none is specified and writes out a CSV file to STDOUT or to a file. The general architectural idea is to build a bash pipeline from the SPARQL service to a CSV file (and eventually into a relational database for more sophisticated analysis) or to STDOUT for subsequent processing along the pipeline.

So, now I can run a query against Virtuoso and stream the CSV results into a file (with minimal processing overhead):

$ sparqler --owl=..OWL file.. --ns=..prefix..=..URL.. \
           --endpoint=..SPARQL service URL.. \
"SELECT ... { ... }" | sparqlcsv | .. subsequent processing ..

The namespaces in the OWL/RDF file (provided by the --owl option) and those given explicitly via the --ns option are added as namespace prefix definitions at the top of the SPARQL query, which is dispatched to the remote SPARQL service at the URL provided to the --endpoint option. Alternatively, the -o option can be used to specify a filename where the CSV content is written.
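
The prefix handling amounts to something like the following sketch. It is not the sparqler source; the function name and the sample namespace map are made up for illustration.

```python
def add_prefixes(query, namespaces):
    """Prepend SPARQL PREFIX declarations to a query (illustrative helper)."""
    header = "".join("PREFIX %s: <%s>\n" % (prefix, uri)
                     for prefix, uri in sorted(namespaces.items()))
    return header + query


namespaces = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
print(add_prefixes("SELECT ?c WHERE { ?c a owl:Class }", namespaces))
```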

The sparqlcsv script uses a pushtree coroutine to stream XML content into a CSV file in this way:

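def produce_csv(doc,csvWriter,justCount):
   # Assumes sys, Counter, coroutine, U, pushtree, and entity_base are
   # imported or defined earlier in the script
   cnt=Counter()
   @coroutine
   def receive_nodes(cnt):
       while True:
           # pushtree sends each matching element here as it is parsed
           node = yield
           if justCount:
               cnt.counter+=1
           else:
               rt=[]
               badChars = False
               for binding in node.binding:
                   try:
                       rt.append(U(binding).encode('ascii'))
                   except UnicodeEncodeError:
                       # Fall back to dropping the non-ASCII characters
                       rt.append(U(binding).encode('ascii', 'ignore'))
                       badChars = True
                       print >> sys.stderr, "Skipping character", U(binding)
               if badChars:
                   cnt.skipCounter += 1
               csvWriter.writerow(rt)
   target = receive_nodes(cnt)
   # Dispatch every result element to the coroutine as soon as it is complete
   pushtree(doc, u'result', target.send, entity_factory=entity_base)
   target.close()
   return cnt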

Here doc is an XML document (as a string), csvWriter is a csv writer object, and justCount indicates whether the solutions are only counted rather than written out as CSV.
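
For comparison, the same streaming idea can be sketched with nothing but the standard library, using xml.etree.ElementTree.iterparse in place of pushtree and a plain loop in place of the coroutine. This is a stand-in written for illustration (in Python 3), not the sparqlcsv source, and stream_csv is a hypothetical name.

```python
import csv
import io
import xml.etree.ElementTree as ET

SR = "{http://www.w3.org/2005/sparql-results#}"


def stream_csv(source, out, just_count=False):
    """Write one CSV row per result in a SPARQL XML stream; return the count."""
    writer = csv.writer(out)
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != SR + "result":
            continue
        count += 1
        if not just_count:
            # One cell per binding, in document order
            writer.writerow(["".join(binding.itertext()).strip()
                             for binding in elem.findall(SR + "binding")])
        # Drop the children of the processed result to keep memory use low
        elem.clear()
    return count


SAMPLE = b"""<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="s"/><variable name="label"/></head>
  <results>
    <result>
      <binding name="s"><uri>http://example.org/a</uri></binding>
      <binding name="label"><literal>Water</literal></binding>
    </result>
    <result>
      <binding name="s"><uri>http://example.org/b</uri></binding>
      <binding name="label"><literal>Fire, and ice</literal></binding>
    </result>
  </results>
</sparql>"""

out = io.StringIO()
print(stream_csv(io.BytesIO(SAMPLE), out), "results")
print(out.getvalue())
```

Like the original, this writes bindings in the order they appear, so a result with an unbound variable produces a shorter row rather than an empty cell.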

Finding URLs in plain text

John Gruber put in some good work to derive and test a regex to extract URLs from plain text.

"An Improved Liberal, Accurate Regex Pattern for Matching URLs"

I needed to use it today and found it needs a bit of care to translate for use in Python, especially with regard to its Unicode characters. Here is my Python version, with a super-simple harness to use Gruber's test page:

I'm not entirely sure I've translated the original with 100% fidelity, but this has worked fine for my purposes. I'm open to tweaks or suggestions, and will keep the Gist updated.
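
For readers without the Gist to hand, the basic mechanics of pulling URLs out of text with Python's re module can be sketched as below. This is a deliberately simplified pattern written for illustration. It is not Gruber's regex, nor the translation discussed above, and it only handles http and https URLs.

```python
import re

# Simplified illustrative pattern: an http(s) scheme, then characters up to
# whitespace or a quote/angle bracket, not ending in common punctuation
URL_PATTERN = re.compile(r'\bhttps?://[^\s<>"]+[^\s<>".,;:!?)\]]')


def find_urls(text):
    """Return every substring of text that matches URL_PATTERN."""
    return URL_PATTERN.findall(text)


sample = u"See http://example.com/page, and (https://example.org/a?b=1)."
print(find_urls(sample))
```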

Numerical type with units - via Python Cookbook

I implemented dimensions.py perhaps eight years ago as an exercise and have used it occasionally ever since.

It allows doing math with dimensioned values in order to automate unit conversions (you can add m/s to mile/hour) and dimensional checking (you can't add m/s to mile/lightyear). It specifically does not convert 212F to 100C but rather will convert 9F to 5C (valid when converting temperature differences).

It is similar to unums (http://home.scarlet.be/be052320/Unum.html) but with a significant difference:

I used a different syntax Q(25,'m/s') as opposed to 100*m/s (I recall not wanting to have all the base SI units directly in the namespace). I'm not entirely sure which approach is really better.

I also had a specific need to have fractional exponents on units, allowing the following:
>>> km=Q(10,'N*m/W^(1/2)')
>>> km
Q(10.0, 'kg**0.5*m/s**0.5')
Looking back I see a few design decisions I might do differently today, but I'll share it anyway.

Some examples are in the source below the line with if __name__ == "__main__":

Note that I've put two files into the code block below, dimensions.py and dimensions.data, so please cut them apart if you want to try it.

via code.activestate.com

Very impressive library. I recently incorporated the use of the Measurement Unit Ontology into the Computer-based Patient Record (CPR) ontology and (on the surface) it seems like a library like this can provide the unit conversion machinery for RDF instances that use such a framework.
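
To make that idea concrete, here is a toy sketch of dimension-checked arithmetic. It is not the dimensions.py recipe and shares none of its code; the Quantity class, its of constructor, and the small UNITS table are hypothetical names chosen for illustration.

```python
from fractions import Fraction

# Hypothetical unit table: unit name -> (factor to SI, base-dimension exponents)
UNITS = {
    "m": (1.0, {"length": 1}),
    "mile": (1609.344, {"length": 1}),
    "s": (1.0, {"time": 1}),
    "hour": (3600.0, {"time": 1}),
}


class Quantity(object):
    """A value stored in SI units together with its dimension exponents."""

    def __init__(self, value, dims):
        self.value = float(value)
        # Fractions allow exponents such as 1/2; zero exponents are dropped
        self.dims = dict((k, Fraction(v)) for k, v in dims.items() if v != 0)

    @classmethod
    def of(cls, value, unit):
        factor, dims = UNITS[unit]
        return cls(value * factor, dims)

    def __truediv__(self, other):
        dims = dict(self.dims)
        for name, exponent in other.dims.items():
            dims[name] = dims.get(name, 0) - exponent
        return Quantity(self.value / other.value, dims)

    def __add__(self, other):
        # Dimensional check: only like quantities may be added
        if self.dims != other.dims:
            raise TypeError("cannot add quantities with different dimensions")
        return Quantity(self.value + other.value, self.dims)


metres_per_second = Quantity.of(10, "m") / Quantity.of(1, "s")
miles_per_hour = Quantity.of(60, "mile") / Quantity.of(1, "hour")
total = metres_per_second + miles_per_hour
print(total.value)  # the combined speed, expressed in m/s
```

Adding metres_per_second to a plain length such as Quantity.of(1, "m") raises TypeError, which is the dimensional check an RDF unit framework would want before converting values.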

SNOMED-CT Management via Semantic Web Open Source Tools Committed to Google Code

[by Chimezie Ogbuji]

I just committed my working copy of the set of tools I use to manipulate and serialize SNOMED-CT (the Systematized Nomenclature of Medicine) and the Foundational Model of Anatomy (FMA) as OWL/RDF for use in the clinical terminology research I’ve been doing lately. It is still in a very rough form and probably not usable by anyone other than a Python / Semantic Web hacker such as myself. However, I’m hoping to get it to a shape where it can be used by others. I had hesitated to release it mostly because of my concerns around the SNOMED-CT license, but I’ve been assured that as long as the hosting web site is based in the United States and (most importantly) the software is not released with the SNOMED distribution, it should be okay.

I have a (mostly empty) Wiki describing the command-line invocation. It leverages InfixOWL and rdflib to manipulate the OWL/RDF. Basically, once you have loaded the delimited distribution into MySQL (the library also requires MySQLdb and an instance of MySQL to work with), you can run the command-line, giving it a list of one or more SNOMED-CT terms (by their identifiers), and it will return an OWL/RDF representation of an extract from SNOMED-CT around those terms.

So, below is an example of running the command-line to extract a section around the term Diastolic Hypertension and piping the result to the FuXi command-line in order to select a single class (sno:HypertensiveDisorderSystemicArterial) and render it using my preferred syntax for OWL, the Manchester OWL syntax:

$ python ManageSNOMED-CT.py -e 48146000 -n short -s localhost -u ..mysql username.. --password=..mysql password.. -d snomed-ct | FuXi --ns=sno=tag:info@ihtsdo.org,2007-07-31:SNOMED-CT# --output=man-owl --class=sno:HypertensiveDisorderSystemicArterial --stdin
Class: sno:HypertensiveDisorderSystemicArterial
    ## Primitive Type (Hypertensive disorder) ##
    SNOMED-CT Code: 38341003 (a primitive concept)
    SubClassOf:
              Clinical finding
              Disease
              ( sno:findingSite some Systemic arterial structure )

Which renders an expression that can be paraphrased as

‘Hypertensive Disorder Systemic Arterial’ is a clinical finding and disease whose finding site is some structure of the systemic artery.

I can also take the Burn of skin example from the Wikipedia page on SNOMED and demonstrate the same thing, rendering it in its full (verbose) OWL/RDF/XML form:

<owl:Class rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#BurnOfSkin">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty>
        <owl:ObjectProperty rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#findingSite"/>
      </owl:onProperty>
      <owl:someValuesFrom rdf:resource="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#SkinStructure"/>
    </owl:Restriction>
    <rdf:Description rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#ClinicalFinding"/>
    <owl:Restriction>
      <owl:onProperty>
        <owl:ObjectProperty rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#associatedMorphology"/>
      </owl:onProperty>
      <owl:someValuesFrom rdf:resource="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#BurnInjury"/>
    </owl:Restriction>
    <rdf:Description rdf:about="tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#Disease"/>
  </owl:intersectionOf>
  <rdfs:label>Burn of skin</rdfs:label>
  <skos:scopeNote>disorder</skos:scopeNote>
  <skos:prefSymbol>284196006</skos:prefSymbol>
</owl:Class>

And then in its more palatable Manchester OWL form:

$ python ManageSNOMED-CT.py -e 284196006 -n short -s localhost -u ..username.. --password= -d snomed-ct | FuXi --ns=sno=tag:info@ihtsdo.org,2007-07-31:SNOMED-CT# --output=man-owl --class=sno:BurnOfSkin --stdin
Class: sno:BurnOfSkin
    ## A Defined Class (Burn of skin) ##
    SNOMED-CT Code: 284196006
    EquivalentTo:
      ( sno:ClinicalFinding and sno:Disease ) that
      ( sno:findingSite some Skin structure ) and (sno:associatedMorphology some Burn injury )

Which can be paraphrased as:

A clinical finding and disease whose finding site is some skin structure and whose associated morphology is injury via burn
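
The paraphrase follows directly from the structure of the RDF/XML: the named members of the intersection become the head and each restriction becomes a "some" clause. The sketch below walks a trimmed copy of the Burn of skin class with the standard library's ElementTree. It is only an illustration, not how ManageSNOMED-CT.py or FuXi render Manchester OWL, and paraphrase and local_name are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SNO = "tag:info@ihtsdo.org,2007-07-31:SNOMED-CT#"

DOC = """<rdf:RDF xmlns:rdf="{rdf}" xmlns:owl="{owl}">
<owl:Class rdf:about="{sno}BurnOfSkin">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Restriction>
      <owl:onProperty><owl:ObjectProperty rdf:about="{sno}findingSite"/></owl:onProperty>
      <owl:someValuesFrom rdf:resource="{sno}SkinStructure"/>
    </owl:Restriction>
    <rdf:Description rdf:about="{sno}ClinicalFinding"/>
    <owl:Restriction>
      <owl:onProperty><owl:ObjectProperty rdf:about="{sno}associatedMorphology"/></owl:onProperty>
      <owl:someValuesFrom rdf:resource="{sno}BurnInjury"/>
    </owl:Restriction>
    <rdf:Description rdf:about="{sno}Disease"/>
  </owl:intersectionOf>
</owl:Class>
</rdf:RDF>""".format(rdf=RDF, owl=OWL, sno=SNO)


def local_name(uri):
    """Return the fragment after '#' in a URI."""
    return uri.split("#", 1)[1]


def paraphrase(class_elem):
    """Render an intersection of named classes and 'some' restrictions."""
    named, restrictions = [], []
    for member in class_elem.find("{%s}intersectionOf" % OWL):
        if member.tag == "{%s}Restriction" % OWL:
            prop = member.find("{%s}onProperty/{%s}ObjectProperty" % (OWL, OWL))
            filler = member.find("{%s}someValuesFrom" % OWL)
            restrictions.append("( %s some %s )" % (
                local_name(prop.get("{%s}about" % RDF)),
                local_name(filler.get("{%s}resource" % RDF))))
        else:
            named.append(local_name(member.get("{%s}about" % RDF)))
    return "( %s ) that %s" % (" and ".join(named), " and ".join(restrictions))


root = ET.fromstring(DOC)
print(paraphrase(root.find("{%s}Class" % OWL)))
```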

The examples above use the ‘-n short’ option, which renders extracts in OWL via the short normal form which uses a procedure described in the SNOMED-CT manuals that produces a more canonical representation, eliminating redundancy in the process. It currently only works with the 2007-07-31 distribution of SNOMED-CT but I’m in the process of updating it to use the latest distribution. The latest distribution comes with its own OWL representation and I’m still trying to wrap my head around some quirks in it involving role groups and whether or not this library would need to change so it works directly off this OWL representation instead of the primary relational distribution format. Enjoy,

Indivo X - A Promising Framework for Personal Healthcare Records

Indivo X Alpha 1 released

A few days ago, we released the source code for the first public alpha of Indivo X, our latest vision for personally controlled health records. This is a release focused on the Indivo X API, targeted first at developers. Jump right into the installation instructions. (We don't recommend you use this version in a production environment just yet.)

Fred Trotter wrote up his first impressions, and ZDnet picked up the story on open-source and health-reform. We look forward to feedback from the community, and we're already hard at work on Alpha 2, which we expect to deliver in early Spring.

via indivohealth.org

I recently came across the publicly available codebase for Indivo. I've read quite a bit about Indivo during my research regarding Personal Healthcare Records (PHRs). Indivo seems like the most promising for several reasons. First, it is being released to the public (at least a version of it). One of the reasons I have been really driven to learn more about the PHR market is my belief that the combination of the social web, the emerging interest of patients in having more direct access to their healthcare data, and the significant stunting of adoption of contemporary web technologies for medical record systems will be a major catalyst for a new generation of health applications. There is a lot of data that supports this trend. Second, the seminal paper by the authors of Indivo captures their vision of how PHRs will change the healthcare data landscape. It is a good read for anyone interested in this phenomenon. Their vision appeals to me on a visceral level. There is something about the idea of a healthcare data revolution being sparked by patients themselves, and their willingness to adopt value-adding technologies that their caretakers are perhaps too risk averse to consider, that appeals to me.

Looking closely at the code base, I discovered that it is comprised of Python, Django, and Postgres. The SemanticDB patient registry is currently based on 4Suite, Python, a significant amount of XML processing and MySQL. I'm keen on building a simple hello-world PHR for managing my blood pressure readings and medications as a first iteration to see how far I can go with my current toolset: Akara (for the web infrastructure), Amara (for the XML processing), rdflib (for the RDF processing), FuXi (for any logical entailment and query re-writing), and CPR for the medical record ontology.

Indivo also includes a Python implementation of OAuth. I've been doing a lot of research regarding how OAuth can be adopted as a cryptographically safe mechanism to delegate (subscribed) access to PHR content (similarly to Facebook content subscription).

I will have a lot more to say on this general topic. Stay tuned!

Python APIs for the Upper Portions of the SW Layer Cake

[by Chimezie Ogbuji]

So, I haven't written about some recent capabilities I've been developing in FuXi. Ultimately, I would like it to be part of a powerful open-source library for semantic web middle-ware development. The current incarnation of this is python-dlp.

The core inference engine (a RETE-based production system) has been stable for some time now. FuXi is comprised of the following major modules:

DLP (An implementation of the DLP transformation)
Horn (an API for RIF Basic Logic Dialect)
Rete (the inference engine)
Syntax (Idiomatic APIs for OWL - InfixOWL)

I've written about all but the 2nd item (more on this in another article) and the last in the above list. I've come full circle on APIs for RDF several times, but I now believe that a pure object RDF model (ORM) approach will never be as useful as an object RDF vocabulary model, i.e., an API built for a specific RDF vocabulary. The Manchester OWL syntax has quickly become my syntax of choice for OWL and (at the time) I had been toying with an idea of turning it into an API.

Then one day, I had an exchange with Sandro Hawke on IRC about what a Python API for the OWL Abstract Syntax would look like. I had never taken a close look at the abstract syntax until then and immediately came away thinking something more light-weight and idiomatic would be preferable.

I came across a powerful infix operator recipe for Python:

Infix operators
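
The trick behind that recipe is small enough to sketch here. An object that overloads the | operator on both sides can sit between two operands and act like a named infix operator. The some operator below returns a plain tuple as a stand-in; it is not InfixOWL's implementation.

```python
class Infix(object):
    """Wrap a two-argument function so it can be written as  a |op| b."""

    def __init__(self, function):
        self.function = function

    def __ror__(self, left):
        # Handles  left | op  : remember the left operand
        return Infix(lambda right: self.function(left, right))

    def __or__(self, right):
        # Handles  (left | op) | right  : apply the stored function
        return self.function(right)


# Illustrative stand-in: builds a tuple describing an existential restriction
some = Infix(lambda prop, cls: ("some", prop, cls))

print("findingSite" |some| "SkinStructure")
```

Because | is evaluated left to right, the left operand is captured first and the wrapped function only runs once the right operand arrives.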

I wrote an initial, standalone module called InfixOWL and put up a wiki which still serves as decent initial documentation for the syntax. I've since moved it into a proper module in FuXi, fixed some bugs and very recently added even more idiomatic APIs.

The module defines the following top-level Python classes:

Individual - Common class for 'type' descriptor
AnnotatibleTerm(Individual) - Common class for 'label' and 'comment' descriptors
Ontology(AnnotatibleTerm) - OWL ontology
Class(AnnotatibleTerm) - OWL Class
OWLRDFListProxy - Common class for class descriptions composed from boolean operators
EnumeratedClass(Class) - Class descriptions consisting of owl:oneOf
BooleanClass(Class,OWLRDFListProxy) - Common class for owl:intersectionOf / owl:unionOf descriptions
Restriction(Class) - OWL restriction
Property(AnnotatibleTerm) - OWL property

Example code speaks much louder than words, so below is a snippet of InfixOWL code which I used to compose (programmatically) the CPR ontology:

CPR   = Namespace('http://purl.org/cpr/0.75#')
INF   = Namespace('http://www.loa-cnr.it/ontologies/InformationObjects.owl#')
EDNS  = Namespace('http://www.loa-cnr.it/ontologies/ExtendedDnS.owl#')
DOLCE = Namespace('http://www.loa-cnr.it/ontologies/DOLCE-Lite.owl#')
OBI   = Namespace('http://obi.sourceforge.net/ontology/OBI.owl#')
SNAP  = Namespace('http://www.ifomis.org/bfo/1.0/snap#')
SPAN  = Namespace('http://www.ifomis.org/bfo/1.0/span#')
REL   = Namespace('http://www.geneontology.org/owl#')
GALEN = Namespace('http://www.co-ode.org/ontologies/galen#')
TIME  = Namespace('http://www.w3.org/2006/time#')
CYC   = Namespace('http://sw.cyc.com/2006/07/27/cyc/')
XML   = Namespace('http://www.w3.org/2001/04/infoset#')
g = Graph()    
g.namespace_manager = namespace_manager
Class.factoryGraph = g
Property.factoryGraph = g
Ontology.factoryGraph = g

cprOntology = Ontology("http://purl.org/cpr/owl")
cprOntology.imports = ["http://obo.sourceforge.net/relationship/relationship.owl",
                       DOLCE,
                       #EDNS,
                       URIRef("http://obi.svn.sourceforge.net/viewvc/*checkout*/obi/ontology/trunk/OBI.owl"),
                       "http://www.w3.org/2006/time#"]
representationOf = Property(CPR['representation-of'],
                            inverseOf=Property(CPR['represented-by']),
                            domain=[Class(CPR['representational-artifact'])],
                            comment=[Literal("...")])
... snip ...
person = Class(CPR.person,
               subClassOf=[Class(SNAP.Object)])
... snip ...
clinician = Class(CPR.clinician)
clinician.comment=Literal("A person who plays the clinician role (typically Nurse, Physician / Doctor,etc.)")
#This expression is equivalent to cpr:clinician rdfs:subClassOf cpr:person
person+=clinician    
.. snip ..
patientRecord = Class(CPR['patient-record'])
patientRecord.comment=Literal("an electronic document (a representational artifact [REFTERM])  "+
                               "which captures clinically relevant data about a specific patient and "+
                               " is primarily comprised of one or more cpr:clinical-descriptions.")
patientRecord.seeAlso = URIRef("http://ontology.buffalo.edu/bfo/Terminology_for_Ontologies.pdf")
patientRecord.subClassOf = \
    [bytes,
     #Class(CYC.InformationBearingThing),
     CPR['representation-of'] |only| patient,
     REL.OBO_REL_has_proper_part |some| Class(CPR['clinical-description'])]
... snip ...
problemDisj=Class(CPR['pathological-state']) | organismalOccurrent | extraOrganismalOccurrent
problem = Class(CPR['medical-problem'],
                subClassOf=[problemDisj,
                            realizedBy|only|Class(CPR['screening-act']),                                
                            DOLCE['has-quality'] |some| owlTimeQuality])

After the OWL graph is composed, it can be serialized by simply invoking:

g.serialize(format='pretty-xml')

[Chimezie Ogbuji]

via Copia

First day as a Python/Mac developer

This is primarily just my scattershot notes on getting myself ready for Python and C development on Mac. It really is a confusing picture as to how to get started with Python development on the Mac. You can get a bunch of bits and pieces from the official Mac page for Python, the Python/Mac FAQ and a few other places, but it's hard to put it all together to understand how the OS X bundled Python, MacPython, Fink, MacPorts, framework or non framework, etc. all fit together, and how to navigate the options. It didn't help that important Wiki pages such as the FAQ had been vandalized, and I was not able to fix it for some reason.
It seems to me that the reason for all this confusion is that a person just needing to run some cool Python script they downloaded would go about things in a very different way from someone like me who needs to heavily maintain software that uses advanced Python/C facilities. It all comes down to the split personality that comes from the OS X way of life superimposed upon the UNIX way of life.

Picking a distribution

Also see:

this note about how to use distutils to build a redistributable package MacPython, and these follow-ups: 1 2 3.
This note on Uninstalling MacPython

The key section from the FAQ is the following, pasted from the diff of the vandalized page:

Q: Python overload! I've got Apple's Python, Jack's Python, Fink's Python...
A: Newcomers to Python-on-X are often confused by the several distributions of Python available. Each flavor has a history and a reason for existence, but if you're starting out, you probably want to look at the "official unofficial" builds of MacPython 2.4 on http://undefined.org/python and install additional packages like numarray or PIL from http://pythonmac.org/packages. These builds have a feature set that supersedes that of the beloved 'official MacPython builds' by Jack Jansen and solve many of the obstacles that are described by the FAQ entries on this page.

I followed this advice and went with MacPython, but I also set up MacPorts for some flexibility (see below).

Getting started with MacPython (including setuptools)

I grabbed and installed python-2.5-macosx.dmg from the page recommended in the FAQ.

I went with the approach of keeping MacPython in system directories, but installing packages I build from source in my home directory. This meant adding the following to my ~/.profile, for a start, after the "# Setting PATH for MacPython 2.5" section added by the MacPython installer.

export PATH=$HOME/bin:/usr/local/bin:$PATH

export PYVERSION=2.5
export PYSITE=$HOME/Library/Python/$PYVERSION/site-packages
export PYTHONPATH=$PYSITE

And then the following in ~/.pydistutils.cfg:

[install]
install_lib = ~/Library/Python/$py_version_short/site-packages
install_scripts = ~/bin

I also had to do a one-time

mkdir ~/bin
mkdir -p $PYSITE

I used setuptools for the first 4Suite and Amara install, following the OS X specific instructions.
One wrinkle was that Firefox 2.0.0.1 refused to save the page with ez_setup.py so I could run it. I tried changing locations and all that to no avail. Smells like a bug. I just used Safari to get it in the end. I noticed that OS X doesn't seem to come with wget. After this set-up, a simple:

easy_install Amara

Worked like a champ, and so I had 4Suite and Amara installed. I also got them set up in CVS easily enough, with the above basic config in place.

MacPorts

I also installed MacPorts, following the install instructions.
I was able to log into Apple Developer Network very easily using my Apple Store ID. One problem is that the instructions say:

Click Customize, expand the Applications category and click the checkbox beside X11 SDK to add it to the default items.

But the XCode 2.4.1 dmg I got had "X11 SDK" greyed out. I just went ahead anyway, and it turns out you must install X11 itself before XCode will allow you to install the X11 SDK. Makes sense, but the instructions on the install page have this backward.

As for installing X11 itself, the page says:

Insert the OS X 10.4 installation DVD and run the package named Additional Software.

For the MacBook Pro the installation DVD is labeled "Mac OS X Install Disc 1". The package is actually named "Optional Installs". I clicked through until I got to the page where I could select X11.

You also need to use sudo for the ports update, which isn't clear in the instructions:

sudo port -vd selfupdate

And that's really about as far as I got. I installed MacPorts just to have it handy, just in case. I might first put it to use for wget, which I won't be able to live without very long, and really should come with OS X.

[Uche Ogbuji]

via Copia

Amara 1.2rc1

4Suite has been bumped to 1.0.2 with some important bug fixes. I also pushed Amara a step closer to 1.2 with a 1.2rc1 release. I'll make it 1.2 final some time this week, and then on to some pretty big architectural changes for 2.0. All test reports are welcome, especially from Web server users. Jeremy might have figured out a workaround for the multiple-interpreter issue discussed in "multiple interpreters and extension modules". That should fix remaining known problems with mod_python.

[Uche Ogbuji]

via Copia

Progress on two Reference Implementations for RETE and GRDDL

Whew! During the moments I was able to spare while at ISWC (I was only there on Monday and Tuesday, unfortunately), I finished up two 'reference' implementations I've been enthusiastically hacking on quite recently. By reference implementation, I mean an implementation that attempts to follow a specification verbatim as an exercise to get a better understanding of it.

I still need to write up the engaging conversations I had at ISWC (it was really an impressive conference, even from just the two days' worth I got to see) as well as the presentation I gave on using GRDDL and Web architecture to meet the requirements of Computer-based Patient Records.

FuXi and DLP

The first milestone was with FuXi, which I ended up rewriting completely based on some exchanges with Peter Lin and Charles Young.

This has probably been the most challenging piece of software I've ever written and I was quite naive in the beginning about my level of understanding of the nuances of RETE. Anyone interested in the formal intersection of Notation 3 / RDF syntax and the RETE algorithm(s) will find the exchanges in the comments of the above post very instructive, along with Peter Lin's blog in general. Though he and I have our differences over the value of mapping RETE to RDF/N3, his insights into my efforts have been very helpful.

In the process, I discovered Robert Doorenbos's PhD thesis "Production Matching for Large Learning Systems", which was incredibly valuable in giving me a comprehensive picture of how RETE (and RETE/UL) could be 'ported' to accommodate Notation 3 reasoning.

The primary motivation (which has led to many sleepless nights and late night hackery) is what I see as a lack of understanding within the community of semantic web developers of the limitations of Tableaux-based reasoning and the common misconception that the major Tableaux-based reasoners (Fact++, Racer, Pellet, etc.) represent the ceiling of DL reasoning capability.

The reality is that logic programming has been around the block much longer than DL and has much more mature algorithms available (the primary one being RETE). I've invested quite a bit of effort in what I believe will (eventually) demonstrate very large-scale DL reasoning performance that will put Tableaux-based reasoning to shame - if only to make it clear that more investigation into the intersection of LP and DL is crucial for making the dream of the SW a verbatim reality.

Of course, there is a limit to what aspects of DL can be expressed as LP rules (this subset is called Description Logic Programming). The 'original' DLP paper does well to explain this limitation, but I believe this subset represents the more commonly used portions of OWL, and the portions of OWL 1.0 (and OWL 1.1 for that matter) left out by such an intersection will not be missed.
Ivan Herman pointed me to a paper by Horst, Herman which is quite comprehensive in outlining how this explicit intersection can be expressed axiomatically and the computational consequences of such an axiomatization. Ivan used this as a guide for his RDFSClosure module.
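
As a rough illustration of the DLP idea, the sketch below turns a few RDFS/OWL axioms into Horn rules written in an N3-like notation. The function name and the handful of axioms covered are illustrative assumptions. This is neither the full mapping from the DLP paper nor Horst's ruleset.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def axiom_to_rule(subject, predicate, obj):
    """Return an N3-style rule string for a supported axiom, else None."""
    if predicate == RDFS + "subClassOf":
        # C subClassOf D: every instance of C is also an instance of D
        return "{ ?x a <%s> } => { ?x a <%s> } ." % (subject, obj)
    if predicate == RDFS + "subPropertyOf":
        # P subPropertyOf Q: every P statement is also a Q statement
        return "{ ?x <%s> ?y } => { ?x <%s> ?y } ." % (subject, obj)
    if predicate == OWL + "inverseOf":
        # Only one direction is emitted here to keep the sketch short
        return "{ ?x <%s> ?y } => { ?y <%s> ?x } ." % (subject, obj)
    if predicate == RDF_TYPE and obj == OWL + "TransitiveProperty":
        return "{ ?x <%s> ?y . ?y <%s> ?z } => { ?x <%s> ?z } ." % (
            subject, subject, subject)
    # Axioms outside this small subset are simply not translated
    return None
```

Axioms that fall outside the rule-expressible subset come back as None, which mirrors the limitation described above: whatever cannot be written as a Horn rule is left out.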

Not enough (IMHO) has been done to explore this intersection because people are comfy with the confines of non-LP algorithms. The trail (currently somewhat cold) left by the Mindswap work on Pychinko needs to be picked up, followed and improved.

So, I rolled up my sleeves, dug deep and did my best to familiarize myself with the nuances of production system optimization. Most of the hard work has already been done, thanks to Robert Doorenbos's subsetting (and extension) of the original Charles Forgy algorithm. FuXi gets through a large majority of the OWL tests using a ruleset that closely implements what Horst lays out in his paper and does so with impressive times, even with more optimizations pending.

The most recent changes include a command-line interface for launching it:

chimezie@Zion:~/devel/Fuxi$ python Fuxi.py --out=n3 --ruleFacts
--ns=owl=http://www.w3.org/2002/07/owl#
--ns=test=http://metacognition.info/FuXi/DL-SHIOF-test.n3#
--rules=test/DL-SHIOF-test.n3
Time to build production rule (RDFLib): 0.0172629356384 seconds
Time to calculate closure on working memory: 224.906921387 m seconds

@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix test: <http://metacognition.info/FuXi/DL-SHIOF-test.n3#>.

 test:Animal test:in _:XRfZIlKy56,
        _:XRfZIlKy57.

 test:Cosi_fan_tutte a test:DaPonteOperaOfMozart;
    test:in _:XRfZIlKy47,
        _:XRfZIlKy48,
        [].

 test:Don_Giovanni a test:DaPonteOperaOfMozart;
    test:in _:XRfZIlKy47,
        _:XRfZIlKy48.

 .... snip ...

FuXi still doesn't support 'built-ins' (or custom comparisons), but luckily the thesis includes a section on how to implement non-equality testing of rule constraints that should be (relatively) easy to add. The thesis also includes a section on how negated conditions can be implemented (which is probably the most glaring axiom missing from DLP). Finally, Robert's thesis includes a clever mechanism for hashing the Alpha network that has yet to be implemented (and is simple enough to implement) that should contribute significant performance gains.
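
To give a feel for what hashing the Alpha network can look like, here is a minimal sketch of the general idea: alpha memories are keyed by the constants in their pattern, so an incoming triple is routed with a handful of dictionary lookups instead of being tested against every pattern. The class and method names are hypothetical, and this is not FuXi's code.

```python
from itertools import product

WILDCARD = None  # marks a position the pattern does not constrain


class HashedAlphaNetwork:
    def __init__(self):
        # (subject, predicate, object) pattern key -> list of matching triples
        self.memories = {}

    def add_pattern(self, s=WILDCARD, p=WILDCARD, o=WILDCARD):
        """Register a constant-test pattern and return its alpha memory."""
        return self.memories.setdefault((s, p, o), [])

    def add_triple(self, triple):
        """Store the triple in every alpha memory whose pattern it matches."""
        matched = []
        # Each position is either the triple's own term or a wildcard,
        # so at most 2**3 = 8 keys need to be looked up.
        for key in product(*[(term, WILDCARD) for term in triple]):
            memory = self.memories.get(key)
            if memory is not None:
                memory.append(triple)
                matched.append(key)
        return matched
```

The sketch only routes triples that arrive after a pattern is registered. A real network would also need to handle removal and patterns that repeat a variable.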

There are other pleasant surprises with the current codebase. The rule compiler can be used to identify inefficiencies in rule patterns, the command-line program can be used to serialize the closure delta (i.e., only the triples inferred from the ruleset and facts), and (my favorite) a Notation 3 ruleset can be exported as a graphviz diagram in order to visualize the rule network. Having 'browsed' various rule-sets in this way, I must say it helps in understanding the nuances of optimization when you can see the discrimination network that the triples are propagated through.
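
The notion of a closure delta can be made concrete with a small sketch. It uses a naive fixpoint loop rather than RETE, and the rule format, function names, and the classes test:Opera and test:MusicalWork in the usage note are assumptions made for illustration.

```python
def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def solutions(body, triples, bindings=None):
    """Yield every variable binding that satisfies all patterns in body."""
    bindings = bindings or {}
    if not body:
        yield bindings
        return
    for triple in triples:
        extended = match(body[0], triple, bindings)
        if extended is not None:
            yield from solutions(body[1:], triples, extended)


def closure_delta(facts, rules):
    """Return only the triples inferred from facts by rules."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # Materialize the matches first so `known` is not mutated mid-scan
            for found in list(solutions(body, known)):
                inferred = tuple(found.get(term, term) for term in head)
                if inferred not in known:
                    known.add(inferred)
                    changed = True
    return known - set(facts)
```

Given the fact that test:Don_Giovanni is a test:DaPonteOperaOfMozart, a subclass chain up through test:Opera to test:MusicalWork, and the single subclass rule, the delta contains just the two inferred type triples and none of the original facts.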

I don't have a permanent home for FuXi yet but have applied for a sourceforge project (especially since SF now supports SVN, apparently). So, until then, FuXi can be downloaded from:

GRDDL Client for 4Suite and RDFLib

During the same period, I've also been working on a 'reference' implementation of GRDDL (per the recently released Working Draft) for 4Suite and RDFLib. Doing so is a bit ironic, since the 4Suite repository framework has essentially been using GRDDL-like Content Management Systems mechanisms since its inception (sometime in 2001).
However, I thought doing so would be the perfect opportunity to:

Demonstrate how 4Suite can be used with RDFLib (as a prep to the pending deprecation of 4Suite RDF for RDFLib)
Build a framework to compose illustrative test cases for the Working Group (of which I'm a member)
Completely familiarize myself with the various GRDDL mechanisms

I posted this to the Working Group mailing list and plan to continue working on it. In particular, the nice thing about the convergence of these two projects of mine is that I've been able to think a bit about how both GRDDL and FuXi could be used to implement efficient, opportunistic programs that extract RDF via GRDDL and explicit links (rdfs:seeAlso, owl:imports, rdfs:isDefinedBy) and perform incremental web closures by adding the triples discovered in this way one at a time to a live RETE network.

The RETE algorithm is tailored specifically to behave as a black box to changes to a production system and so crawling the web, extracting RDF (a triple at a time) and reasoning over it in this way (the ultimate semantic web scenario) becomes a very real scenario with sub-second response times to boot. At the very least, reasoning should cease to be much of a bottleneck compared to the actual dereferencing and parsing of RDF from distributed locations. Very exciting prospects. Watch this space..
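
The one-triple-at-a-time behavior can be sketched with a single hard-coded rule, { ?x a ?c . ?c subClassOf ?d } => { ?x a ?d }. The class below is a hypothetical stand-in, not a RETE network: it keeps two small indexed memories and joins each new triple only against what is already stored, so earlier work is never redone.

```python
from collections import defaultdict, deque


class IncrementalTypeClosure:
    """Maintains the subclass typing rule as triples arrive one by one."""

    def __init__(self):
        self.known = set()
        self.instances = defaultdict(set)     # class -> instances seen so far
        self.superclasses = defaultdict(set)  # class -> direct superclasses

    def add(self, triple):
        """Add one triple; return the triples newly inferred because of it."""
        inferred = []
        queue = deque([triple])
        while queue:
            current = queue.popleft()
            if current in self.known:
                continue
            self.known.add(current)
            if current != triple:
                inferred.append(current)
            s, p, o = current
            if p == "a":
                # New instance: join against the stored superclasses
                self.instances[o].add(s)
                queue.extend((s, "a", d) for d in self.superclasses[o])
            elif p == "subClassOf":
                # New subclass link: join against the stored instances
                self.superclasses[s].add(o)
                queue.extend((x, "a", o) for x in self.instances[s])
        return inferred
```

A crawler could call add() for each triple as it is extracted and act on the returned inferences immediately, which is the usage pattern described above.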

Chimezie Ogbuji

via Copia
