This Posterous hosts the archives of the old Copia Weblog, which was very active, and popular, from 2005 to 2007.
This is a dual-language blog entry. Nigerian Pidgin first, then the translation to en-US.
Poverty no good at all, oh
Na him make I join this business
419 no be thief, its just a game
Everybody dey play am
If anybody fall mugu, Ha! my brother, I go chop am
Translation: Poverty sucks, so I joined this business. 419 isn't stealing--it's just a game. Everybody does it. If anyone is stupid enough to fall for it, I'll get away with what I can.
National Stadium na me build am
President na my sister brother
You be the mugu, I be the master
Oyinbo I go chop your dollar, I go take your money disappear
you are the loser I am the winner
Probably no translation needed except to mention that Oyinbo means white man. Osuofia is a character from a few popular Nollywood comedy films, and really what this song is doing is two-fold. It's providing some fictional escape from the too-real problem of poverty in Nigeria, among honest people and dishonest alike. It's also skewering the outrageous claims of 419 scam artists, along with the outrageous gullibility of those who fall for such claims.

Think of it: you walk up to a man on a small town Nigerian street (say Okigwe, where I went to secondary school). You tell him "hey, if you were to send Americans an e-mail telling them you're the widow of the President, and that if they can get you $10,000 you'll get them $1,000,000 the president stole from his people." You might expect his reaction to be: "I can't imagine who would fall for such a silly story, but if they did, I don't feel sorry for them, because why should they want to help in theft from people who can so ill afford to lose anything?" You could also imagine this man wandering back to work with no lunch (he has to skip that meal to save money) dreaming of what he could do with $10,000 from a greedy, gullible hand overseas. Then a year later you go back to that same man and you tell him "Remember that scam I told you about? Well it's been going gangbusters, and there have been a lot of victims, and now people look at all Nigerians as just a bunch of spammer/scammers." Imagine his combination of bemusement, bewilderment and contempt for both the scammers and the vics.

Most Nigerians handle such nonsense with black irony, and this is precisely the spirit of "I go chop your dollar". I'd say that's obvious to any Nigerian who hears it, and the festive tone of the song is just the broadest clue. 419ers who enjoy the song probably employ intentional double irony. Which means, of course, that the use of the song in the close of "This American Life" represents a triple irony. Which is pretty cool, even if they unwittingly gave the wrong impression about the song's audience.
[by Chimezie Ogbuji]
So, I haven't written about some recent capabilities I've been developing in FuXi. Ultimately, I would like it to be part of a powerful open-source library for semantic web middleware development. The current incarnation of this is python-dlp.
The core inference engine (a RETE-based production system) has been stable for some time now. FuXi is comprised of the following major modules:
- DLP (An implementation of the DLP transformation)
- Horn (an API for RIF Basic Logic Dialect)
- Rete (the inference engine)
- Syntax (Idiomatic APIs for OWL - InfixOWL)
I've written about all but the 2nd item (more on this in another article) and the last item in the above list. I've come full circle on APIs for RDF several times, but I now believe that a pure object RDF model (ORM) approach will never be as useful as an object RDF vocabulary model, i.e., an API built for a specific RDF vocabulary. The Manchester OWL syntax has quickly become my syntax of choice for OWL, and (at the time) I had been toying with the idea of turning it into an API.
Then one day, I had an exchange with Sandro Hawke on IRC about what a Python API for the OWL Abstract Syntax would look like. I had never taken a close look at the abstract syntax until then and immediately came away thinking something more light-weight and idiomatic would be preferable.
I came across a powerful infix operator recipe for Python:
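The recipe wraps a two-argument function in a class whose __ror__/__or__ methods let the function be written between its operands. A minimal sketch of the idea (reconstructed from memory of the recipe, not InfixOWL's own code; the isSubClassOf operator below is just a made-up illustration):

class Infix(object):
    """Wrap a binary function so it can be used as: left |op| right."""
    def __init__(self, function):
        self.function = function
    def __ror__(self, other):
        # the left operand arrives first; capture it and wait for the right one
        return Infix(lambda x: self.function(other, x))
    def __or__(self, other):
        return self.function(other)

# usage: a hypothetical 'subclass-of' operator
isSubClassOf = Infix(lambda child, parent: (child, 'rdfs:subClassOf', parent))
print('cpr:clinician' |isSubClassOf| 'cpr:person')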
I wrote an initial, standalone module called InfixOWL and put up a wiki which still serves as decent initial documentation for the syntax. I've since moved it into a proper module in FuXi, fixed some bugs and very recently added even more idiomatic APIs.
The module defines the following top-level Python classes:
- Individual - Common class for 'type' descriptor
- AnnotatibleTerm(Individual) - Common class for 'label' and 'comment' descriptors
- Ontology(AnnotatibleTerm) - OWL ontology
- Class(AnnotatibleTerm) - OWL Class
- OWLRDFListProxy - Common class for class descriptions composed from boolean operators
- EnumeratedClass(Class) - Class descriptions consisting of owl:oneOf
- BooleanClass(Class,OWLRDFListProxy) - Common class for owl:intersectionOf / owl:unionOf descriptions
- Restriction(Class) - OWL restriction
- Property(AnnotatibleTerm) - OWL property
Example code speaks much louder than words, so below is a snippet of InfixOWL code which I used to compose (programmatically) the CPR ontology:
CPR = Namespace('http://purl.org/cpr/0.75#')
INF = Namespace('http://www.loa-cnr.it/ontologies/InformationObjects.owl#')
EDNS = Namespace('http://www.loa-cnr.it/ontologies/ExtendedDnS.owl#')
DOLCE = Namespace('http://www.loa-cnr.it/ontologies/DOLCE-Lite.owl#')
OBI = Namespace('http://obi.sourceforge.net/ontology/OBI.owl#')
SNAP = Namespace('http://www.ifomis.org/bfo/1.0/snap#')
SPAN = Namespace('http://www.ifomis.org/bfo/1.0/span#')
REL = Namespace('http://www.geneontology.org/owl#')
GALEN = Namespace('http://www.co-ode.org/ontologies/galen#')
TIME = Namespace('http://www.w3.org/2006/time#')
CYC = Namespace('http://sw.cyc.com/2006/07/27/cyc/')
XML = Namespace('http://www.w3.org/2001/04/infoset#')

g = Graph()
g.namespace_manager = namespace_manager
Class.factoryGraph = g
Property.factoryGraph = g
Ontology.factoryGraph = g

cprOntology = Ontology("http://purl.org/cpr/owl")
cprOntology.imports = ["http://obo.sourceforge.net/relationship/relationship.owl",
                       DOLCE,
                       #EDNS,
                       URIRef("http://obi.svn.sourceforge.net/viewvc/*checkout*/obi/ontology/trunk/OBI.owl"),
                       "http://www.w3.org/2006/time#"]

representationOf = Property(CPR['representation-of'],
                            inverseOf=Property(CPR['represented-by']),
                            domain=[Class(CPR['representational-artifact'])],
                            comment=[Literal("...")])
... snip ...
person = Class(CPR.person, subClassOf=[Class(SNAP.Object)])
... snip ...
clinician = Class(CPR.clinician)
clinician.comment=Literal("A person who plays the clinician role (typically Nurse, Physician / Doctor,etc.)")
#This expression is equivalent to cpr:clinician rdfs:subClassOf cpr:person
person+=clinician
... snip ...
patientRecord = Class(CPR['patient-record'])
patientRecord.comment=Literal("an electronic document (a representational artifact [REFTERM]) "+
                              "which captures clinically relevant data about a specific patient and "+
                              " is primarily comprised of one or more cpr:clinical-descriptions.")
patientRecord.seeAlso = URIRef("http://ontology.buffalo.edu/bfo/Terminology_for_Ontologies.pdf")
patientRecord.subClassOf = \
    [bytes, #Class(CYC.InformationBearingThing),
     CPR['representation-of'] |only| patient,
     REL.OBO_REL_has_proper_part |some| Class(CPR['clinical-description'])]
... snip ...
problemDisj = Class(CPR['pathological-state']) | organismalOccurrent | extraOrganismalOccurrent
problem = Class(CPR['medical-problem'],
                subClassOf=[problemDisj,
                            realizedBy|only|Class(CPR['screening-act']),
                            DOLCE['has-quality'] |some| owlTimeQuality])
After the OWL graph is composed, it can be serialized by simply invoking:
g.serialize(format='pretty-xml')
Note: This is a semi-rant on the current state of healthcare and innovative technology and why we all should be motivated to do something more about it. The opinions expressed here are mine and mine alone (Chimezie Ogbuji).
We recently wrote up a case study for the W3C Semantic Web Education and Outreach Interest Group:
"A Semantic Web Content Repository for Clinical Research"
A major difference between the user experience with SemanticDB and the previous interface to the relational technology-based Cardiovascular Information Registry (CVIR) that has accelerated adoption of Semantic Web technologies is the use of local terminology familiar within the domain rather than terms that are a consequence of the physical organization of the data. In addition, the model of the domain (expressed in OWL) is more amenable to extensions typically associated with targeted studies that introduce additional variables.
It is an overview of the work we have been doing on clinical research, driven by the value of having well-curated, population-level patient data. It is a very appropriate use case for the semantic web in two respects: the problems addressed by the specific technologies used are directly relevant for clinical research, and the work exemplifies certain sociological aspects of the semantic web (altruism through innovative technology, open communities / standards / software, etc.). This latter point isn't emphasized often enough, though I've been thinking quite a bit about it lately as I've been developing a compact ontology for medical records. This started as a side project associated with the activities in the W3C Semantic Web Healthcare and Life Sciences Interest Group that I am involved in, but has since become a personal project to investigate a personal philosophy that has recently come in contact with the nature of my current work through a tragedy in my family.
One of the things I would like to do is learn a bit about the ailments in my family through active engagement with the science behind these ailments. I'm a software hobbyist with aspirations of contributing to the pragmatic application of knowledge representation to common human problems. I have access to all the technologies and tools that can make a personal medical record repository a reality for me. I have access to a massive, freely available, well-organized ontology of clinical phenomena (GALEN). Common sense suggests that there is no one more motivated than I am to learn whether such an exercise is fruitful. I could sit around, waiting for modern medicine to catch up with the reality of the innovative technologies of today, but why should I wait? Why should we wait is the question I really want to ask, but at the very least I can do something about my immediate situation (we have royalty-free standards, open source software, and open communities to thank for that).
If I have tools which can draw sound conclusions from well-curated information about all my medical history (and the medical history of my loved ones), document the complete set of justifications for these conclusions, reduce forms-based management of this information to a trivial task, and store it all in a content repository, is it not in my interest to take advantage of these tools for the benefit of my health and the health of my loved ones?
To a certain extent, applying innovative technology at the point of care or for research purposes is a win-win. No brainer, really. At least it shouldn't be. It is in the best interest of both healthcare providers and recipients of healthcare services to leverage innovative technologies.
I work for a not-for-profit organization with a mission statement to serve the people of Cleveland. I'm one of those types who take such things seriously (oaths, mission statements, etc.). I was born (and essentially raised) in the greater Cleveland area. I have (young) children and family here. In addition, there is a strong history of hypertension and diabetes in my genetic lineage. I've lost loved ones at the point of care. The combination of these things makes the work I do much more relevant for me, and as such I take it very seriously.
The ridiculous cost of healthcare, its effectiveness, and the curation of expressive patient data for the benefit of scientific research should be thought of first as a problem that modern science has a duty to solve, rather than simply as a business opportunity. A certain minimal amount of altruism is required. Anything less would be a disservice to the silent oaths that nurses take when they dedicate their professional lives to the healthcare of the populace with a vigor that only a few can demonstrate. My mom was (and is) an incredible nurse, so I should know a little something about this.
At any rate, I think collectively we sit at a point of transition and reformation in the way healthcare information is housed, managed, and leveraged. I believe the shape of things to come will depend largely on how well we understand that altruism deserves a very prominent seat in all this. Everyone is affected by unnecessarily expensive healthcare, even the providers themselves.
I've been doing a lot of thinking (and prototyping) about Symmetric Multi-Processor evaluation of RDF queries against large RDF datasets. How could you Map/Reduce an evaluation of SPARQL / Versa (a la HbaseRDF), for instance? Entailment-free graph matching is the way to go at large volumes. I like to eat my own dog food, so I'll start with my toolkit.
Let us say you had a large RDF dataset of historical characters modelled in FOAF, where the graph names match the URIs assigned to the characters. How would you evaluate a query like the following (both serially and in concurrent fashion)?
BASE <http://xmlns.com/foaf/0.1/>
PREFIX rel: <http://purl.org/vocab/relationship/>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>
SELECT ?mbox
WHERE {
    ?kingWen :name "King Wen";
             rel:parentOf ?son .
    GRAPH ?son { [ :mbox ?mbox ] }.
}
Written in the SPARQL abstract syntax:
Join(BGP(?kingWen foaf:name "King Wen". ?kingWen rel:parentOf ?son),Graph(?son,BGP(_:a foaf:mbox ?mbox)))
If you are evaluating it naively, it would reduce (via the algebra) to:
Join(eval(Dh, BGPa), eval(Dh, Graph(?son,BGPb)))
Where Dh denotes the RDF dataset of historical literary characters, BGPa denotes the BGP expression
?kingWen foaf:name "King Wen". ?kingWen rel:parentOf ?son
BGPb denotes
_:a :mbox ?mbox
The current definition of the Graph operator, as well as the one given by Perez et al., seems (to me) amenable to parallel evaluation. Let us take a look at the operational semantics of evaluating the same query in Versa:
map( (*|-foaf:name->"King Wen")-rel:parentOf->*, 'scope(".-foaf:mbox->*",.)', )
Versa is a graph-traversal-based RDF querying language. This has the advantage of a body of computational graph theory that we can rely on for analysis of a query's complexity. Parallel evaluation of (directed) graph traversals appears to be an open problem for a deterministic Turing machine. The input to the map function would be the URIs of the sons of King Wen. The map function would be the evaluation of the expression:
scope(".-foaf:mbox->*",.)
This would seem to be the equivalent of the evaluation of the Graph(..son URI..,BGPb) SPARQL abstract expression. So far, so good. Parallel evaluation can be implemented in a manner that is transparent to the application. An analysis of the evaluation using some complexity theory would be interesting to see if RDF named graph traversal queries have a data complexity that scales.
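As a rough illustration of that transparency, below is a minimal sketch (plain Python with rdflib, not Versa or FuXi code) of evaluating the two halves of the query, with the per-named-graph half farmed out to worker processes. load_named_graph() is a hypothetical accessor for the dataset's named graphs; the FOAF and REL namespaces are the ones used in the query above:

from multiprocessing import Pool
from rdflib import Literal, Namespace

FOAF = Namespace('http://xmlns.com/foaf/0.1/')
REL = Namespace('http://purl.org/vocab/relationship/')

def mboxes_for_son(son_uri):
    # Graph(?son, BGPb): evaluate _:a foaf:mbox ?mbox inside one named graph
    g = load_named_graph(son_uri)   # hypothetical accessor for the dataset
    return list(g.objects(None, FOAF.mbox))

def evaluate(default_graph):
    # BGPa (evaluated serially): ?kingWen foaf:name "King Wen"; rel:parentOf ?son
    sons = [son for king in default_graph.subjects(FOAF.name, Literal("King Wen"))
                for son in default_graph.objects(king, REL.parentOf)]
    # each named-graph evaluation is independent, so it can be farmed out
    with Pool() as pool:
        return [mbox for mboxes in pool.map(mboxes_for_son, sons) for mbox in mboxes]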
Yay. GRDDL is now a W3C Recommendation!
I'm very proud to be a part of that, and there is a lot about this particular architectural style that I have always wanted to write about. I recently came upon the opportunity to consider one particular facet.
This is why it seems the same as with GRDDL. There are transformations you can make, but they are not entailments in a logic, they just go from one graph to a different graph.
- Sandro Hawke on W3C Rules Interchange Format Working Group (Tue, 18 Sep 2007)
Yes, that is one part of the larger framework that is well considered. GRDDL does not rely on logical entailment for its normative definition. It is defined operationally, but can also be described via declarative (formal) semantics. It defines a mapping (not a function in the true sense - the specification clearly identifies ambiguity at the level of the infoset) from an XML representation of an "information resource" to a typed RDF representation of the same "information resource". The output is required to have a well-defined mapping of its own into the RDF abstract syntax.
The less formal definition uses a dialect of Notation 3 that is a bit more expressive than Datalog Logic Programming (it uses function symbols - builtins - in some of the clauses). The proof at the bottom of that page justifies the assertion that http://www.w3.org/2001/sw/grddl-wg/td/titleauthor.html has a GRDDL result which is composed entirely of the following RDF statement:
<http://musicbrainz.org/mm-2.1/album/6b050dcf-7ab1-456d-9e1b-c3c41c18eed2> is named "Are You Experienced?" .
Frankly, I would have gone with "Bold as Love", myself =)
Once you have a (mostly) well-defined function for rendering RDF from information resources, you enable the deployment of useful (and re-usable) interpretations for intelligent agents (more on these later). For example, the test suite is a large semantic web of XML documents that GRDDL-aware agents can traverse, performing Quality Assurance tests (using EARL) of their conformance to the operational semantics of GRDDL.
However, it was very important to leave entailment out of the equation until it serves a justifiable purpose. For example, a well-designed RDF querying language does not require logical entailment (RDF, RDFS, OWL, or otherwise) to be useful in the general case. You can calculate a closure (or Herbrand base) and then dispatch structural graph queries. This was always true with Versa. You can glean (pun intended) quite a bit from only the structural nature of a graph; a whole generation of graph-theoretical literature demonstrates this.
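To make the closure-then-query pattern concrete, here is a minimal sketch using rdflib: one closure rule (transitivity of rdfs:subClassOf, chosen purely as an illustration; a real deployment would delegate materialization to a rule engine such as FuXi or cwm) is computed up front, after which queries are answered by plain graph matching with no entailment regime. The 'ontology.rdf' filename is a placeholder:

from rdflib import Graph
from rdflib.namespace import RDFS

def materialize_subclass_closure(g):
    # naive fixpoint: add (a subClassOf c) whenever (a subClassOf b) and (b subClassOf c)
    changed = True
    while changed:
        changed = False
        for a, b in list(g.subject_objects(RDFS.subClassOf)):
            for c in list(g.objects(b, RDFS.subClassOf)):
                if (a, RDFS.subClassOf, c) not in g:
                    g.add((a, RDFS.subClassOf, c))
                    changed = True
    return g

g = materialize_subclass_closure(Graph().parse('ontology.rdf'))  # placeholder source
# purely structural query; no entailment regime needed at query time
for row in g.query('SELECT ?c ?super WHERE { ?c rdfs:subClassOf ?super }',
                   initNs={'rdfs': RDFS}):
    print(row)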
In addition, once you have a well-defined set of semantics for identifying information resources with RDF assertions that are (logically) relevant to the closure, you have a clear separation between manipulation of surface syntax and full-blown logical reasoning systems.
It should be considered a semantic web architectural style (if you will) to constrain the use of entailment to only where it has some demonstrated value to the problem space. Where it makes sense to use entailment, however, you will find the representations are well-engineered for the task.
I'm about ready to check in some updates for proof generation in FuXi, and was having some fun with httpRange-14 that I wanted to share. Some time ago, I wrote Emeka, a phenny-based IRC bot, to experiment with how intelligent agents might best be used over IRC. I've recently been playing around with using REGEX patterns to match and invoke "Semantic Web" services (none seem "generally" useful, so far). But I recently added a "hook" to generate proofs of RDF assertions from rulesets and ontologies. Below is how Emeka & FuXi deduce that http://dbpedia.org/data/DBpedia is a "document" as defined by the FOAF ontology:
<chimezie> .varBind proofSrc list('http://dbpedia.org/data/DBpedia','http://xmlns.com/foaf/spec/')
<chimezie> Emeka, proove "@prefix foaf: <http://xmlns.com/foaf/0.1/>. <http://en.wikipedia.org/wiki/DBpedia> a foaf:Document ." from $proofSrc
<Emeka> Calculated closure in 514.962911606 milli seconds
<Emeka> reached Q.E.D. See: http://rafb.net/p/5PvAhz98.html
The first line uses a Versa expression to build a variable bound to a list of resources, which can be referenced within Versa expressions (these are the ontologies that should be used with "graph links"). The second line uses a combination of Notation 3 (to express the "goal") and Versa (for the list of information resources). I'm still considering whether having an IRC bot automatically paste content to a "paste bin" is rude.
The contents of the pastebin comprise a (somewhat) human-readable (linear) proof.
Human-readable proof:
Building nodeset around antecedent goal (justified by a direct assertion): foaf:page(dbpedia:DBpedia wiki:DBpedia)
Building inference step for Proof step for foaf:topic(wiki:DBpedia dbpedia:DBpedia)
    Inferred from RETE node via foaf:topic(?X ?sgflxxAo413) :- foaf:page(?sgflxxAo413 ?X)
    Bindings: {?X: rdflib.URIRef('http://en.wikipedia.org/wiki/DBpedia'), ?sgflxxAo413: rdflib.URIRef('http://dbpedia.org/resource/DBpedia')}
Building nodeset around antecedent goal: foaf:topic(wiki:DBpedia dbpedia:DBpedia)
Building inference step for Proof step for foaf:Document(wiki:DBpedia)
    Inferred from RETE node via foaf:Document(?X) :- foaf:topic(?X ?sgflxxAo1286)
    Bindings: {?X: rdflib.URIRef('http://en.wikipedia.org/wiki/DBpedia'), ?sgflxxAo1286: rdflib.URIRef('http://dbpedia.org/resource/DBpedia')}
Building proof around goal: foaf:Document(wiki:DBpedia)
## Generated by Emeka at 2007-09-12T20:52:53.728814 in 6.64019584656 milli seconds
Note that the proof doesn't rely on what the transport protocol says about http://en.wikipedia.org/wiki/DBpedia.
I had some problems getting the Stanford Inference Web proof browser to render FuXi's RDF/XML serialization; I wonder if it is because of my assumption (in the serialization) that the 'entailment' semantics of a RETE network is equivalent to Generalized Modus Ponens. Anyway, the diagram FuXi generates is below:
This was generated via:
Fuxi --output=dot --ns=wiki=http://en.wikipedia.org/wiki/ --ns=dbpedia=http://dbpedia.org/resource/ --ns=foaf=http://xmlns.com/foaf/0.1/ --proove="@prefix foaf: <http://xmlns.com/foaf/0.1/>. <http://en.wikipedia.org/wiki/DBpedia> a foaf:Document ." --dlp http://xmlns.com/foaf/spec/ http://dbpedia.org/data/DBpedia
I just need to wrap up the doctests and unit tests and push the latest build to the cheeseshop and Google Code.
[by Chimezie Ogbuji]
I've been doing a lot of "Google reading" lately.
Completing Logical Reasoning System Capabilities
With the completion (or near completion) of PML-generating capabilities for FuXi, it is becoming a fully functional logical reasoning system. In "Artificial Intelligence: A Modern Approach", Stuart Russell and Peter Norvig identify the following main categories of automated reasoning systems:
- Theorem provers
- Production systems
- Frame systems and semantic networks
- Description Logic systems
OWL and RDF provide coverage for 3 and 4. The second category is functionally covered by the RETE-UL algorithm FuXi employs (a highly efficient modification of the original RETE algorithm). The currently developing RIF Basic Logic Dialect covers 2 - 4. Proof Markup Language covers 1. Now, FuXi can generate Proof Markup Language (PML) structures (and export visualization diagrams). I still need to do more testing, and hope to be able to generate proofs for each of the OWL tests. Until then, below is a diagram of the proof tree generated from the "She's a Witch and I have Proof" test case:
# http://clarkparsia.com/weblog/2007/01/02/burn-the-witch/
# http://www.netfunny.com/rhf/jokes/90q4/burnher.html
@prefix : <http://www.w3.org/2000/10/swap/test/reason/witch#>.
@keywords is, of, a.
#[1] BURNS(x) /\ WOMAN(x) => WITCH(x)
{ ?x a BURNS. ?x a WOMAN } => { ?x a WITCH }.
#[2] WOMAN(GIRL)
GIRL a WOMAN.
#[3] \forall x, ISMADEOFWOOD(x) => BURNS(x)
{ ?x a ISMADEOFWOOD. } => { ?x a BURNS. }.
#[4] \forall x, FLOATS(x) => ISMADEOFWOOD(x)
{ ?x a FLOATS } => { ?x a ISMADEOFWOOD }.
#[5] FLOATS(DUCK)
DUCK a FLOATS.
#[6] \forall x,y FLOATS(x) /\ SAMEWEIGHT(x,y) => FLOATS(y)
{ ?x a FLOATS. ?x SAMEWEIGHT ?y } => { ?y a FLOATS }.
# and, by experiment
# [7] SAMEWEIGHT(DUCK,GIRL)
There is another test case, "Dan's home region is Texas", on the python-dlp Wiki: DanHomeProof:
@prefix : <gmpbnode#>.
@keywords is, of, a.
dan home [ in Texas ].
{ ?WHO home ?WHERE. ?WHERE in ?REGION } => { ?WHO homeRegion ?REGION }.
I decided to use PML structures since there are a slew of Stanford tools which understand / visualize it and I can generate other formats from this common structure (including the CWM reason.n3 vocabulary). Personally, I prefer the proof visualization to the typically verbose step-based Q.E.D. proof.
Update: I found a nice write-up on the CWM-based reason ontology and translations to PML.
So, how does a forward-chaining production rule system generate proofs that are really meant for backward-chaining algorithms? When the FuXi network is fed the initial assertions, it is told what the 'goal' is. The goal is a single RDF statement which is being proved. When forward chaining results in an inferred triple which matches the goal, the RETE algorithm terminates. So, depending on the order of the rules and the order in which the initial facts are fed, it will be (in the general case) less efficient than a backward-chaining algorithm. However, I'm hoping the blinding speed of the fully hashed RETE-UL algorithm makes up the difference.
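A toy illustration of that control flow (nothing like FuXi's hashed RETE-UL network, just the goal-directed termination idea): naive forward chaining over a set of ground triples that stops the moment the goal is derived:

def chain_until_goal(facts, rules, goal):
    """facts: a set of ground triples; rules: callables mapping the current
    fact set to an iterable of newly derivable triples; goal: a single triple."""
    inferred = set(facts)
    if goal in inferred:
        return True
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for triple in rule(inferred):
                if triple not in inferred:
                    inferred.add(triple)
                    changed = True
                    if triple == goal:   # goal matched: terminate the whole run early
                        return True
    return False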
I've been spending quite a bit of time on FuXi mainly because I am interested in empirical evidence to support a school of thought which claims that Description Logic-based inference (Tableaux-based inference) will never scale as well as the Logic Programming equivalent - at least for certain expressive fragments of Description Logic (I say expressive because, even granting the things you cannot express in this subset of OWL-DL, there is much in Horn Normal Form (and Datalog) that you cannot express even in the underlying DL for OWL 1.1). The genesis of this is a paper I read which lays out the theory, but at the time there was no practice to support the claims (at least that I knew of). If you are interested in the details, the paper is "Description Logic Programs: Combining Logic Programs with Description Logic", written by several people who are working in the Rule Interchange Format Working Group.
It is not light reading, but is complementary to some of Bijan's recent posts about DL-safe rules and SWRL.
A follow-up is a paper called "A Realistic Architecture for the Semantic Web", which builds on the DLP paper and argues that the current OWL (Description Logic-based) Semantic Web inference stack is problematic and should instead be stacked on top of Logic Programming, since Logic Programming algorithms have a much richer and more pervasively deployed history (all modern relational databases, Prolog, etc.).
The arguments seem sound to me, so I've essentially been building up FuXi to implement that vision (especially since it employs - arguably - the most efficient production rule inference algorithm). The final piece was a fully-functional implementation of the Description Horn Logic algorithm. Why is this important? The short of it is that the DLP paper outlines an algorithm which takes a (constrained) set of Description Logic expressions and converts them to 'pure' rules.
Normally, Logic Programming N3 implementations pass the OWL tests by using a generic ruleset which captures a subset of the OWL DL semantics; the most common one is owl-rules.n3. DLP flips the script by generating a ruleset specifically for the original DL, instead of feeding OWL expressions into the same generic network. This allows the RETE-UL algorithm to create an even more efficient network, since it will be tailored to the specific bits of OWL actually used.
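As a toy illustration of that flip (this is not FuXi's DLP implementation, just the simplest case of the mapping), the named-class subclass axioms in an OWL graph can each be turned directly into an N3 rule; parsing the FOAF spec URL here assumes it serves an RDF serialization rdflib can detect:

from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

def subclass_axioms_to_rules(owl_graph):
    # e.g. cpr:clinician rdfs:subClassOf cpr:person becomes
    #   { ?x a <...clinician> } => { ?x a <...person> }.
    rules = []
    for sub, sup in owl_graph.subject_objects(RDFS.subClassOf):
        if isinstance(sub, URIRef) and isinstance(sup, URIRef):
            rules.append('{ ?x a <%s> } => { ?x a <%s> }.' % (sub, sup))
    return rules

rules = subclass_axioms_to_rules(Graph().parse('http://xmlns.com/foaf/spec/'))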
For instance, where I used to run through the OWL tests in about 4 seconds, I can now pass them in about 1 second. Before, I would set up a RETE network consisting of the generic ruleset once and run the tests through it (resetting it each time). Now, for each test, I create a custom network and evaluate the OWL test case against it. Even with this extra overhead, it is still 4 times faster! The custom network is trivial in most cases.
Ultimately I would like to be able to use FuXi for generic "Semantic Web" agent machinery, and perhaps even to try some of that programming-by-proof thing that Dijkstra was talking about.
I've somehow found myself wrapped up in this dialog about information resources, their representations, and their relation to RDF. Perhaps it's the budding philosopher in me who finds the problem interesting. There seems to be some controversy about what an appropriate definition for an information resource is. I'm a big fan of not reinventing wheels if they have already been built, tested, and deployed.
The Architecture of the World-Wide Web says:
The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as "information resources."
I know of at least 4 very well-organized upper ontologies which have readily-available OWL representations: SUMO, Cyc, Basic Formal Ontology, and DOLCE. These are the cream of the crop in my opinion (and in the opinion of many others who are more informed about this type of thing). So, let us spend some time investigating where the poorly-defined Web Architecture term fits in these ontologies. This exercise is mostly meant for the purpose of reference. Every well-organized upper ontology will typically have a singular, topmost term which covers everything; this would be (for the most part) the equivalent of owl:Thing and rdf:Resource.
Suggested Upper Merged Ontology (SUMO)
SUMO has a term called "FactualText" which seems appropriate. The definition states:
The class of Texts that purport to reveal facts about the world. Such texts are often known as information or as non-fiction. Note that something can be an instance of FactualText, even if it is wholly inaccurate. Whether something is a FactualText is determined by the beliefs of the agent creating the text.
The SUMO term has the following URI for FactualText (at least in the OWL export I downloaded):
http://reliant.teknowledge.com/DAML/SUMO.owl#FactualText
Climbing up the subsumption tree we have the following ancestral path:
- Text: "A LinguisticExpression or set of LinguisticExpressions that perform a specific function related to Communication, e.g. express a discourse about a particular topic, and that are inscribed in a CorpuscularObject by Humans."
The term Text has multiple parents (LinguisticExpression and Artifact). Following the path upwards from the first parent we have:
- LinguisticExpression: "This is the subclass of ContentBearingPhysical which are language-related. Note that this Class encompasses both Language and the elements of Languages, e.g. Words."
- ContentBearingPhysical: "Any Object or Process that expresses content. This covers Objects that contain a Proposition, such as a book, as well as ManualSignLanguage, which may similarly contain a Proposition."
- Physical "An entity that has a location in space-time. Note that locations are themselves understood to have a location in space-time."
- Entity "The universal class of individuals. This is the root node of the ontology."
Following the path upwards from the second parent we have:
- Artifact: "A CorpuscularObject that is the product of a Making."
- CorpuscularObject: "A SelfConnectedObject whose parts have properties that are not shared by the whole."
- SelfConnectedObject: "A SelfConnectedObject is any Object that does not consist of two or more disconnected parts."
- Object: "Corresponds roughly to the class of ordinary objects. Examples include normal physical objects, geographical regions, and locations of Processes"
Objects are a specialization of Physical, so from here we come to the common Entity ancestor.
Cyc
Cyc has a term called InformationBearingThing:
A collection of spatially-localized individuals, including various actions and events as well as physical objects. Each instance of information-bearing thing (or IBT ) is an item that contains information (for an agent who knows how to interpret it). Examples: a copy of the novel Moby Dick; a signal buoy; a photograph; an elevator sign in Braille; a map ...
The Cyc URI for this term is:
http://sw.cyc.com/2006/07/27/cyc/InformationBearingThing
This term has 3 ancestors: Container-Underspecified, SpatialThing-Localized, and InformationStore. The latter seems most relevant, so we'll traverse its ancestry first:
- InformationStore : "A specialization of partially intangible individual. Each instance of store of information is a tangible or intangible, concrete or abstract repository of information. The information stored in an information store is stored there as a consequence of the actions of one or more agents."
- PartiallyIntangibleIndividual : "A specialization of both individual and partially intangible thing. Each instance of partially intangible individual is an individual that has at least some intangible (i.e. immaterial) component. The instance might be partly tangible (e.g. a copy of a book) and thus be a composite tangible and intangible thing, or it might be fully intangible (e.g. a number or an agreement) and thus be an instance of intangible individual object. "
From here, there are two ancestral paths, so we'll leave it at that (we already have the essence of the definition).
Going back to InformationBearingThing, below is the ancestral path starting from Container-Underspecified:
- Container-Underspecified : "The collection of objects, tangible or otherwise, which are typically conceptualized by human beings for purposes of common-sense reasoning as containers. Thus, container underspecified includes not only the set of all physical containers, like boxes and suitcases, but metaphoric containers as well"
- Area: "The collection of regions/areas, tangible or otherwise, which are typically conceptualized by human beings for purposes of common-sense reasoning as spatial regions."
- Location-Underspecified: Similar definition as Area
- Thing: "thing is the universal collection : the collection which, by definition, contains everything there is. Every thing in the Cyc ontology -- every individual (of any kind), every set, and every type of thing -- is an instance of (see Isa) thing"
Basic Formal Ontology (BFO)
BFO is (as the name suggests) very basic and meant to be an axiomatic implementation of the philosophy of realism. As such, the closest term for an information resource is very broad: Continuant
Definition: An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts.
However, I happen to be quite familiar with an extension of BFO called the Ontology of Biomedical Investigation (OBI) which has an appropriate term (derived from Continuant): information_content_entity
The URI for this term is:
http://obi.sourceforge.net/ontology/OBI.owl#OBI_342
Traversing the (short) ancestral path, we have the following definitions:
- OBI_295 : "An information entity is a dependent_continuant which conveys meaning and can be documented and communicated."
- OBI_321 : "generically_dependent_continuant"
- Continuant : "An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts."
- Entity
The Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE)
DOLCE's closest term for an information resource is information-object:
Information objects are social objects. They are realized by some entity. They are ordered (expressed according to) by some system for information encoding. Consequently, they are dependent from an encoding as well as from a concrete realization. They can express a description (the ontological equivalent of a meaning/conceptualization), can be about any entity, and can be interpreted by an agent. From a communication perspective, an information object can play the role of "message". From a semiotic perspective, it plays the role of "expression".
The URI for this term is:
http://www.loa-cnr.it/ontologies/ExtendedDnS.owl#information-object
Traversing the ancestral path we have:
- non-agentive-social-object: "A social object that is not agentive in the sense of adopting a plan or being acted by some physical agent. See 'agentive-social-object' for more detail."
- social-object: "A catch-all class for entities from the social world. It includes agentive and non-agentive socially-constructed objects: descriptions, concepts, figures, collections, information objects. It could be equivalent to 'non-physical object', but we leave the possibility open of 'private' non-physical objects."
- non-physical-object : "Formerly known as description. A unitary endurant with no mass (non-physical), generically constantly depending on some agent, on some communication act, and indirectly on some agent participating in that act. Both descriptions (in the now current sense) and concepts are non-physical objects."
- non-physical-endurant: "An endurant with no mass, generically constantly depending on some agent. Non-physical endurants can have physical constituents (e.g. in the case of members of a collection)."
- endurant : "The main characteristic of endurants is that all of them are independent essential wholes. This does not mean that the corresponding property (being an endurant) carries proper unity, since there is no common unity criterion for endurants. Endurants can 'genuinely' change in time, in the sense that the very same endurant as a whole can have incompatible properties at different times."
- particular: "AKA 'entity'. Any individual in the DOLCE domain of discourse. The extensional coverage of DOLCE is as large as possible, since it ranges on 'possibilia', i.e. all possible individuals that can be postulated by means of DOLCE axioms. Possibilia include physical objects, substances, processes, qualities, conceptual regions, non-physical objects, collections and even arbitrary sums of objects."
Discussion
The definitions are (in true philosophical form) quite long-winded. However, the point I'm trying to make is:
- A lot of pain has gone into defining these terms
- Each of these ontologies is very richly-axiomatized (for supporting inference)
- Each of these ontologies is available in OWL/RDF
Furthermore, these ontologies were specifically designed to be domain-independent and thus support inference across domains. So, it makes sense to start here for a decent (axiomatized) definition. What is interesting is that SUMO and BFO are the only upper ontologies which treat information resources (or their equivalent term) as strictly 'physical' things. Cyc's definition includes both tangible and intangible things, while DOLCE's definition is strictly intangible (non-physical-endurant).
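Since each of these ontologies ships as OWL/RDF, the tree-climbing done by hand above can also be done mechanically. A minimal sketch with rdflib (the SUMO URL and FactualText URI are the ones quoted earlier; that the document parses directly from that location is an assumption):

from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

def ancestral_path(graph, term):
    """Follow rdfs:subClassOf upward, taking the first named parent at each step."""
    path = [term]
    while True:
        parents = [p for p in graph.objects(term, RDFS.subClassOf) if isinstance(p, URIRef)]
        if not parents or parents[0] in path:   # reached a root (or a cycle)
            return path
        term = parents[0]
        path.append(term)

g = Graph().parse('http://reliant.teknowledge.com/DAML/SUMO.owl')
for ancestor in ancestral_path(g, URIRef('http://reliant.teknowledge.com/DAML/SUMO.owl#FactualText')):
    print(ancestor)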
Some food for thought