Ontological Definitions for an Information Resource

I've somehow found myself wrapped up in this dialog about information resources, their representations, and the relation to RDF. Perhaps it's the budding philosopher in me which finds the problem interesting. There seems to be some controversy about what is an appropriate definition for an information resource. I'm a big fan of not reinventing wheels if they have already been built, tested, and deployed.

The Architecture of the World-Wide Web says:

The distinguishing characteristic of these resources is that all of their essential characteristics can be conveyed in a message. We identify this set as "information resources."

I know of at least 4 very well-organized upper ontologies which have readily-available OWL representations: SUMO, Cyc, Basic Formal Ontology, and DOLCE. These are the cream of the crop in my opinion (and in the opinion of many others who are more informed about this type of thing). So, let us spend some time investigating where the poorly-defined Web Architecture term fits in these ontologies. This exercise is mostly meant for the purpose of reference. Every well-organized upper ontology will typically have a single, topmost term which covers everything. This would be (for the most part) the equivalent of owl:Thing and rdfs:Resource.

Suggested Upper Merged Ontology (SUMO)

SUMO has a term called "FactualText" which seems appropriate. The definition states:

The class of Texts that purport to reveal facts about the world. Such texts are often known as information or as non-fiction. Note that something can be an instance of FactualText, even if it is wholly inaccurate. Whether something is a FactualText is determined by the beliefs of the agent creating the text.

The SUMO term has the following URI for FactualText (at least in the OWL export I downloaded):

http://reliant.teknowledge.com/DAML/SUMO.owl#FactualText

Climbing up the subsumption tree we have the following ancestral path:

  • Text: "A LinguisticExpression or set of LinguisticExpressions that perform a specific function related to Communication, e.g. express a discourse about a particular topic, and that are inscribed in a CorpuscularObject by Humans."

The term Text has multiple parents (LinguisticExpression and Artifact). Following the path upwards from the first parent we have:

  • LinguisticExpression: "This is the subclass of ContentBearingPhysical which are language-related. Note that this Class encompasses both Language and the elements of Languages, e.g. Words."
  • ContentBearingPhysical: "Any Object or Process that expresses content. This covers Objects that contain a Proposition, such as a book, as well as ManualSignLanguage, which may similarly contain a Proposition."
  • Physical: "An entity that has a location in space-time. Note that locations are themselves understood to have a location in space-time."
  • Entity: "The universal class of individuals. This is the root node of the ontology."

Following the path upwards from the second parent we have:

  • Artifact: "A CorpuscularObject that is the product of a Making."
  • CorpuscularObject: "A SelfConnectedObject whose parts have properties that are not shared by the whole."
  • SelfConnectedObject: "A SelfConnectedObject is any Object that does not consist of two or more disconnected parts."
  • Object: "Corresponds roughly to the class of ordinary objects. Examples include normal physical objects, geographical regions, and locations of Processes"

Objects are a specialization of Physical, so from here we come to the common Entity ancestor.
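
The two ancestral paths above can be written down as a small parent map and walked mechanically. What follows is a minimal Python sketch, not anything shipped with SUMO or its OWL export: the PARENTS dictionary is hand-transcribed from the definitions quoted above (and only covers the terms mentioned here), and the ancestors helper is a hypothetical name introduced for illustration.

```python
# Parent links hand-transcribed from the SUMO terms quoted above.
# This is an illustrative subset, not the full SUMO hierarchy.
PARENTS = {
    "FactualText": ["Text"],
    "Text": ["LinguisticExpression", "Artifact"],
    "LinguisticExpression": ["ContentBearingPhysical"],
    "ContentBearingPhysical": ["Physical"],
    "Artifact": ["CorpuscularObject"],
    "CorpuscularObject": ["SelfConnectedObject"],
    "SelfConnectedObject": ["Object"],
    "Object": ["Physical"],
    "Physical": ["Entity"],
    "Entity": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    # Both paths from Text end at the same root.
    print(ancestors("FactualText"))
```

Whichever parent of Text is followed first, the walk passes through Physical and ends at Entity, which is the convergence described above.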

Cyc

Cyc has a term called InformationBearingThing:

A collection of spatially-localized individuals, including various actions and events as well as physical objects. Each instance of information-bearing thing (or IBT ) is an item that contains information (for an agent who knows how to interpret it). Examples: a copy of the novel Moby Dick; a signal buoy; a photograph; an elevator sign in Braille; a map ...

The Cyc URI for this term is:

http://sw.cyc.com/2006/07/27/cyc/InformationBearingThing

This term has 3 direct parents: Container-Underspecified, SpatialThing-Localized, and InformationStore. The last seems most relevant, so we'll traverse its ancestry first:

  • InformationStore : "A specialization of partially intangible individual. Each instance of store of information is a tangible or intangible, concrete or abstract repository of information. The information stored in an information store is stored there as a consequence of the actions of one or more agents."
  • PartiallyIntangibleIndividual : "A specialization of both individual and partially intangible thing. Each instance of partially intangible individual is an individual that has at least some intangible (i.e. immaterial) component. The instance might be partly tangible (e.g. a copy of a book) and thus be a composite tangible and intangible thing, or it might be fully intangible (e.g. a number or an agreement) and thus be an instance of intangible individual object. "

From here, there are two ancestral paths, so we'll leave it at that (we already have the essence of the definition).

Going back to InformationBearingThing, below is the ancestral path starting from Container-Underspecified:

  • Container-Underspecified : "The collection of objects, tangible or otherwise, which are typically conceptualized by human beings for purposes of common-sense reasoning as containers. Thus, container underspecified includes not only the set of all physical containers, like boxes and suitcases, but metaphoric containers as well"
  • Area: "The collection of regions/areas, tangible or otherwise, which are typically conceptualized by human beings for purposes of common-sense reasoning as spatial regions."
  • Location-Underspecified: Similar definition as Area
  • Thing: "thing is the universal collection : the collection which, by definition, contains everything there is. Every thing in the Cyc ontology -- every individual (of any kind), every set, and every type of thing -- is an instance of (see Isa) thing"

Basic Formal Ontology (BFO)

BFO is (as the name suggests) very basic and meant to be an axiomatic implementation of the philosophy of realism. As such, the closest term for an information resource is very broad: Continuant.

Definition: An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts.

However, I happen to be quite familiar with an extension of BFO called the Ontology for Biomedical Investigations (OBI) which has an appropriate term (derived from Continuant): information_content_entity

The URI for this term is:

http://obi.sourceforge.net/ontology/OBI.owl#OBI_342

Traversing the (short) ancestral path, we have the following definitions:

  • OBI_295 : "An information entity is a dependent_continuant which conveys meaning and can be documented and communicated."
  • OBI_321 : "generically_dependent_continuant"
  • Continuant : "An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts."
  • Entity

The Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE)

DOLCE's closest term for an information resource is information-object:

Information objects are social objects. They are realized by some entity. They are ordered (expressed according to) by some system for information encoding. Consequently, they are dependent from an encoding as well as from a concrete realization. They can express a description (the ontological equivalent of a meaning/conceptualization), can be about any entity, and can be interpreted by an agent. From a communication perspective, an information object can play the role of "message". From a semiotic perspective, it plays the role of "expression".

The URI for this term is:

http://www.loa-cnr.it/ontologies/ExtendedDnS.owl#information-object

Traversing the ancestral path we have:

  • non-agentive-social-object: "A social object that is not agentive in the sense of adopting a plan or being acted by some physical agent. See 'agentive-social-object' for more detail."
  • social-object: "A catch-all class for entities from the social world. It includes agentive and non-agentive socially-constructed objects: descriptions, concepts, figures, collections, information objects. It could be equivalent to 'non-physical object', but we leave the possibility open of 'private' non-physical objects."
  • non-physical-object : "Formerly known as description. A unitary endurant with no mass (non-physical), generically constantly depending on some agent, on some communication act, and indirectly on some agent participating in that act. Both descriptions (in the now current sense) and concepts are non-physical objects."
  • non-physical-endurant: "An endurant with no mass, generically constantly depending on some agent. Non-physical endurants can have physical constituents (e.g. in the case of members of a collection)."
  • endurant : "The main characteristic of endurants is that all of them are independent essential wholes. This does not mean that the corresponding property (being an endurant) carries proper unity, since there is no common unity criterion for endurants. Endurants can 'genuinely' change in time, in the sense that the very same endurant as a whole can have incompatible properties at different times."
  • particular: "AKA 'entity'. Any individual in the DOLCE domain of discourse. The extensional coverage of DOLCE is as large as possible, since it ranges on 'possibilia', i.e. all possible individuals that can be postulated by means of DOLCE axioms. Possibilia include physical objects, substances, processes, qualities, conceptual regions, non-physical objects, collections and even arbitrary sums of objects."

Discussion

The definitions are (in true philosophical form) quite long-winded. However, the point I'm trying to make is:

  • A lot of pain has gone into defining these terms
  • Each of these ontologies is very richly-axiomatized (for supporting inference)
  • Each of these ontologies is available in OWL/RDF

Furthermore, these ontologies were specifically designed to be domain-independent and thus support inference across domains. So, it makes sense to start here for a decent (axiomatized) definition. What is interesting is that SUMO and BFO are the only upper ontologies which treat information resources (or their equivalent term) as strictly 'physical' things. Cyc's definition includes both tangible and intangible things, while DOLCE's definition is strictly intangible (non-physical-endurant).

Some food for thought

Chimezie Ogbuji

via Copia

Why Web Architecture Shouldn't Dictate Meaning

This is a very brief demonstration motivated by some principled arguments I've been making over the last week or so regarding Web Architecture dictates which are ill-conceived and may do more damage to the Semantic Web than good. A more fully articulated argument is sketched out in "HTTP URIs are not Without Expense" and "Semiotics of RDF Signs", in particular the argument that most of the httpRange-14 dialog is confusing dereference with denotation. I've touched on some of this before.

Anywho, the URI I've minted for myself is

http://metacognition.info/profile/webwho.xrdf#chime

When you 'dereference' it, the server responds with:

chimezie@otherland:~/workspace/Ontologies$ curl -I http://metacognition.info/profile/webwho.xrdf#chime
HTTP/1.1 200 OK
Date: Thu, 30 Aug 2007 06:29:12 GMT
Server: Apache/2.2.3 (Debian) DAV/2 SVN/1.4.2 mod_python/3.2.10 Python/2.4.4 PHP/4.4.4-8+etch1 proxy_html/2.5 mod_ssl/2.2.3 OpenSSL/0.9.8c mod_perl/2.0.2 Perl/v5.8.8
Last-Modified: Mon, 23 Apr 2007 03:09:22 GMT
Content-Length: 6342
Via: 1.1 www.metacognition.info
Expires: Thu, 30 Aug 2007 07:28:26 GMT
Age: 47
Content-Type: application/rdf+xml

According to TAG dictate, a 'web' agent can assume it refers to a document (yes, apparently an RDF document composed this blog you are reading).

Update: Bijan points out that my example is mistaken. The TAG dictate only allows the assumption to be made of the URI which goes across the wire (the URI with the fragment stripped off). The RDF (FOAF) doesn't make any assertions about this (stripped) URI being a foaf:Person. This is technically correct, however, the concern I was highlighting still holds (albeit it is more likely to confuse folks who are already confused about dereference and denotation). The assumption still gets in the way of 'proper' interpretation. Consider if I had used the FOAF graph URL as the URL for me. Under which mantra would this be taboo? Furthermore, if I wanted to avoid confusing unintelligent agents such as this one above, which URI scheme would I be likely to use? Hmmm...
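
Bijan's point about the fragment can be checked without touching the network. The following is a small Python sketch using the standard library's urllib.parse.urldefrag; the variable names are mine, introduced only for illustration. It shows which part of the URI a client actually requests, and which part is left for the client to interpret on its own.

```python
from urllib.parse import urldefrag

# The URI minted for the person, as given in this post.
PERSON_URI = "http://metacognition.info/profile/webwho.xrdf#chime"

# urldefrag splits a URI into the part a client requests and the
# fragment, which is not sent to the server in the HTTP request.
wire_uri, fragment = urldefrag(PERSON_URI)

if __name__ == "__main__":
    print(wire_uri)   # http://metacognition.info/profile/webwho.xrdf
    print(fragment)   # chime
```

So the 200 response only ever concerns the stripped URI; whatever the full URI denotes has to come from the RDF itself.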

Okay, a more sophisticated semantic web agent parses the RDF and understands (via the referential mechanics of model theory) that the URI denotes a foaf:Person (much more reasonable). This agent is also much better equipped to glean 'meaning' from the model-theoretic statements made about me instead of jumping to binary conclusions.

So I ask you, which agent is hampered by a dictate that has all to do with misplaced pragmatics and nothing to do with semantics? Until we understand that the 'Semantic Web' is not 'Web-based Semantics', Jim Hendler's question, "Where are all the agents?", will continue to go unanswered and Tim Bray's challenge will never be fulfilled.

A little tongue-in-cheek, but I hope you get the point

Chimezie Ogbuji


Linked Data and Overselling the HTTP URI Scheme

So, I'm going to do something which may not be well-received: I'm going to push back (slightly) on the Linked Data movement, because, frankly, I think it is a bit draconian with respect to the way it oversells the HTTP URI scheme (points 2 and 3):

2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information.

There is some interesting overlap as well between this overselling and a recent W3C TAG finding which takes a close look at motivations for 'inventing' URI schemes instead of re-using HTTP. The word 'inventing' seems to suggest that the URI specification discourages the use of URI schemes beyond the most popular one. Does this really only boil down to an argument of popularity?

So, here is an anecdotal story that is based part in fiction and part in fact. A vocabulary author within an enterprise has (at the very beginning) a small domain in mind that she wants to build some consensus around by developing an RDF vocabulary. She doesn't have any authority with regards to web space within (or outside) the enterprise. Does she really have to stop developing her vocabulary until she has selected a base URI from which she can guarantee that something useful can be dereferenced from the URIs she mints for her terms? Is it really the case that her vocabulary has no 'semantic web' value until she does so? Why can't she use the tag scheme (for instance) to identify her terms first and then worry later about the location of the vocabulary definition? After all, those who push HTTP URI schemes as a panacea must be aware that URIs are about identification first and location second (and this latter characteristic is optional).
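
To make the tag scheme option concrete, here is a minimal Python sketch of how such an author might mint term identifiers in the tag:authority,date:specific shape described by RFC 4151. Everything named here is an assumption for illustration: mint_tag_uri is a hypothetical helper, example.org and the vocab/Patient term are placeholders, and the two regular expressions are loose simplifications rather than the full grammar from the RFC.

```python
import re

# Loose checks for the two parts of a tagging entity.
# They are simplifications for illustration, not the full RFC 4151 grammar.
_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")
_AUTHORITY = re.compile(r"^[A-Za-z0-9.\-]+(@[A-Za-z0-9.\-]+)?$")


def mint_tag_uri(authority, date, specific):
    """Build a URI of the form tag:<authority>,<date>:<specific>."""
    if not _AUTHORITY.match(authority):
        raise ValueError("authority should be a DNS name or e-mail address")
    if not _DATE.match(date):
        raise ValueError("date should be YYYY, YYYY-MM, or YYYY-MM-DD")
    return "tag:{0},{1}:{2}".format(authority, date, specific)


if __name__ == "__main__":
    # Placeholder authority and term name; no web space is needed to mint this.
    print(mint_tag_uri("example.org", "2007", "vocab/Patient"))
```

Nothing in this requires control over a web server: the identifier is usable in RDF statements immediately, and where the vocabulary definition lives can be stated separately later.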

Over the years, I've developed an instinct to immediately question arguments that suggest a monopoly on a particular approach. This seems to be the case here. Proponents of an HTTP URI scheme monopoly for follow-your-nose mechanics (or auto-discovery of useful RDF data) seem to suggest (quite strongly) that using anything besides the HTTP URI scheme is bad practice, without actually saying so. So, if this is not the case, my original question remains: is it just a URI scheme popularity contest? If the argument is to make it easy for clients to build web closure then I've argued before that there are better ways to do this without stressing the protocol with brute force and unintelligent term 'sniffing'.

It seems to be a much better approach to be unambiguous about the trail left for software agents by using an explicit term (within a collection of RDF statements) to point to where additional useful information can be retrieved for said collection of RDF statements. There is already decent precedent in terms such as rdfs:seeAlso and rdfs:isDefinedBy. However, these terms are very poorly defined and woefully abused (the latter term especially).

Interestingly, I was introduced to this "meme" during a thread on the W3C HCLS IG mailing list about the value of the LSID URI scheme and whether it is redundant with respect to HTTP. I believe this disconnect was part of the motivation behind the recent TAG finding: URNs, Namespaces and Registries. Proponents of an HTTP URI scheme monopoly should educate themselves (as I did) on the real problems faced by those who found it necessary to 'invent' a URI scheme to meet needs they felt were not properly addressed by the mechanics of the HTTP protocol. They reserve that right, as the URI specification does not endorse any monopolies on schemes. See: LSID Pros & Cons

Frankly, I think fixing what is broken with rdfs:isDefinedBy (and pervasive use of rdfs:seeAlso - FOAF networks do this) is sufficient for solving the problem that the Linked Data theme is trying to address, but much less heavy-handedly. What we want is a way to say:

this collection of RDF statements is 'defined' (ontologically) by these other collections of RDF statements.

Or we want to say (via rdfs:seeAlso):

with respect to this current collection of RDF statements you might want to look at this other collection

It is also worth noting the FOAF namespace URI issues which recently 'broke' Protege. It appears some OWL tools (Protege - at the time) were making the assumption that the FOAF OWL RDF graph would always be resolvable from the base namespace URI of the vocabulary: http://xmlns.com/foaf/0.1/ . At some point, recently, the namespace URI stopped serving up the OWL RDF/XML from that URI and instead served up the specification. Nowhere in the human-readable specification (which - during that period - was what was being served up from that URI) is there a declaration that the OWL RDF/XML is served up from that URI. The only explicit link is to: http://xmlns.com/foaf/spec/20070114.rdf

However, how did Protege come to assume that it could always get the FOAF OWL RDF/XML from the base URI? I'm not sure, but the short of it was that any vocabulary which referred to FOAF (at that point) could not be read by Protege (including my foundational ontology for Computerized Patient Records - which has since moved away from using FOAF for reasons that included this break in Protege).

The problem here is that Protege should not have been making that assumption but should have (instead) only attempted to assume an OWL RDF/XML graph could be dereferenced from a URI if that URI is the object of an owl:imports statement. I.e.,

http://example.com/ont owl:imports http://xmlns.com/foaf/spec/20070114.rdf

This is unambiguous, as owl:imports is very explicit about what the URI at the other end points to. If you set up semantic web clients to assume they will always get something useful from the URI used within an RDF statement, or that HTTP-schemed URIs in an RDF statement are always resolvable, then you set them up for failure or at least a lot of unnecessary web crawling in random directions.
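
As a rough illustration of the alternative, the sketch below shows a client that only considers a URI worth dereferencing when it is the object of an explicit pointer such as owl:imports or rdfs:seeAlso. It is a simplified Python sketch under stated assumptions: triples are plain tuples of strings rather than a real RDF library's data structures, uris_to_fetch is a hypothetical helper, and the http://example.com/ URIs are placeholders. No network access is attempted.

```python
OWL_IMPORTS = "http://www.w3.org/2002/07/owl#imports"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Predicates that explicitly say "there is something to retrieve here".
EXPLICIT_POINTERS = {OWL_IMPORTS, RDFS_SEE_ALSO}


def uris_to_fetch(triples, pointers=EXPLICIT_POINTERS):
    """Return object URIs of triples whose predicate is an explicit pointer."""
    targets = []
    for subject, predicate, obj in triples:
        if predicate in pointers and obj not in targets:
            targets.append(obj)
    return targets


# A toy graph: one explicit import and one ordinary statement.
GRAPH = [
    ("http://example.com/ont", OWL_IMPORTS,
     "http://xmlns.com/foaf/spec/20070114.rdf"),
    ("http://example.com/ont#someone", FOAF_KNOWS,
     "http://example.com/people#other"),
]

if __name__ == "__main__":
    # Only the owl:imports target is selected for retrieval.
    print(uris_to_fetch(GRAPH))
```

The subject and object URIs of the foaf:knows statement are never requested, so the client neither crawls in random directions nor breaks when a namespace URI starts serving something else.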

My $0.02

Chimezie Ogbuji
