Coarser Grain Linked CRUD Data

I've written before about my thoughts on the Linked Data movement and my concerns with its mandate regarding the kinds of URIs you use in your RDF. Though well intentioned, that particular constraint is a bit too tight and doesn't consider the (hidden) expense and distraction of forcing a producer of RDF content to ensure that all of his or her URIs have a web presence.

However, I've been thinking about what a RESTful protocol for interacting with an RDF dataset would look like, and maybe requiring HTTP URIs works better at a different level of granularity.

The Semantic Web is meant to be an extension of the Web.  The architecture of the web is abstracted in the REST style. 

The REST interface is designed to be efficient for large-grain hypermedia data. 

So, if the Semantic Web is an extension of the Web, then RESTful protocol interactions with RDF content should also occur at a large grain. An abundance of dereferenceable URIs as the terms of the RDF statements in a graph, and random outbound traversal along them, can lead to an unnecessary (and very redundant) load on the origin server if the consumer of the graph assumes all RDF vocabulary tokens denote things with a web presence. The cacheability of the REST style (due to its statelessness) does offset some of this burden, but consider the hypertext analogy of (fine-grained) Linked Data. Imagine if a browser were to automatically load all the outbound a/@href links in an (X)HTML document, independent of which ones the user chooses to click. Even with the ability to cache responses to those requests, it would still be an unnecessary burden on the browser (and the origin server).

Now consider applying the same principle at a coarser, larger grain: the URIs of the named graphs in an RDF dataset. The relationship between an RDF graph in an RDF dataset, the RDF document that serializes that graph, and the referent of the graph URI (the thing it identifies) is confusing. However, it can be better understood if you think of the relationship between the RDF graph (a knowledge representation) and its mathematical interpretations.
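
To make that distinction concrete, here is a minimal sketch of such a dataset using Python's rdflib; the graph URIs and namespaces are made-up examples, not anything prescribed by the post:

from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")  # hypothetical base namespace

ds = Dataset()

# One named graph per person: the graph URI names the graph itself,
# while the triples inside it describe the person (a different referent).
alice = ds.graph(URIRef("http://example.org/graphs/alice"))
alice.add((EX.alice, FOAF.name, Literal("Alice")))
alice.add((EX.alice, FOAF.knows, EX.bob))

bob = ds.graph(URIRef("http://example.org/graphs/bob"))
bob.add((EX.bob, FOAF.name, Literal("Bob")))

# The TriG document printed here is the kind of concrete representation
# that would be exchanged over HTTP when manipulating these graphs.
print(ds.serialize(format="trig"))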

The RDF document, the concrete syntax representation of an RDF graph, is what is passed around over HTTP. The graph denotes some semantic web knowledge (interpreted via mathematical logic), and we might want to manipulate that knowledge through the request for and dispatching of RDF documents over HTTP in various formats: N3, Turtle, N-Triples, TriX, TriG, RDF/XML.
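
As a rough illustration of that kind of interaction (reusing the hypothetical graph URI from the sketch above, and assuming the requests and rdflib libraries are available), a consumer might negotiate the concrete syntax when dereferencing a graph URI:

import requests
from rdflib import Graph

# Map negotiated media types to the parser names rdflib expects
FORMATS = {"text/turtle": "turtle", "application/rdf+xml": "xml", "text/n3": "n3"}

def fetch_graph(graph_uri, media_type="text/turtle"):
    """Dereference a graph URI, asking the server for a particular RDF serialization."""
    response = requests.get(graph_uri, headers={"Accept": media_type})
    response.raise_for_status()
    graph = Graph()
    graph.parse(data=response.text, format=FORMATS[media_type])
    return graph

# The same knowledge, requested as two different concrete documents
alice_as_turtle = fetch_graph("http://example.org/graphs/alice", "text/turtle")
alice_as_rdfxml = fetch_graph("http://example.org/graphs/alice", "application/rdf+xml")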

If I wanted to implement a Facebook social network as a semantic web, I would store all the knowledge / information about a person in an RDF graph so it can be managed over the hypermedia protocol of the Web. More expressive RDF vocabularies such as RDFS, OWL2-RL, OWL2-DL, etc. can be used to describe Facebook content as a complex domain. The domain can describe, for instance, the things a Facebook account holder has asserted knows and likes relationships against. So, I'd definitely want to be able to get useful information from requests and responses over HTTP against Facebook account identifiers. The transitive (Web) closure of such a Facebook graph along a fb:knows predicate, up to a certain recursion depth, would be useful to have. This is an example of a graph link that is useful to a web agent capable of interpreting RDF content: a semantic web agent.
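
A rough sketch of that bounded closure follows, reusing the fetch_graph helper above; the fb:knows vocabulary and the assumption that each fb:knows link points at a friend's graph URI are illustrative inventions, not anything Facebook actually exposes:

from rdflib import Graph, Namespace, URIRef

FB = Namespace("http://example.org/facebook/vocab#")  # hypothetical vocabulary

def knows_closure(start_graph_uri, max_depth=2):
    """Merge the graphs reachable along fb:knows, up to max_depth hops out."""
    merged = Graph()
    seen = set()
    frontier = [(URIRef(start_graph_uri), 0)]
    while frontier:
        graph_uri, depth = frontier.pop()
        if graph_uri in seen:
            continue
        seen.add(graph_uri)
        person_graph = fetch_graph(graph_uri)  # one coarse-grained request per person
        merged += person_graph
        if depth < max_depth:
            for friend in person_graph.objects(predicate=FB.knows):
                frontier.append((friend, depth + 1))
    return merged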

The hypertext analogy of coarse-grained Linked Data would basically be your current experience browsing Facebook content in your browser: you (for the most part) see updates only from the people you know, not everything that was said by anybody in the complete Facebook dataset. However, the RDF descriptions of the people known by those you know, and of the things described by your friends and their friends, are useful and worth an attempt to 'interpret' in order to determine (for instance) how to display them in the browser, or to draw other such entailments from the descriptions of the content.
4 responses
Thoughts of an architecture for SW "scutters"
As you know I'm on the periphery of SemWeb these days, but from my understanding of LOD, isn't their approach to avoiding the large footprint problem you correctly identify with the transitive closure of fine-grained links to use fragments at the lowest levels of detail? I suppose the idea would be to "hash not slash" at that level of granularity so that same-origin behavior reduces load?

Even though I'm one of the crew behind PURL2 (see e.g. [1]), my gut feel has always been that I'd want to use PURLs for broad and durable concepts, and I especially tend to use PURLs for resource and concept types.

I've pondered a convention of, say, using a PURL for the broad concept of a clinical trial (http://purl.org/com/example/clinical-trial) and then using fragments in a local identifier space (e.g. FDA) to identify each trial (http://purl.org/com/example/clinical-trial#NCT01086059). This is just a conceptual example. I'm not taking up the question of whether a clinical trial is in itself a big enough concept for a PURL of its own.

Anyway, you do raise good points for discussion, and continued exploration, as usual.

[1] http://freegovinfo.info/node/2971

Well, the LOD approach currently encourages usage that causes the large footprint problem. The use of hash identifiers offsets some of the problem by allowing a browser or intermediary to cache the entire representation, so subsequent URIs with the same base but a different fragment can be served from the cache. However, that is only manageable if you use a few vocabularies (foaf and dc, for instance), so only a few initial loads are necessary. If a significant number of vocabularies are used in your instance data, you might need to pay at least the initial cost of retrieving the base document for each vocabulary (whether or not the terms being used in the instance RDF graph are relevant to what the agent is doing with the graph). Also, if a representation is large you still pay the price of fetching it, even if only the first time, and it is not the case that fetching the entire vocabulary is necessary to glean useful information from the instance data.

For instance, you don't need all of the FOAF ontology to know that foaf:knows is a transitive property; all you need is:


foaf:knows a owl:TransitiveProperty .


I've pondered a convention of, say, using a PURL for the broad concept of a clinical trial (http://purl.org/com/example/clinical-trial) and then using fragments in a local identifier space (e.g. FDA) to identify each trial (http://purl.org/com/example/clinical-trial#NCT01086059)

Yes, this is the standard hash approach to allocating web space to RDF terms. By itself it appeals to the cacheability of REST, but what worries me is the cumulative use of multiple vocabularies like this, together with the expectation that looking up the URIs is the primary way a consuming web agent gleans useful information. It could plausibly be used as a DoS attack via naive linked data agents, for instance.
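
To illustrate the offset that hash URIs do buy, here is a naive sketch of an agent that strips the fragment and caches by base document, so a term like http://purl.org/com/example/clinical-trial#NCT01086059 costs at most one fetch of its base document; the in-memory cache and the requests/rdflib dependencies are assumptions of the sketch, not part of the discussion above:

from urllib.parse import urldefrag

import requests
from rdflib import Graph

_document_cache = {}  # naive in-memory cache, keyed by base document URI

def describe_term(term_uri):
    """Fetch (at most once) the document a hash-style term URI lives in, as a graph."""
    document_uri, _fragment = urldefrag(term_uri)  # '#NCT01086059' never reaches the wire
    if document_uri not in _document_cache:
        response = requests.get(document_uri, headers={"Accept": "text/turtle"})
        response.raise_for_status()
        graph = Graph()
        graph.parse(data=response.text, format="turtle")
        _document_cache[document_uri] = graph
    return _document_cache[document_uri]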
