Photos on the cloud, and your metadata

Every now and then I cast an eye about to see the state of the art on photo storage, sharing and backup.  Like most of us I have far more digital photos than I know what to do with.  For the most part we manage the lot in iPhoto on my wife's iMac.  It's getting to the point where iPhoto is struggling to keep up and I've pondered Lightroom, but it's still a bit of overkill, I think.  For now we're just using the various tricks of the trade to boost performance.  I think the next step will be to move the iPhoto library to an SSD.  Time to start saving up!

Given my technical background, one of the biggest things I look for in photo management of all sorts is preservation of metadata.  If you are not familiar with photo metadata, you should really acquaint yourself.  It's also worth understanding why it's important to separate photo sharing from storage.  Whether it's the EXIF data recorded by the camera itself, or supplementary metadata added, sometimes out of band, by management apps (e.g. face matches, titles & descriptions you add yourself in iPhoto or other tools), it's really important that software respect what's there as much as possible, adding layers of metadata non-destructively.

Alas, this is one area where cloud photo services fail miserably.  I think the most pernicious case of this is Dropbox, which is such a handy service for the most part, but which I think is nothing short of evil with regard to photos.  First of all it is loud and persistent in pestering you to switch to its photo import and storage module every time you connect a memory card or such to your computer (I understand: they want to nudge people in a direction that leads to paying more for storage).  The problem is that if you make the mistake of succumbing to their come-ons, you'll find that they happily mangle and destroy any photo metadata that precedes them.  The comments on their blog entries about the photo features are full of customers complaining about this abuse, but they don't seem to be listening.  They are not alone.  Google Picasa also mangles metadata.  Facebook surprises me by actually trying to do the right thing, and getting a bit tied up in knots as a result.

For now I'm sticking with iPhoto, and I'll copy photos from there to Dropbox, Facebook, etc. as needed for sharing.  I'm also trying out AeroFS, and hoping for good things from them, from the general perspective of meddling-free file distribution and sharing.  I hope more people get familiar with the issues here (there are real consequences to having your photo metadata mangled), and that it adds up to a voice in the marketplace for better solutions, including on the cloud.

A Relational Model for FOL Persistence

A short while ago I was rather engaged in investigating the most efficient way to persist RDF on Relational Database Management Systems. One of the outcomes of this effort that I have yet to write about is a relational model for Notation 3 abstract syntax and a fully functioning implementation - which is now part of RDFLib's MySQL drivers.

The model is written up with Soft4Science's SciWriter and seems to render natively in Firefox alone (I haven't tried any other browser).

Originally, I kept coming at it from a pure Computer Science approach (programming and data structures) but eventually had to roll up my sleeves and get down to the formal logic level (i.e., the Deconstructionist, Computer Engineer approach).

Partitioning the KR Space

The first method, and the one with the most impact, was separating Assertional Box statements (statements of class membership) from the rest of the Knowledge Base. When I say Knowledge Base, I mean a 'named' aggregation of all the named graphs in an RDF database. Partitioning the table space has a universal effect of shortening indices and reducing the average number of rows that need to be scanned, even in the worst-case scenario for a SQL optimizer. The nature of RDF data (at the syntactic level) is a major factor. RDF is a Description Logics-oriented representation and thus relies heavily on statements of class membership.

The relational model is all about representing everything as specific relations, and the 'instantiation' relationship is a perfect candidate for a database table.

Eventually, it made sense to create additional table partitions for:

  • RDF statements between resources (where the object is not an RDF Literal).
  • RDF's equivalent to EAV statements (where the object is a value or RDF Literal).

Matching Triple Patterns against these partitions can be expressed using a decision tree which accommodates every combination of RDF terms. For example, a triple pattern:

?entity foaf:name "Ikenna"

would only require a scan through the indices for the EAV-type RDF statements (or the whole table if necessary - but that decision is up to the underlying SQL optimizer).
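To make the decision tree a bit more concrete, here is a minimal Python sketch of how a triple pattern might be routed to the partitions described above. It is only an illustration under my own assumptions, not the code in the RDFLib driver: the partition names ("type", "relation", "literal"), the helper functions, and the convention of writing variables with a leading ? and literals with a leading double quote are all hypothetical.

```python
RDF_TYPE = "rdf:type"

def is_variable(term):
    # Assumption: variables are written with a leading '?'
    return term.startswith("?")

def is_literal(term):
    # Assumption: literals are written with a leading double quote
    return term.startswith('"')

def candidate_partitions(subject, predicate, obj):
    """Return the (hypothetical) partition names a triple pattern could match.

    The subject is accepted for completeness but does not affect the choice
    in this simplified sketch.
    """
    # Statements of class membership live in their own partition.
    if predicate == RDF_TYPE:
        return ["type"]
    obj_bound = not is_variable(obj)
    # A bound literal object can only occur in the EAV-style partition.
    if obj_bound and is_literal(obj):
        return ["literal"]
    partitions = []
    # An unbound predicate might turn out to be rdf:type.
    if is_variable(predicate):
        partitions.append("type")
    partitions.append("relation")
    # An unbound object might turn out to be a literal.
    if not obj_bound:
        partitions.append("literal")
    return partitions

# The example pattern from the text only touches the EAV-style partition.
print(candidate_partitions("?entity", "foaf:name", '"Ikenna"'))
```

The more terms that are bound, the fewer partitions need to be consulted; a fully unbound pattern falls back to all three.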

Using Term Type Enumerations

The second method involves the use of the enumeration of all the term types as an additional column whose indices are also available for a SQL query optimizer. That is:

ANY_TERM = ['U','B','F','V','L']

The terms can be partitioned into the exact allowable set for certain kinds of RDF terms:

ANY_TERM = ['U','B','F','V','L']
CONTEXT_TERMS   = ['U','B','F']
IDENTIFIER_TERMS   = ['U','B']
GROUND_IDENTIFIERS = ['U']
NON_LITERALS = ['U','B','F','V']
CLASS_TERMS = ['U','B','V']
PREDICATE_NAMES = ['U','V']

NAMED_BINARY_RELATION_PREDICATES = GROUND_IDENTIFIERS
NAMED_BINARY_RELATION_OBJECTS    = ['U','B','L']

NAMED_LITERAL_PREDICATES = GROUND_IDENTIFIERS
NAMED_LITERAL_OBJECTS    = ['L']

ASSOCIATIVE_BOX_CLASSES    = GROUND_IDENTIFIERS

For example, the Object term of an EAV-type RDF statement doesn't need an associated column for the kind of term it is (the relation is explicitly defined as those RDF statements where the Object is a Literal - L).
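As a rough sketch of how these enumerations could be put to work, the snippet below checks a term type against the allowed set for a position and decides whether a term-type column is needed at all. The two helper functions are hypothetical names of my own, and reading the letters as U = URI, B = blank node, F = formula, V = variable, L = literal is my interpretation rather than something spelled out above.

```python
# Term type sets copied from the listing above.
GROUND_IDENTIFIERS = ['U']
NAMED_BINARY_RELATION_OBJECTS = ['U', 'B', 'L']
NAMED_LITERAL_OBJECTS = ['L']

def needs_term_type_column(allowed_types):
    """A type column only carries information if more than one type is allowed."""
    return len(set(allowed_types)) > 1

def check_term_type(term_type, allowed_types):
    """Return term_type if it is permitted in this position, else raise ValueError."""
    if term_type not in allowed_types:
        raise ValueError(
            "term type %r not allowed here (expected one of %r)"
            % (term_type, allowed_types)
        )
    return term_type

# The object of an EAV-type statement is always a literal, so no column is needed.
print(needs_term_type_column(NAMED_LITERAL_OBJECTS))
# Objects in the other relation can be of several kinds, so the column earns its keep.
print(needs_term_type_column(NAMED_BINARY_RELATION_OBJECTS))
```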

Efficient Skolemization with Hashing

Finally, thanks to Benjamin Nowack's related efforts with ARC (a PHP-based implementation of an RDF / SPARQL storage system), Mark Nottingham's suggestion, and an earlier paper by Stephen Harris and Nicholas Gibbins, 3store: Efficient Bulk RDF Storage, a final method was employed: storing a half-hash (half of an MD5 hash) of the RDF identifiers in the 'statement' tables in place of the identifiers themselves. Each statement table used an unsigned MySQL BIGINT to hold the half-hash in base 10, serving as a foreign key into two separate tables:

  • A table for identifiers (with a column that enumerated the kind of identifier it was)
  • A table for literal values

The key to both tables was the 8-byte (64-bit) unsigned integer which represented the half-hash, which is the size of a MySQL BIGINT.

This of course introduces a possibility of collision (due to the reduced hash size), but hashing the identifier together with its term type further dilutes the lexical space and reduces the collision risk. This latter part is still a theory I haven't formally proven (or disproven) but hope to. At the maximum volume (around 20 million RDF assertions) I can resolve a single triple pattern in 8 seconds on an SGI machine with no collisions - the implementation includes a collision detection mechanism (disabled by default).
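For the curious, a half-hash along these lines can be sketched in a few lines of Python. This is an approximation under my own assumptions rather than the driver's actual code: I take the first 16 hex digits (64 bits) of the MD5 digest, I append the term type to the lexical form before hashing, and the TermTable class is a hypothetical in-memory stand-in for the identifier and literal tables with a simple collision check.

```python
import hashlib

def half_hash(lexical_form, term_type):
    """Return the first 64 bits of MD5(lexical_form + term_type) as an integer."""
    digest = hashlib.md5((lexical_form + term_type).encode("utf-8")).hexdigest()
    # 16 hex digits = 8 bytes, which fits an unsigned MySQL BIGINT.
    return int(digest[:16], 16)

class TermTable:
    """Hypothetical in-memory lookup table keyed by half-hash."""

    def __init__(self):
        self._rows = {}

    def intern(self, lexical_form, term_type):
        """Store the term if new, verify it if already present, return its key."""
        key = half_hash(lexical_form, term_type)
        existing = self._rows.setdefault(key, (lexical_form, term_type))
        # Collision detection: the same key must always map to the same term.
        if existing != (lexical_form, term_type):
            raise RuntimeError("half-hash collision on key %d" % key)
        return key

table = TermTable()
key = table.intern("http://xmlns.com/foaf/0.1/name", "U")
print(key)  # a base-10 integer below 2**64
```

Because the term type is part of the hashed string, a URI and a literal with the same lexical form end up under different keys.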

The implementation includes all the magic needed to generate SQL statements to create, query, and manage indices for the tables in the relational model. It does this from a Python model that encapsulates the relational model and methods to carry out the various SQL-level actions needed by the underlying DBMS.

For me, it has satisfied my needs for an open-source, maximally efficient RDBM upon which large volume RDF can be persisted, within named graphs, with the ability to persist Notation 3 formulae in a separate manner (consistent with Notation 3 semantics).

I called the Python module FOPLRelationModel because, although it is specifically a relational model for Notation 3 syntax, it covers many of the requirements for the syntactic representation of First Order Logic in general.

Chimezie Ogbuji

via Copia

My Definition of Semi-structured Data

Dissatisfied with the definitions of Semi-structured data I've found, I decided to roll for dolo:

Semi-structured data are data that are primarily not suitable for representation with the relational model. Often such data are better expressed by less restrictive forms of propositional logic which might be as simple (and open-ended) as a hierarchical model (XML for instance) or as complex as 'proper' First-order Logic.

Chimezie Ogbuji

via Copia