Finding URLs in plain text

John Gruber put in some good work to derive and test a regex to extract URLs from plain text.

"An Improved Liberal, Accurate Regex Pattern for Matching URLs"

I needed to use it today and found it needs a bit of care to translate for use in Python, especially with regard to its Unicode characters.  Here is my Python version, with a super-simple harness to use Gruber's test page:

I'm not entirely sure I've translated the original with 100% fidelity, but this has worked fine for my purposes.  I'm open to tweaks or suggestions, and will keep the Gist updated.
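To give a feel for the kind of care involved, here is a minimal sketch of URL extraction in Python 3. It is a deliberately simplified stand-in and not Gruber's pattern: the names `URL_RE`, `TRAILING_PUNCT`, and `find_urls` are my own, and the rules (match `http(s)://` or `www.`, then trim trailing punctuation and unbalanced closing parentheses) are assumptions chosen for illustration. The non-ASCII quote characters are written as escapes so the source file's encoding doesn't matter.

```python
import re

# Simplified stand-in, NOT Gruber's full regex: a scheme or "www." prefix
# followed by anything that is not whitespace, an angle bracket, or a quote.
URL_RE = re.compile(r'\b(?:https?://|www\.)[^\s<>"]+', re.IGNORECASE)

# Punctuation that usually belongs to the sentence rather than the URL,
# including a few Unicode closing quotes (», ”, ’) written as escapes.
TRAILING_PUNCT = ".,;:!?]}'\u00bb\u201d\u2019"


def find_urls(text):
    """Return URL-like substrings of text, with trailing punctuation trimmed."""
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        while url:
            last = url[-1]
            if last in TRAILING_PUNCT:
                url = url[:-1]
            elif last == ")" and url.count(")") > url.count("("):
                # Drop a closing paren only when it has no opening partner.
                url = url[:-1]
            else:
                break
        urls.append(url)
    return urls


sample = "See http://example.com/foo_(bar), and (www.example.org)."
print(find_urls(sample))
```

The balanced-parenthesis check is what keeps URLs such as `http://example.com/foo_(bar)` intact while still dropping a parenthesis that merely encloses the URL in the surrounding sentence.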

Store-agnostic REGEX Matching and Thread-safe Transactional Support in rdflib

[by Chimezie Ogbuji]

rdflib now supports (checked into svn trunk) REGEX matching of RDF terms and thread-safe transactions. The transactional wrapper provides Atomicity and Isolation, but not Durability: the list of reversal RDF operations is stored on the live instance, so it won't survive a system failure. The store implementation is responsible for Consistency.
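The reversal-list idea can be sketched independently of rdflib. The following is a minimal illustration under my own assumptions, not the actual wrapper's code: `TinyStore` and `UndoLogStore` are hypothetical names, triples are plain tuples, and a single re-entrant lock stands in for whatever thread-safety mechanism the real implementation uses.

```python
import threading


class TinyStore:
    """Hypothetical in-memory triple store (a stand-in, not rdflib's API)."""

    def __init__(self):
        self.triples = set()

    def add(self, triple):
        self.triples.add(triple)

    def remove(self, triple):
        self.triples.discard(triple)


class UndoLogStore:
    """Hypothetical proxy that records a reversal for every effective change."""

    def __init__(self, store):
        self.store = store
        self._undo = []  # reversal operations, held in memory only
        self._lock = threading.RLock()

    def add(self, triple):
        with self._lock:
            if triple not in self.store.triples:
                self._undo.append(("remove", triple))
            self.store.add(triple)

    def remove(self, triple):
        with self._lock:
            if triple in self.store.triples:
                self._undo.append(("add", triple))
            self.store.remove(triple)

    def commit(self):
        with self._lock:
            self._undo.clear()  # changes become permanent; nothing to reverse

    def rollback(self):
        with self._lock:
            # Replay the reversals newest-first to restore the prior state.
            while self._undo:
                operation, triple = self._undo.pop()
                getattr(self.store, operation)(triple)


store = TinyStore()
tx = UndoLogStore(store)
tx.add(("s1", "p", "o"))
tx.commit()
tx.add(("s2", "p", "o"))
tx.remove(("s1", "p", "o"))
tx.rollback()
print(store.triples)
```

Because the undo list lives on the Python object, a crash between a change and its commit loses the list, which is why a design like this gives no Durability.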

The REGEX wrapper provides a REGEXTerm which can be used in any of the RDF term 'slots' with:

It replaces any REGEX term with a wildcard (None) and performs the REGEX match after the query invocation is dispatched to the store implementation it is wrapping.
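As a rough sketch of that wildcard-then-filter approach, assuming a toy store rather than rdflib's real classes: `RegexTerm`, `TinyStore`, and `RegexMatchingStore` below are hypothetical names, and using `re.match` (anchored at the start of the term) is my assumption about the matching semantics.

```python
import re


class RegexTerm:
    """Hypothetical marker for a triple-pattern slot that holds a regex."""

    def __init__(self, pattern):
        self.compiled = re.compile(pattern)


class TinyStore:
    """Hypothetical store that only understands exact terms and None."""

    def __init__(self, triples):
        self._triples = list(triples)

    def triples(self, pattern):
        for triple in self._triples:
            if all(p is None or p == v for p, v in zip(pattern, triple)):
                yield triple


class RegexMatchingStore:
    """Hypothetical proxy that adds regex support on top of any such store."""

    def __init__(self, store):
        self.store = store

    def triples(self, pattern):
        # Step 1: swap every regex term for the wildcard the store knows.
        plain = tuple(None if isinstance(p, RegexTerm) else p for p in pattern)
        # Step 2: dispatch to the wrapped store, then filter its results.
        for triple in self.store.triples(plain):
            if all(
                not isinstance(p, RegexTerm) or p.compiled.match(v)
                for p, v in zip(pattern, triple)
            ):
                yield triple


data = [
    ("http://del.icio.us/chimezie", "title", "del.icio.us/chimezie"),
    ("http://example.org/other", "title", "something else"),
]
wrapped = RegexMatchingStore(TinyStore(data))
print(list(wrapped.triples((RegexTerm(r".*zie$"), None, None))))
```

The trade-off of this design is that the wrapped store may return many more triples than finally match, since the regex is applied only after the store has answered the widened query.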

Both are meant to work with a live instance of an RDF Store, but behave as a proxy for the store (providing REGEX and/or transactional support).

For example:

from rdflib.Graph import ConjunctiveGraph, Graph
from rdflib.store.REGEXMatching import REGEXTerm, REGEXMatching
from rdflib.store.AuditableStorage import AuditableStorage
from rdflib.store import Store
from rdflib import plugin, URIRef, Literal, BNode, RDF

# Create the underlying in-memory store.
store = plugin.get('IOMemory', Store)()
# Wrap it for REGEX matching, then wrap that for transactional support.
regexStorage = REGEXMatching(store)
txRegex = AuditableStorage(regexStorage)

g = Graph(txRegex, identifier=URIRef('http://del.icio.us/rss/chimezie'))
g.load("http://del.icio.us/rss/chimezie")
# Triples whose subject matches the regex; predicate and object are wildcards.
print len(g), [t for t in g.triples((REGEXTerm('.*zie$'), None, None))]
# Undo everything loaded since the transaction began.
g.rollback()
print len(g), [t for t in g]

Results in:

492 [(u'http://del.icio.us/chimezie', u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', u'http://purl.org/rss/1.0/channel'), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/link', u'http://del.icio.us/chimezie'), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/items', u'QQxcRclE1'), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/description', u''), (u'http://del.icio.us/chimezie', u'http://purl.org/rss/1.0/title', u'del.icio.us/chimezie')]
0 []

[Chimezie Ogbuji]

via Copia