Finding URLs in plain text

John Gruber put in some good work to derive and test a regex to extract URLs from plain text.

"An Improved Liberal, Accurate Regex Pattern for Matching URLs"

I needed to use it today and found it needs a bit of care to translate for use in Python, especially with regard to its Unicode characters.  Here is my Python version, with a super-simple harness to use Gruber's test page:

I'm not entirely sure I've translated the original with 100% fidelity, but this has worked fine for my purposes.  I'm open to tweaks or suggestions, and will keep the Gist updated.
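As a rough illustration of the Unicode issue, here is a deliberately simplified sketch. It is not Gruber's pattern and not the Gist; the pattern, the name `SIMPLE_URL_RE` and the helper `find_urls` are assumptions made up for this example. The only point it makes is that typographic quotes and guillemets can be written as `\u` escapes, so the source file's encoding can't corrupt them.

```python
import re

# Simplified stand-in pattern, NOT Gruber's regex: an http(s) scheme followed
# by non-space characters, ending on a character that is not trailing
# punctuation. The typographic quotes and guillemets are written as \u
# escapes, which the re module resolves itself in str patterns.
SIMPLE_URL_RE = re.compile(
    r"https?://[^\s<>]*[^\s<>.,;:!?'\"()\[\]{}\u00ab\u00bb\u201c\u201d\u2018\u2019]"
)


def find_urls(text):
    """Return every substring of text that the simplified pattern matches."""
    return SIMPLE_URL_RE.findall(text)


# Made-up sample: one URL wrapped in curly quotes, one in parentheses.
sample = "See \u201chttp://example.com/a?b=1\u201d, or (https://example.org/x)."
print(find_urls(sample))
# → ['http://example.com/a?b=1', 'https://example.org/x']
```

Unlike the original pattern, this stand-in makes no attempt to keep balanced parentheses inside a URL; it simply refuses to end a match on punctuation.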

Store-agnostic REGEX Matching and Thread-safe Transactional Support in rdflib

[by Chimezie Ogbuji]

rdflib now has (checked into svn trunk) support for REGEX matching of RDF terms and thread-safe transactional support. The transactional wrapper provides Atomicity and Isolation, but not Durability: the list of reversal RDF operations is kept on the live instance, so it won't survive a system failure. The store implementation remains responsible for Consistency.
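The reversal-list idea can be sketched in a few lines. This is a toy illustration, not rdflib's implementation: `ToyStore` and `AuditableToyStore` are hypothetical names, and the single lock here only serialises writers, which is a much cruder form of isolation than a real wrapper would aim for.

```python
import threading


class ToyStore:
    """Hypothetical in-memory triple store; stands in for a real store."""

    def __init__(self):
        self._triples = set()

    def add(self, triple):
        self._triples.add(triple)

    def remove(self, triple):
        self._triples.discard(triple)

    def __contains__(self, triple):
        return triple in self._triples

    def __len__(self):
        return len(self._triples)


class AuditableToyStore:
    """Proxy that records how to undo each change made through it."""

    def __init__(self, store):
        self.store = store
        self._reversals = []            # kept in memory only: no Durability
        self._lock = threading.RLock()  # serialises writers

    def add(self, triple):
        with self._lock:
            if triple not in self.store:
                # An effective add is undone by removing the triple again.
                self._reversals.append((self.store.remove, triple))
            self.store.add(triple)

    def remove(self, triple):
        with self._lock:
            if triple in self.store:
                # An effective remove is undone by adding the triple back.
                self._reversals.append((self.store.add, triple))
            self.store.remove(triple)

    def commit(self):
        with self._lock:
            self._reversals.clear()

    def rollback(self):
        with self._lock:
            # Replay the undo operations, newest first.
            while self._reversals:
                undo, triple = self._reversals.pop()
                undo(triple)

    def __len__(self):
        return len(self.store)


store = ToyStore()
tx = AuditableToyStore(store)
tx.add(("s", "p", "o"))
tx.commit()                 # the first triple is now permanent
tx.add(("s2", "p", "o"))
tx.remove(("s", "p", "o"))
tx.rollback()               # undoes both uncommitted changes
print(len(tx))              # → 1
```

Because the undo log lives on the proxy object, a crash between a change and its commit loses the log, which is exactly why Durability is not on offer.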

The REGEX wrapper provides a REGEXTerm which can be used in any of the RDF term 'slots' with:

It replaces any REGEX term with a wildcard (None) and performs the REGEX match after the query invocation has been dispatched to the store implementation it wraps.
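That two-step behaviour (wildcard first, filter afterwards) might look roughly like the sketch below. Everything in it is an assumption for illustration, not rdflib code: `RegexTerm`, `ToyStore` and `RegexMatchingToyStore` are hypothetical names, and the triples are made-up data.

```python
import re


class RegexTerm:
    """Hypothetical marker for a term slot that holds a regular expression."""

    def __init__(self, pattern):
        self.compiled = re.compile(pattern)


class ToyStore:
    """Hypothetical in-memory store; None in a slot matches anything."""

    def __init__(self, triples):
        self._triples = list(triples)

    def triples(self, pattern):
        for triple in self._triples:
            if all(p is None or p == t for p, t in zip(pattern, triple)):
                yield triple


class RegexMatchingToyStore:
    """Proxy that adds regex matching on top of a store that lacks it."""

    def __init__(self, store):
        self.store = store

    def triples(self, pattern):
        # Step 1: swap every regex term for the wildcard before dispatching.
        plain = tuple(None if isinstance(p, RegexTerm) else p for p in pattern)
        for triple in self.store.triples(plain):
            # Step 2: apply the regex terms to what the store returned.
            if all(
                not isinstance(p, RegexTerm) or p.compiled.match(t)
                for p, t in zip(pattern, triple)
            ):
                yield triple


store = ToyStore([
    ("urn:ex:chimezie", "urn:ex:name", "Chimezie"),
    ("urn:ex:uche", "urn:ex:name", "Uche"),
])
proxy = RegexMatchingToyStore(store)
matches = list(proxy.triples((RegexTerm(".*zie$"), None, None)))
print(matches)
# → [('urn:ex:chimezie', 'urn:ex:name', 'Chimezie')]
```

The wrapped store never sees a regular expression, which is what makes the approach store-agnostic; the cost is that the store may return many triples that the filter then discards.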

Both wrappers are meant to work with a live instance of an RDF Store, each behaving as a proxy for that store (providing REGEX and/or transactional support).

For example:

from rdflib.Graph import ConjunctiveGraph, Graph
from import REGEXTerm, REGEXMatching
from import AuditableStorage
from import Store
from rdflib import plugin, URIRef, Literal, BNode, RDF

store = plugin.get('IOMemory', Store)()
regexStorage = REGEXMatching(store)
txRegex = AuditableStorage(regexStorage)
print len(g),[t for t in g.triples((REGEXTerm('.*zie$'),None,None))]
print len(g),[t for t in g]

Results in:

492 [(u'', u'', u''), (u'', u'', u''), (u'', u'', u'QQxcRclE1'), (u'', u'', u''), (u'', u'', u'')]
0 []


via Copia