Finding URLs in plain text

John Gruber put in some good work to derive and test a regex to extract URLs from plain text.

"An Improved Liberal, Accurate Regex Pattern for Matching URLs"

I needed to use it today and found that translating it for use in Python takes a bit of care, especially with regard to its Unicode characters. Here is my Python version, with a super-simple harness to use Gruber's test page:

I'm not entirely sure I've translated the original with 100% fidelity, but this has worked fine for my purposes. I'm open to tweaks or suggestions, and will keep the Gist updated.
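
For illustration only, here is a minimal sketch of the general shape of such a translation. It is not the Gist and it is not Gruber's full pattern: `URL_PATTERN` and `find_urls` are hypothetical names, and the regex is a deliberately simplified stand-in. It assumes Python 3, where the `re` module understands `\uXXXX` escapes inside a raw string, which is one way to handle the curly quotes and guillemets that a URL should not end on.

```python
import re

# Simplified, hypothetical pattern for illustration; NOT Gruber's full regex.
# It does not handle balanced parentheses or bare domains such as bit.ly/foo.
URL_PATTERN = re.compile(
    r"""
    \b
    (?:https?://|www\.)     # assumed prefixes: http://, https:// or www.
    [^\s<>]*                # URL body: anything but whitespace and angle brackets
    # Final character: must not be trailing punctuation, including the
    # Unicode guillemets and curly quotes, written here as \uXXXX escapes.
    [^\s<>`!()\[\]{};:'".,?\u00ab\u00bb\u201c\u201d\u2018\u2019]
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of text that the simplified pattern matches."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return URL_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "Visit \u201chttp://example.com/a_b\u201d or www.example.org, ok?"
    for url in find_urls(sample):
        print(url)
```

The point of the sketch is the last character class: trailing punctuation is excluded so that a URL at the end of a sentence or inside quotation marks is extracted cleanly. Anything beyond that (parenthesised paths, other schemes) is left to the full pattern.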

Uche Ogbuji

(Note that the Copia blog historically has many posts by my brother, Chimezie Ogbuji. Those should be marked with his byline.) I'm a Nigerian-American entrepreneur, software engineer and writer who lives near Boulder, Colorado with my wife, three sons and daughter. I studied Electronic Engineering at The University of Nigeria at Nsukka, and Computer Engineering at the Milwaukee School of Engineering. I was co-founder of Fourthought, Inc. in 1998, which I ran until 2007, when I co-founded Zepheira. I'm also a poet and editor (Kin Poetry Journal & The Nervous Breakdown). In my spare time I train in AKKI Kenpo, skateboard, snowboard, and play and coach soccer.
