Putting up with Javascript

Sylvain Hellegouarch wrote in "Client side application frustration"

I don't want to start a flame or whatever so I'll only say that Javascript is far from being THE solution for client-side applications IMO. Don't get me wrong, the language itself is interesting and powerful enough for most things you might want to do, but the environment (the browsers) in which it sits is not quite there yet.

It looks like Javascript has never been seen as something browser vendors should pay much attention to. So they did implement what the ECMA group had specified, but with variations that don't allow the developer to ensure everything will work the same way. For too long Javascript was seen as a toy for doing rollover menus or spamming you with popups all over the place.

He was trying to write a Weblog engine with a lot of the code in the browser and ran into endless portability problems, as well as security concerns and lack of friendly developer tools.

He also tried looking beyond Javascript.

Then I decided to look at SVG, XForms and the WHATWG effort, but again I felt frustrated. Those are very well specified but have such poor support (or none) among browsers that they are almost unusable. Should I go back to Java applets? sigh

And he finishes with a question:

So my question now is about the future of client-side development. We do have a lot of different choices, but there is no clear vision for the average Joe like me on which one to focus, on how they will cooperate in the future, or on what the support among browsers will be. Maybe people in the know could clarify things?

Kurt Cagle took up the question in his piece "Is Javascript Dead?", providing some historical perspective and exploring some code patterns as a way to illustrate the language's flexibility.

Javascript has traditionally been considered a lightweight language - useful for web pages, but not something appropriate for larger scale applications. There is some justification for this viewpoint - Javascript is almost invariably interpreted, meaning that it cannot optimize content cleanly prior to running it. While it does have classes (in the sense of functional prototypes), Javascript has no obvious concept of a package, which means that higher-order organization can prove complicated and likely to impinge upon namespaces. There are, however, ways around this.

Kurt goes on to show an example of Javascript's elegance by turning what most people write as gruesome form-element processing logic completely inside out, and mapping events quite naturally to classes and functions. He concludes:

I think the final point of all of this is that, far from being too primitive for use with existing technology, Javascript is in point of fact still very relevant, and if anything is finally reaching a point where it can hold its own against just about every other programming language out there, at least within the domain of the manipulation of DOMs (or XML). Given the incipient emergence of E4X, I fully expect Javascript to become one of the predominant languages on the planet.

Kurt doesn't touch on Sylvain's concerns about portability, security and development environment, and those are still matters that need serious discussion (for one thing, they are issues that have always interfered with my productivity in Web development). I see Kurt's post as a bit of advocacy to underscore for developers why some of the pain is worth enduring. I look forward to more posts on the topic.

[Uche Ogbuji]

via Copia

Beyond HTML tidy, or "Are you a chef? 'Cause you keep feeding me soup."

In my last entry I presented a bit of code to turn the Amara XML toolkit into a super duper HTML slurper, creating XHTML data binding objects. Tidy was the weapon. Well, y'all readers wasted no time pimping me the Soups. First John Cowan mentioned his TagSoup. I hadn't considered it because it's a Java tool, and I was working in Python. But I'd ended up using Tidy through the command line anyway, so TagSoup should be worth a look.

And hells yeah, it is! It's easy to use, mad fast, and handles all the pages that were tripping up Tidy for me. I was able to very easily update Amara's tidy.py demo to use TagSoup, if available. Making it available on my Linux box was a simple matter of:

wget http://mercury.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0rc3.jar
ln -s tagsoup-1.0rc3.jar tagsoup.jar

That's all. Thanks, John.
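For the curious, the gist of that update is just shelling out to the jar and capturing the XHTML that TagSoup writes to stdout. Here's a minimal sketch of the approach (the helper name is mine, not the demo's, and it assumes java and tagsoup.jar are reachable from the working directory):

import subprocess

def tagsoup_to_xhtml(path, jar='tagsoup.jar'):
    #TagSoup reads sloppy HTML and writes well-formed XHTML to stdout
    pipe = subprocess.Popen(['java', '-jar', jar, path],
                            stdout=subprocess.PIPE)
    xhtml, _ = pipe.communicate()
    return xhtml

From there the result can go to any tool that insists on well-formed XML, which is the whole point of the exercise.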

Next up, Dethe Elza asked about BeautifulSoup. As I mentioned in "Wrestling HTML", I haven't done much with this package because it's more of a pull/scrape approach, and I tend to prefer having fully cleaned-up XHTML to work with. But to be fair, my last extract-the-mp3-links example was precisely the sort of case where pull/scrape is OK, so I thought I'd get my feet wet with BeautifulSoup by writing an equivalent to that code snippet.

import re
import urllib
from BeautifulSoup import BeautifulSoup

url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
stream = urllib.urlopen(url)
soup = BeautifulSoup(stream)
#Find all anchors whose href ends in ".mp3"
for incident in soup('a', {'href': re.compile(r'\.mp3$')}):
    print incident['href']

Very nice. I wonder how far that little XPath-like convention goes.
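Pretty far, from what I can tell. The attribute dict can mix literal strings and compiled regular expressions, so something like the following should narrow the search further (the class value here is hypothetical, just to show the shape of the call):

#Anchors that end in ".mp3" and also carry a particular class
for link in soup('a', {'href': re.compile(r'\.mp3$'),
                       'class': 'playlist-item'}):
    print link['href']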

In a preëmptive move, I'll mention Danny's own brand of soup, psoup. Maybe I'll have some time to give that a whirl soon.

It's good to have alternatives, especially when dealing with madness on the order of our Web of tag soup.

And BTW, for the non-hip-hop headz, the title quote is by the female player in the old Positive K hit "I Got a Man" ("What's your man gotta do with me?"):

I gotta ask you a question, troop:
Are you a chef? 'Cause you keep feeding me soup.

Hmm. Does that count as a Quotīdiē?

[Uche Ogbuji]

via Copia

Use Amara to parse/process (almost) any HTML

It's always nice when a client obligation indirectly feeds a FOSS project. I had some Web scraping to do recently while doing the day job thingie. As with most people who do what I do, it's a common task these days, and I'd already done some exploring of the Python tool base for this in "Wrestling HTML". In that article I touched on tidy and its best-known Python wrapper, uTidyLib. One can use these to turn zany HTML into fairly clean XHTML. In this most recent task, however, I had a lot of complex processing to do with the resulting pages, and I really wanted the flexibility of Amara Bindery, so I cooked up some code (much simpler than I'd expected) to use the command-line tidy program to turn arbitrary Web pages into XHTML in the form of Amara bindery objects.

I just checked this code in as an Amara demo, tidy.py. As an example of its usage, here is a bit of Python I wrote to list all the mp3 links from a given Web page (for easy download with wget):

from tidy import tidy_bind_url #needs tidy.py Amara demo
url = "http://webjay.org/by/chromegat/theclassicnaijajukebox2823229"
doc = tidy_bind_url(url)
#Display all links to mp3s (by file extension check)
for link in doc.xml_xpath(u'//html:a[@href]'):
    if link.href.endswith(u'.mp3'):
        print link.href

The handy thing about Amara, even in this simple example, is how I was able to take advantage of the full power of XPath for the basic query, and then shunt in Python where XPath falls short (there's a starts-with function in XPath 1.0 but for some reason no ends-with). See tidy.py for more sample code.
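For what it's worth, ends-with can be faked in pure XPath 1.0 with substring and string-length, though I find the Python check above easier on the eyes. A sketch of the all-XPath version (same doc object as above; the expression is standard XPath 1.0, so it should work anywhere the query above does):

#Take the last four characters of @href and compare them to ".mp3"
EXPR = u'//html:a[substring(@href, string-length(@href) - 3) = ".mp3"]'
for link in doc.xml_xpath(EXPR):
    print link.href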

Tidy does choke on some abjectly broken HTML pages, but it has done the trick for me 90% of the time.

Meanwhile, I've been meaning to release Amara 1.0. I haven't needed to make many changes since the most recent beta, and it's pretty much ready (and I need to get on to some cool new stuff in a 2.0 branch). A heavy workload has held me up, but perhaps this weekend.

[Uche Ogbuji]

via Copia

OPML, XOXO, RDF and more on outlining for note-taking

There has been a lot of good comment on my earlier entry on outline formats in XML. I've been pretty busy the past week or so, but I'd better get my thoughts down before they deliquesce.

Bob DuCharme pointed me at Micah's article which includes mention of XOXO. Henri Sivonen asked what I think of it.

Taking the name "outlining format" literally, it's actually just fine. As Micah says:

Some people might feel warmer and fuzzier with elements named outline, topic, item, and so on, or with elements in a freshly minted namespace, but microformats can still claim the semantic high ground, even when reusing XHTML. In the above, the parts of an outline are ordered lists and list items, exactly as the XHTML element names say.

The problem is that I started looking into outlining formats because I'd heard from others that they make such a great format for organizing a personal information space, and XOXO is just about useless in that regard.

Along that vector, I wonder what a pure outline format is useful for, anyway? I can't remember having ever needed a stand-alone outline document separate from what I'm outlining. If I'm writing a presentation or a long article, I'd prefer to have the table of contents or presentation outline section generated from the titles and structure of the full work. Sure, XOXO might be suitable for such a generated outline, but my exploration is really about hand editing.

In short I think XOXO is just fine for outlining, and yet I can't imagine when I'd ever use it. As others have mentioned, and as I suspected, the entire idea of outlining formats for general note-taking is a big stretch. Danny Ayers mentioned in a comment on the earlier entry that for some people the attraction to OPML is a matter of neat outlining UIs. I've always been conservative in adopting UIs. I use emacs plus the command line for most of my coding, and after trying out a half dozen blog-posting tools for posting to Copia, I ended up writing an e-mail-to-post gateway so that I can enter text (Markdown) into a UI I'm already familiar with, Evolution's e-mail composition window.

As I said in the earlier entry, full-blown XHTML 2.0 makes more sense than an outlining format for managing a personal information space, and yet it seems too weak to me for this purpose. The weakness, as Danny points out, is semantic. If everything in my personal information space is just a para or an anchor or a list, I'll quickly get lost. As followers of Copia know, my brain is a rat trap of wandering thoughts, and I'm a poster child for the need for clearly expressed semantics.

As an RDF pioneer, I'm happy to use ideas from RDF, but I do not want to type RDF/XML by hand. I've always argued, as Danny Ayers hinted, that RDF should strive hard to be syntax-agnostic, especially because RDF/XML is awful syntax. I agree with him that GRDDL is a good way to help rescue XHTML microformats from their semantic soup, and I think this is a better approach than trying to shovel all the metadata into the XHTML header (Dan Brickley mentions this possibility, but I wonder whether he tends to prefer it to GRDDL). GRDDL has a natural draw for me since I've been working with and writing tools for the XML+XSLT=RDF approach for about four years. But when I'm using markup for markup (e.g. in a personal information space) I'd rather have semantic transparency fitting comfortably within the markup, rather than dangling off it as an afterthought. In a nutshell, I want to use the better markup design of:

<to-do>

rather than the kludge of:

<ul class="to-do">
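Markup like that latter form is what GRDDL-style extraction has to rescue. Here's a minimal sketch of the idea in Python, with no pretense of being the actual GRDDL machinery (which is transform-based); the vocabulary URI and the hasItem property are made up for illustration:

#Sketch: scrape class-based "microformat" markup into crude triples
#The vocabulary URI and "hasItem" property below are hypothetical
from xml.dom import minidom

XHTML_NS = u'http://www.w3.org/1999/xhtml'
TODO_LIST = u'http://example.org/vocab#to-do'

doc = minidom.parseString(
    '<ul xmlns="http://www.w3.org/1999/xhtml" class="to-do">'
    '<li>Release Amara 1.0</li><li>Get back to Akara</li></ul>'
)

triples = []
for ul in doc.getElementsByTagNameNS(XHTML_NS, u'ul'):
    if ul.getAttribute(u'class') == u'to-do':
        for li in ul.getElementsByTagNameNS(XHTML_NS, u'li'):
            text = u''.join(node.data for node in li.childNodes
                            if node.nodeType == node.TEXT_NODE)
            triples.append((TODO_LIST, u'hasItem', text))

for triple in triples:
    print triple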

I think there's little excuse for why we don't have the best of both worlds. People should be able to enjoy the relative semantic cleanliness of RDF, but within the simplest constructs of markup, without having to endure the added layer of abstraction of RDF. That added layer of abstraction should only be for those aggregating models. The fact that people would have to pay the "RDF tax" every time they start to scribble in some markup explains why so many markup types dislike RDF. I'm not sure I've found as clear a case for this point as this discussion of extended uses for outlining formats.

Microformats are generally a semantic mess, from what I've seen of them. They do best when they just borrow the semantics of existing formats, as XOXO does, but I think they're not the lightweight-syntax-plus-clean-semantics solution that the GRDDL pioneers hope for. GRDDL has too much work to do in bringing the rigor of RDF to microformats, and this work should be part of the format itself, not something bolted on like GRDDL. I think the missed opportunity here is that XML schema systems cling so stubbornly to syntax-only-syntax. As I've been exploring in articles such as "Use data dictionary links for XML and Web services schemata" (I have a more in-depth look at this in the upcoming Thinking XML article), one can make almost all the gains of RDF by putting the work into the XML schema, rather than heaping the abstraction directly into the XML format. And the schema is where such sophistication belongs.

But back to outlining and personal information spaces: I've tried the personal Wiki approach, and it doesn't work for me. Again Danny nails it: Wiki nodes and links are untyped. This is actually similar to the problem that I have with XHTML, but Wikis are even more of a semantic shambles. In XHTML there is at least a bit of an escape with class="foo". The difficulty of navigating and managing Wikis increases at a much greater rate than their information content, in my experience. My Akara project was in effect an attempt at a more semantically transparent Wiki, but since I wrote that paper I've had almost no time for Akara, unfortunately. I do plan to make it the showcase application for my vision of 4Suite 2.0, and in doing so I have an ally in Luis Miguel Morillas, so there is still hope for Akara, perhaps even more so if I am able to build on Rhizome, which might help eliminate some wheel reinvention.

[Uche Ogbuji]

via Copia

Flickr, creative commons and "borrowing" works

I happened to stumble across a very interesting thread on the Flickr board: "HELP: Somebody's using MY pictures in HIS Photostream!!!". The thread opens up all sorts of pitfalls I'd never even thought of with Weblogging, shared photos, Creative Commons, etc.

As best I can summarize the situation, the most salient facts are:

  1. UserA complained that UserB uploaded a couple of UserA's photos into UserB's account ("photostream") without UserA's permission.
  2. UserA has placed the photos under Attribution-NonCommercial-NoDerivs.
  3. UserB did not provide attribution to UserA.
  4. UserA does not want UserB to host these pictures under UserB's account, regardless of attribution.
  5. If UserB were to provide attribution, because of the many ways people can link to or embed Flickr photos, it's possible that the attribution will not be apparent in certain normal usage.

Some less salient, but interesting facts:

  10. UserA is distressed and asked for help, officially and unofficially, partly on how to deal with the unwanted usage and partly for a better understanding of policy and convention.
  11. UserA says that English is not his first language, and he doesn't necessarily understand everything that's going on.
  12. UserA feels that UserB is being "a bit of a jerk" about the entire situation, but UserB has consented to update with attribution. [Note the drift here from fact to characterization: based on UserB's own comments (he appears in the thread later on), I think the problem is misunderstanding followed by standard flame-war social dynamics.]

The most striking thing here is the contradiction between Facts (2) and (4). It seems that UserA does not understand that in using this particular CC license, he loses control over (4), provided that licensees adhere to its terms (which according to (12) should be the case by now). Is there a problem with licensors nonchalantly choosing CC licenses without considering or understanding all the ramifications of these licenses? (I'm sure there is always going to be some degree of such problems, but how widespread are they among the many people who are not used to thinking in terms of copyright licensing?)

Is there a problem with the fact that Flickr makes it easy to tag photos with attribution-required licenses, even though, given the way the service is structured, it cannot really enforce or ensure attribution (5)?

To what extent are CC licenses applied through technological metadata-tagging means legally trumped by the licensor's separately and explicitly stated wishes? In this case, UserA has tagged his photos with a license, but has also informally expressed a desire for more stringent restrictions than those expressed in the tags. Which restrictions are legally enforceable in this case?

One matter that came up is pretty much an entirely separate discussion: it seems that Flickr purposely does not allow people to mark photos as public domain, for reasons that seem a bit fuzzy to me.

Based on their responses in this thread, Flickr staff don't know the answers to such questions any more than I do. Some seemed remarkably muddled in their responses, and some gave responses that I think are plain wrong. Most of these discrepancies do get hashed out in the thread, though, and I hasten to add that, from what I read, Flickr staff seem genuinely concerned about sorting this all out. I suspect that the entire situation just dredges up a ton of issues regarding the intellectual commons that no one really fully understands yet.

[Uche Ogbuji]

via Copia

Wow! XML formats for outlining are complete rubbish

There's no other way to put it than the title. I'd heard a lot about OPML (and had used it blindly as an RSS feed exchange format), but Ian Forrester's comment that he uses OPML for all his personal note-taking finally pushed me to look seriously at the format. It is just complete and utter garbage. OPML might possibly be the best example of the worst abuses of XML markup. It's really hard to fully express how horrible OPML is, and there is no way in the world that I'll ever be dealing with it directly. The language that I use as a hub for my personal information space needn't be perfect, but it can't make me gag at every other tag.

I looked a bit further and found OML. It's a reaction to the ugliness of OPML, so I expected it would be the ticket for me. It does partially fix perhaps the most immediate and visceral abomination of OPML, the abuse of attributes for prosaic textual content (in OPML, every node's prose is crammed into an outline element's text attribute), although why it doesn't completely eliminate the text attribute in favor of a title child element is beyond me. But it leaves a lot of nastiness and introduces some of its own (the idea of item as a generic extensibility element is hugely ill-begotten). OML isn't even widely used enough to make it worth compromising and dealing with its flaws. I think I'll consider creating my own language. I can export to OPML via XSLT when I really have to. But I think I can use some of the fixes in OML as a starting point.

"Sharing, the web way", by Danny Ayers is a good outline [n/m] of the horrors of OPML. He does get into the politics as well, which I think are less important to me than the technical flaws. He does state what has always been my reaction to the OPML hype:

But more and more I'm thinking things like blogrolls or whatever are much better handled using something more specific - XBEL or simply (X)HTML (like Netscape bookmarks).

Of course I'd plump for XBEL, but this expresses my general viewpoint. I wanted to look at outlining formats because so many people go on about outline editors and formats as productivity tools, and I want to be sure I'm not missing anything. Based on what I've found so far, I'm really confused about what people are gaining in this space.

Danny goes on to say:

If what you want is versatility and to be able to combine material from different sources and of potentially different data types, then you really do need something like RDF - the glue of OPML isn't strong enough.

I think that O*ML is just one of those examples that illustrate the limits of RDF. RDF is not the best model for content with a significant prose quotient, and RDF/XML is not the best syntax. I think that a format I came up with would have the following characteristics:

  • Lightweight overall framework using other formats such as XBEL and XHTML 2.0 to do the heavy lifting
  • Sound XML design overall
  • Metadata sections that can readily be mapped to the RDF model without using RDF/XML directly

Where would that take me that no one else has already charted? Maybe nowhere. I certainly wouldn't plan to spend much time on it (and I might not even bother at all). It's just a bit of exploration suggested to me by what I've found poking around what's there already.

[Uche Ogbuji]

via Copia

"Untangle URIs, URLs, and URNs"

"Untangle URIs, URLs, and URNs"—Dan Connolly

In information management, persistence and availability are in constant tension. This tension has led to separate technologies for Uniform Resource Names (URNs) and Uniform Resource Locators (URLs). Meanwhile, Uniform Resource Identifiers (URIs) are designed to serve as both persistent names and available locations. This article explains how to use the current URI standards with XML technologies, gives a history of URNs and URLs, and provides a perspective on the tension between persistence and availability.
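To give a quick flavor of the distinction (my example, not from the article): a URN names a thing without promising where to find it, a URL tells you where to fetch it, and both are forms of URI.

urn:isbn:0-395-36341-1         (a URN: a persistent name for a book, silent on location)
http://www.w3.org/TR/webarch/  (a URL: a location you can actually dereference)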

I suggested this article idea to John Swanson, the IBM developerWorks XML zone editor, though I thought that Connolly would be a much better person to write on the topic than I would. Nice to see it all published, and I'm glad to have a good resource to share with people when I'm asked the perennial question "what's the difference between a URL and a URI?"

Chimezie Ogbuji

via Copia

Lektora is nice, but not nice enough

Lektora, a browser-based feed reader, is nice, and for a moment I considered buying the commercial upgrade, not because I have a problem with the ads (I don't notice them), but to support good software. And then I started noticing missing feeds. I imported my entire OPML from straw, and it turns out that Lektora never presents me with items from a lot of those blogs. Some of the missing ones I've noticed:

There's nothing I can tell in common among these feeds. One is Atom, two are RSS 1.0, and the other is RSS 2.0. But Lektora apparently can't handle them, and I have to use straw to keep up with them.

Also, the Lektora developers seem to do a good job keeping the Windows and Mac versions up to date, but don't bother much with the Linux version. I'm not willing to pay for half-hearted Linux support.

[Uche Ogbuji]

via Copia