Copia housecleaning

I've finally had some time today, as I prepare for the holidays, to fix some things on Copia that have been broken for too long. Some of the highlights, especially concerning issues mentioned by readers (thanks, guys), are:

RSS 1.0 feed body fix. Added an rss:description field to the RSS 1.0 feed, which fixes missing post bodies in readers such as Bloglines that don't support content:encoded. I do truncate the field to 500 characters, per the recommendation in the spec.

Single entry view title fix. Added entry titles for single-entry pages. Before today, if you viewed this entry through its perma-link, the title would just say "Copia"; now it says "Copia ✏Copia housecleaning". I've wanted to do this for a while, but I was having the devil of a time figuring out how to do it with PyBlosxom. A scolding from Dan Connolly forced me to chase down a fix. For other PyBlosxom users, the trick is to use the comments plug-in, copy the head.* flavor file to comment-head.*, and then update the copy to use the $title variable, which is the title of the entry itself ($blog_title is the title of the entire blog). In my case the updated HTML header template looks like:

<title>$blog_title &#x270F;$title</title>

I did get a report that Copia is incorrectly sending a Content-Type header of text/html;charset=ISO-8859-1, but when I check using the LiveHTTPHeaders extension for Firefox on Linux it reports the correct charset=UTF-8 from the server. If anyone else can corroborate this issue, please leave a comment with the specific URL on which you noticed the error, your platform and browser, and the HTTP sniffing tool you were using. Thanks.
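
If you want to double-check with something other than a browser extension, a quick Python sketch like this will show exactly what the server sends back (the host and path below are placeholders, not the actual Copia URLs):

import http.client

# Ask the server for just the headers of the page in question
conn = http.client.HTTPConnection("example.org")
conn.request("HEAD", "/weblog/some-entry")
response = conn.getresponse()
print(response.status, response.getheader("Content-Type"))
# A correct response would show something like: 200 text/html; charset=UTF-8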

[Uche Ogbuji]

via Copia

Ouch. I feel your pain, Sam

Seems Sam Ruby's presentation suffered a bit in the spectacle. Unfortunately, I'm no stranger to presentation set-up problems. I've also been lucky enough to have patient audiences. Maybe conference organizers will someday factor Linux A/V support into consideration when choosing venues (I can dream, eh?). I can almost always use projectors and the like with no problem in usual business settings, so I can only guess that conference venues tend to have archaic A/V technology that doesn't like Linux.

As for the presentation itself, based on the slides much of it is an accumulation of issues probably well known to, say, a long-time XML-DEV reader, but useful to collect in one place. It looks like a much-needed presentation, and I hope Sam gets to present it again, with better luck with the facilities. Here follow a few reactions I had to stuff in the slides.

expat only understands utf-8

This hasn't been true for ages. Expat currently understands UTF-8, UTF-16, ASCII, and ISO-8859-1 out of the box, and the user can add to this list by registering an "unknown encoding" event handler.
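
You can see this for yourself from Python's expat binding. Here is a minimal sketch (not a thorough test of expat's encoding support) that parses a document declared and encoded as ISO-8859-1 and gets back correctly decoded character data:

import xml.parsers.expat

# A document that is genuinely ISO-8859-1, not UTF-8
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>déjà vu</p>'.encode("iso-8859-1")

def char_data(text):
    # expat hands back correctly decoded text: 'déjà vu'
    print(repr(text))

parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = char_data
parser.Parse(doc, True)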

Encoding was routinely ignored by most of the initial RSS parsers and even the initial UserLand RSS validator. “Aggregators” did the equivalent of strcat from various sources and left the results for the browser

Yuck. Unfortunately, I worry that Mark Pilgrim's Universal Feed Parser might not help the situation, with its current practice of returning some character data as plain strings without even a guess at the encoding (that I could find, anyway). I found it very hard to build a character-correct aggregator around the Feed Parser 4.0 alpha version. Then again, I understand it's a hard problem with all the character soup ("char soup"?) Web feeds out there.
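
To be fair, the Feed Parser does at least report what it decided about the feed's overall encoding, so you can check its conclusions yourself. A rough sketch (feedparser is a third-party package, and the URL is a placeholder):

import feedparser

d = feedparser.parse("http://example.com/weblog/atom")
print(d.encoding)   # the encoding feedparser settled on for the feed
print(d.bozo)       # non-zero if the feed was not well-formed or was mis-declared
for entry in d.entries:
    # At the time, checking the actual type of these values was part of the pain
    print(type(entry.get("title")), entry.get("title"))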

[Buried] in a non-normative appendix, there is an indication that the encoding specified in an XML document may not be authoritative.

Nope. There is no burial going on. As I thought I'd pointed out on Copia before (but I can't find the entry now), section "4.3.3 Character Encoding in Entities" of XML 1.0 says:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

So the normative part of the spec also makes it quite clear that an externally specified encoding can trump what's in the XML or text declaration.
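
Put in code terms, the precedence the spec describes comes down to something like this sketch (my own simplification, not anybody's official algorithm):

import codecs
import re

def pick_encoding(raw, transport_charset=None):
    # An externally supplied charset (e.g. from HTTP Content-Type) is authoritative
    if transport_charset:
        return transport_charset
    # Otherwise fall back to the Byte Order Mark or the encoding declaration
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    if m:
        return m.group(1).decode("ascii")
    # Failing all of that, UTF-8 is the default
    return "utf-8"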

The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata.

Very apt. I think that's why XML's attributes work as well as they do (despite the fact that they are so inexplicably maligned in some quarters).

In fact, Microsoft’s new policy is that they will always completely ignore [HTTP Content-Type] for feeds—even when the charset is explicitly present

XML of course doesn't force anyone to conform to RFC 3023, but Microsoft could prove itself a really good Web citizen by adopting it. Maybe they could lead the way to reducing the confusion I mention in this entry.

I consider Ruby's section on the WS-* mess to be an excellent indictment of the silly idea of universal and strong data typing.

In general, programming XML is hard.

Indeed it is. Some people seem to think this is a product of architecture astronautics. They are laughably mistaken. XML is hard because managing data is hard. Programmers have developed terrible habits through long years of just throwing their data over the wall at a SQL DBMS and hoping all goes OK in the end. The Web is ruthless in punishing such diffidence.

XML is the first technology that has forced mainstream programmers to truly have to think hard about data. This is a boundlessly good thing. Let the annoyances proliferate (that's your cue, Micah).

[Uche Ogbuji]

via Copia

Suspicions confirmed, so I'm ditching Rojo

Rojo has a pretty nice UI, but I always had a nagging suspicion it was showing me only a small number of my feeds. I finally got time to check on that this morning. I loaded Liferea (Aristotle's suggestion) with the same feed list I used for Rojo. I found numerous feeds with recent entries that Rojo wasn't showing me. The worst part of it is that Rojo gave no indication that it was having trouble with some feeds (I did notice that the unread count in the left-hand bar didn't match the number in the folder newspaper view, but I could never figure out how to get at exactly what that meant). Liferea is a bit clunky, but at least I don't miss entries using it. I guess I'll use it until I find something better.

[Uche Ogbuji]

via Copia

XOXO versus Atom versus XBEL for Web feed lists?

Very useful response to my quest for XOXO understanding. Wow. In fact, I've had so much good discussion here and on the #atom IRC channel that I'm not sure where to start.

I'll start with what my current leaning is, after all that discussion. I think I want to use XBEL for managing my bookmarks, and I'll advocate the necessary extensions, and hope they are as well received as XBEL itself has been.

On to my general thoughts.

XOXO. I was pointed to a thread in which my use of rel="webfeed" was rightly deprecated. I was just grasping for something appropriate in XHTML space, and I don't think there really is one. I got into an argument about whether type="application/atom+xml" is the way to go, and I'm still not fully convinced it is. The way I think about it is that for each item I have zero or more links I want to consider human-readable content links, and zero or more I want to consider feed links. It's my arbitrary decision which is which, but media type alone doesn't tell the tale. As an example, what if someone has a Weblog that is served as custom XML with XSLT to render it in the browser? The media type would be application/xml and yet it would be a content link. I see the weakness in my own argument: after all, a user agent can also render an Atom document nicely for a human reader, so the distinction I'm trying to make might be an impossible one regardless of method. Maybe I'll have to cave in to the "use type" argument. I'll try it out in this entry.

So back to my first format example for XOXO. If I throw in examples of "folders", I come up with:

<ol class="xoxo">
  <li>
    <p>Technology</p>
    <ol>
      <li>
        <ul>
          <li>
            <a href="http://example.com/weblog/" type="text/html">Weblog home</a>
            <a href="http://example.com/weblog/atom" type="application/atom+xml">Weblog feed</a>
            <dl>
              <dt>description</dt>
              <dd>That good ole Weblog</dd>
            </dl>
          </li>
        </ul>
      </li>
    </ol>
  </li>
</ol>

Umm. That's a round mouthful of tags. I'm not sure I really want to have to squint at that. I also don't like how <description>...</description> becomes <dl><dt>description</dt><dd>...</dd></dl>. That's too much markup indirection for me. Even worse than the usual reductio ad absurdum of <element name="description">....

Atom. Aristotle and then Mark suggested using Atom for Web feed list exchange. It's one of those "DUH!" moments. I can't believe it didn't occur to me. The main problem I have with Atom is that it really only offers one level of hierarchy: feed/entry. My Web feed list is hierarchical. The ready solution is to use categories to simulate hierarchies.

<entry>
  <id>http://example.com/weblog/</id>
  <updated>2005-05-23T15:38:00-08:00</updated>
  <title>Example Weblog</title>
  <link rel="self" href="http://example.com/weblog/" type="text/html" title="Weblog home"/>
  <link rel="alternate" href="http://example.com/weblog/atom" type="application/atom+xml" title="Weblog feed"/>
  <summary>That good ole Weblog</summary>
  <category term="Technology"/>
</entry>

I'd also need some way to separately express the hierarchy of categories. I might also have to use the ranking extension (see for example this article) to preserve item order, if I care about it. I don't know. This looks like a bit of a stretch. It certainly wouldn't be very friendly to edit by hand. The fact that the updated element is mandated, for one thing, tends to push Atom into machine-generated-only territory. This is OK for a lot of Atom's use cases, but I think it's a real bummer for my present one.

XBEL. XBEL is just the little XML format engine that could. Despite its great age and recent neglect, it's possibly the most widely deployed of these options, because of its use in Browser bookmark formats. Not that I think that's any reason for preferring XBEL. I do like how it looks, though:

<folder>
  <title>Technology</title>
  <bookmark href="http://example.com/weblog/">
    <title>Example Weblog</title>
    <info>
      <metadata owner="webfeed">
        <link href="http://example.com/weblog/atom" type="application/atom+xml"/>
        <description>That good ole Weblog</description>
      </metadata>
    </info>
  </bookmark>
</folder>

Even with the required info/metadata layer, it's much cleaner than the other two options. I think we could at least get native elements for alternate links and for bookmark descriptions (which I think were already proposed by others) into XBEL 1.1, so we could perhaps eliminate the info/metadata layer. Regardless, I think I'll manage things for myself in XBEL as above and just use XSLT to convert to whatever starts to emerge as a viable option for export to other tools (or as a way to export to OPML if I have to).
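
For the conversion itself I'd reach for XSLT, as I said, but the same idea in a quick Python/ElementTree sketch looks like the following. The OPML attribute names follow the common conventions I grumble about elsewhere on this page, and the input file name is a placeholder:

import xml.etree.ElementTree as ET

def xbel_to_opml(xbel_root):
    opml = ET.Element("opml", version="1.1")
    ET.SubElement(opml, "head")
    body = ET.SubElement(opml, "body")

    def walk(node, parent):
        for child in node:
            if child.tag == "folder":
                outline = ET.SubElement(parent, "outline",
                                        text=child.findtext("title", ""))
                walk(child, outline)
            elif child.tag == "bookmark":
                attrs = {"type": "rss",
                         "text": child.findtext("title", ""),
                         "htmlUrl": child.get("href", "")}
                feed = child.find("info/metadata[@owner='webfeed']/link")
                if feed is not None:
                    attrs["xmlUrl"] = feed.get("href", "")
                ET.SubElement(parent, "outline", attrs)

    walk(xbel_root, body)
    return opml

tree = ET.parse("webfeeds.xbel")  # hypothetical XBEL file laid out as above
print(ET.tostring(xbel_to_opml(tree.getroot()), encoding="unicode"))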

[Uche Ogbuji]

via Copia

I must be missing something about XOXO (and maybe microformats in general)

As I mentioned, I started working with XBEL as a way to manage my Web feeds. It occurred to me that I should consider XOXO for this purpose, since it has more traditionally been put up in opposition to OPML.

Well, I don't get it. Sure, it's simple XHTML with some conventions for overlaid semantics, but how does that do anything for interoperability of Webfeed subscription lists? I've taken a look at attention.xml, and that seems more thoroughly specified, but it's way overkill for my needs.

Look, all I need to do is represent a categorized structure of feed items with the following information per item (sketched as a simple record just after the list):

  1. Web feed URL (e.g. RSS or Atom link)
  2. title
  3. optional description or notes
  4. optional Web site URL (e.g. Link to HTML or XHTML page for Weblog)
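
In other words, the record I need per item is no more than this little sketch (the field names are mine, not from any of the formats under discussion):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FeedItem:
    feed_url: str                      # 1. Web feed URL (RSS or Atom link)
    title: str                         # 2. title
    description: Optional[str] = None  # 3. optional description or notes
    site_url: Optional[str] = None     # 4. optional Web site URL
    folders: list = field(default_factory=list)  # category path, e.g. ["Technology"]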

The trouble with OPML is that there are a dozen ways to encode these four bits of information, and as I've found, tools pretty much range all across that dozen. Besides, OPML is really poor XML design. That's a practical and not just an aesthetic concern, because I expect to manage this information partly by hand.

XBEL is much better markup design, but I don't know that it has a natural way to represent the distinction between feed and content URL for the same item ("bookmark").

Everything I heard about XOXO led me to believe that this would be a slam dunk, but hardly. The XOXO "spec" is not all that illuminating, and from what I can gather there or elsewhere, there are also a dozen ways I could encode the above information. Perhaps:

<li>
  <a href="http://example.com/weblog/" type="text/html">Weblog home</a>
  <a href="http://example.com/weblog/atom" type="application/atom+xml">Weblog feed</a>
  <dl>
    <dt>description</dt>
    <dd>That good ole Weblog</dd>
  </dl>
</li>

Perhaps (so that the likely HTML rendering is not jumbled):

<li>
  <ul>
    <li><a href="http://example.com/weblog/" type="text/html">Weblog home</a></li>
    <li><a href="http://example.com/weblog/atom" type="application/atom+xml">Weblog feed</a></li>
  </ul>
  <dl>
    <dt>description</dt>
    <dd>That good ole Weblog</dd>
  </dl>
</li>

But since Weblog contents could be XML, is it really safe to use media type as the distinguishing mark between Web site and Web feed links? OK, so perhaps:

<li>
  <ul>
    <li><a href="http://example.com/weblog/" rel="website">Weblog home</a></li>
    <li><a href="http://example.com/weblog/atom" rel="webfeed">Weblog feed</a></li>
  </ul>
  <dl>
    <dt>description</dt>
    <dd>That good ole Weblog</dd>
  </dl>
</li>

But now I've invented a relationship vocabulary (I guess this is technically my own microformat), and why would I expect another XOXO tool to use rel="website" and rel="webfeed"?

I could go on with possible variations. I do like the way that I can simply refer to the XHTML Hypertext Attributes Module to get some general ideas about semantics, but that's not really good enough because I have a fairly specific need.

I imagine someone will say that XOXO is just a general outlining format, and can't specify such things because it's all about being micro. But in that case why do people put XOXO itself forward as a solution for Webfeed corpus exchange? I can't see how XOXO can do the job without overlaying yet another microformat on it. And if we need to stack microformat on microformat to address such a simple need, what's wrong with good old macroformats: you know, a real, specialized XML format?

I've really only spent an hour or two exploring XOXO (although according to the microformats hype I shouldn't expect to need more time than that), so maybe I'm missing something. If so, I'd be grateful for any enlightening comments.

[Uche Ogbuji]

via Copia

I already said OPML is crap, right? I had to hack through another reminder today.

So today I tried to import OPML (yeah, that very OPML) into Findory (see last entry). The OPML is based on what I originally exported from Lektora and has been through all my feed experiments. A sample entry:

<outline url="http://www.parand.com/say/index.php/feed/" text="Parand Tony Darugar" type="link"/>

What does Findory tell me? 97 feeds rejected for "invalid source". Great. Now I actually have to get my hands dirty in OPML again. I check the spec. Of course there's no useful information there. I eventually found this Wiki of OPML conventions. I saw the type='rss' convention, but that didn't seem to make a difference. I also tried xmlUrl rather than url, like so:

<outline xmlUrl="http://www.parand.com/say/index.php/feed/" text="Parand Tony Darugar" type="link"/>

This time the Findory import works.

But not only do several of the feed readers I use have url rather than xmlUrl, but the XBEL-to-OPML XSLT I've found assumes that name as well. The conventions page also mentions title versus text as a way to provide formatting in some vague way, but I've seen OPML files use only title and nary a text to be seen anywhere. Besides, what's wrong with the XML way of allowing formatting: elements rather than attributes? It's enough to boil the brain.
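
If I do end up having to shuttle OPML between tools, a quick normalization pass like this Python sketch (file names are placeholders) would at least paper over the url versus xmlUrl confusion:

import xml.etree.ElementTree as ET

tree = ET.parse("feeds.opml")
for outline in tree.iter("outline"):
    # Rewrite the url= spelling to the xmlUrl= spelling some tools insist on
    if "url" in outline.attrib and "xmlUrl" not in outline.attrib:
        outline.set("xmlUrl", outline.attrib.pop("url"))
tree.write("feeds-normalized.opml", encoding="utf-8", xml_declaration=True)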

Speaking of XBEL, that's actually how I'm managing my feeds now, as I'll discuss further in the next entry. Now that Web feeds have become important to me I'll be using a sane format to manage them, thank you very much. I'll do the XSLT thing to export OPML for all the different tools that insist on tag soup. That is, of course, if I can figure out what precise shade of OPML each tool insists on. Today's adventure with feed URL attributes makes me wonder whether there is any escaping the chaos.

[Uche Ogbuji]

via Copia

Still looking for a feed reader, perhaps

Earlier I trawled around for a new way of reading my Web feeds. Readers were kind enough to mention Rojo in comments, and I've been using it ever since, but I'm not so sure anymore. It's very nice, but there are a couple of UI nits, and I have a sneaking suspicion it's not showing me all new stories. I'm not set on ditching it yet, but I'm looking around again. I found the very useful resources "1 week comparison: SearchFox, Feedster, Pluck, Bloglines, Rojo, and NewsGator" and the follow-up "3 week shakedown, 2 RSS readers remain". Their author is leaning towards SearchFox (which a Copia reader had mentioned earlier) and Rojo. Someone in his comments mentioned Findory as well. I went to www.searchfox.com and signed up, but I guess that's the wrong site. I went to rss.searchfox.com, but you have to e-mail them to get a beta account. I'll give my initial impressions of SearchFox if and when I get an account. As for Findory, the first issue is that I couldn't figure out how to import OPML. I went back to the comment from which I learned about it and found that the key URL is http://findory.com/s/, but I don't think this is very clear on their Web site.

After some OPML silliness (more on that later) I completed the import and found that clicking the "favorites" link in the upper right hand side is the key to focusing on your own feeds, and not all the other stuff Findory wants to show you. I don't think it will work for me though, because it's not the newspaper style aggregator that I prefer.

[Uche Ogbuji]

via Copia

Don't give me that monkey-ass Web 1.0, either

Musing about whether XML and RDF are too hard (viz. Mike Champion's summary of Bosworth), and whether XQuery and OWL are really the right food for better XML tools (viz. Mike Champion's summary of Florescu), my first reaction was to the latter idea, especially with respect to XQuery. I argued that declarative programming is the key, but that it is quite possible to take advantage of declarative programming outside of XQuery. Nothing new there: I've been arguing for the marriage of XML and declarative techniques within "agile" languages for years. I don't think that declarative techniques inevitably require bondage-and-discipline type systems (thanks to Amyzing (1), (2) for that killer epithet).

Since then, I've also been pondering the XML-too-hard angle. I think folks such as Adam Bosworth are discounting the fact that as organizations increasingly build business plans around aggregation and integration of Web material, there comes an inevitable backlash against the slovenliness of HTML legacy and RSS Babel. Sloppy might be good enough for Google, but who makes money off that? Yeah. Just Google. Others such as Yahoo and Microsoft have started to see the importance of manageable text formats and at least modest depth of metadata. The IE7 team's "well-formed-Web-feeds-only" pledge is just one recent indicator that there will be a shake-up. No one will outlaw tag soup overnight, but as publishers find that they have to produce clean data, and some minimally clean metadata to participate in large parts of the not-Google-after-Web marketplace, they will fall in line. Of course this does not mean that there won't be people gaming the system, and all this Fancy Web agitation is probably just a big, speculative bubble that will burst soon and take with it all these centralizing forces, but at least in the medium term, I think that pressure on publishers will lead to a healthy market for good non-sloppy tools, which is the key to non-sloppy data.

Past success is no predictor of future performance, and that goes for the Web as well. I believe that folks whose scorn of "Web 2.0" takes them all the way back to what they call "Web 1.0" are buying airline stock in August of 2001.

[Uche Ogbuji]

via Copia

Intimations of an evil Google?

The rumblings about innocent people getting caught in Google's super-secret spam and fraud detection systems are becoming impossible to ignore. The more I hear stories from folks I respect (Ned Batchelder's case is the latest I've run across), the more I think Google has a burgeoning problem on its hands. Google benefits greatly from its "do no evil" reputation, but that reputation is starting to wear some seriously grungy shadows.

I'm also pondering the off chance that I may myself have fallen victim to some over-zealous fraud cop at Google. Recently I noticed that my personal home pages disappeared from Google search results. I recently changed to a hacked-up CherryPy set-up for these pages, and my first thought was that I'd done something violating some prime directive of search engine optimization. I've never paid much attention to SEO, so I figured I'd look into it when I had a moment and fix meta tags or whatever was looking skunky to Google's indexer. Then I noticed that the pages in question continue to occupy their typically high ranking on Yahoo search. I know the two indexes use different algorithms, but the contrast seems too sharp to be a question of simple metadata massage.

Then while doing research for my article "Google Sitemaps" I happened across this discussion of "exclusion" and "reinclusion" from Google indexes. This is the first I've ever heard of such matters, but it does rather feel like what happened to my home pages. I can't imagine what I would have done to fall afoul of Google's anti-fraud system considering I've never been the slightest bit interested in SEO, and my high ranking has always come because a gratifying number of people seem to like and link to what I write.

I'll be looking further into whether Google might have excluded my pages, but regardless of what's going on in my particular case, based on what I'm reading about Google lately, I'm beginning to think the company is really straining with the effort of balancing good vibes and break-neck growth.

[Uche Ogbuji]

via Copia

Agile Web #1: "Google Sitemaps"

"Google Sitemaps"

Uche Ogbuji's new XML.com column, "Agile Web," explores the intersection of agile programming languages and Web 2.0. In this first installment he examines Google's Sitemaps schema, as well as Python and XSLT code to generate site maps. [Oct. 26, 2005]

And with this article the "Python and XML" column has been replaced by a new one titled "Agile Web".

I wrote the Python-XML column for three years, discussing the combination of an agile programming language with an agile data format. It's time to pull the lens back a bit to take in other such technologies. This new column, "Agile Web," will cover the intersection of dynamic programming languages and web technologies, particularly the sorts of dynamic developments on the web for which some use the moniker, "Web 2.0." The primary language focus will still be Python, with some ECMAScript. Occasionally there will be some coverage of other dynamic languages as well.

In this first article I introduce the Google Sitemaps program, its XML format, and Python tools for working with it.
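
For a taste of what the article covers, here is a minimal sketch of generating a sitemap with ElementTree. I use the current sitemaps.org namespace rather than the schema the article targeted, and the URLs are placeholders:

import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def make_sitemap(pages):
    # pages is a sequence of (location, last-modified date) pairs
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

print(make_sitemap([("http://example.com/weblog/", "2005-10-26")]))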

[Uche Ogbuji]

via Copia