Embedded markup

In an article I'm working on I refer to Norm Walsh's piece "Embedded Markup Considered Harmful" and his follow-up "Escaped Markup: What To Do Instead". I've always urged people to use type="xhtml" in Atom rather than type="html" and do the tidying to XHTML in the aggregation processing stage, and my arguments largely line up with Norm's.

From a comment by Dan Connolly I also found this older XML.com article with the same title as Norm's: "Embedded Markup Considered Harmful" by Theodor Holm Nelson. I can't make much sense of what Mr. Nelson is saying. Can you?

[Uche Ogbuji]

via Copia

XSLT l10n in 2 tags? 20 says the gentleman over there? Ah, the lady in green has 200.

I was pretty well ROTFL after reading this Daily WTF (via XSLT Blog). OK, so that code isn't even really doing l10n. I'm not sure what the coder thinks it's doing. It's a complete exercise in useless cut and paste. But it's worth noting that you can do the task of competent l10n in a tenth of the tag load used in the WTF example (see DocBook XSL), and you can even do it using a tenth of the tag load used in DocBook, if you don't mind using an XSLT extension module.

[Uche Ogbuji]

via Copia

Follow-up Copia housecleaning

Of course chores lead to more chores. After last week's round of tweaks to Copia I got a suggestion from Aristotle to rearrange the entry-specific titles, and I've done so. I got a bit more info from Tom Passin about possible encoding problems that has only deepened my bafflement.

I also noticed there has been some confusion over last week's birth announcement. It came from Chimezie, not me (congrats, brother!). On Copia the author is specified for each entry, but previously there wasn't any such useful distinction being made in the Atom 0.3 (I'll be working on an Atom 1.0 flavor for PyBlosxom soon) or RSS 1.0 feeds. I've fixed that, but I've done so in a way I'm not sure all feed sinks will process correctly. In the Atom feed there is a top-level

<author>
  <name>Uche and Chimezie Ogbuji</name>
  <url>http://copia.ogbuji.net/blog/</url>
  <email>uche@ogbuji.net</email>
</author>

And then for each entry a more specific author, for example:

<title>Chikaora Zion Credell Ogbuji</title>
...
<author>
  <name>chimezie</name>
</author>

I hope that helps. I made some other tweaks to the feeds, and this does seem to have had the unfortunate side-effect of pushing everything back onto the front page of Planet XML. My apologies to Planet XML readers (including me: I'd hoped to catch up after the holidays and found only Copia entries).

Copia already tells you the author of each entry, in the info line at the end of the entry.

[Uche Ogbuji]

via Copia

Copia housecleaning

I've finally had some time today, as I prepare for the holidays, to fix some things on Copia that have been broken for too long. Some of the highlights, especially concerning issues mentioned by readers (thanks, guys), are:

RSS 1.0 feed body fix. Added rss:description field for the RSS 1.0 feed, which fixes missing post bodies in readers such as Bloglines which don't support content:encoded. I do truncate the field to 500 characters, according to the recommendation in the spec.

Single entry view title fix. Added entry titles for single entry pages. Before today, if you viewed this entry through the perma-link, the title would just say "Copia"; now it says "Copia ✏Copia housecleaning". I've wanted to do this for a while, but I was having the devil of a time figuring out how to do it with PyBlosxom. A scolding from Dan Connolly forced me to chase down a fix. For other PyBlosxom users the trick is to use the comments plug-in, copy the head.* flavor file to comment-head.*, and then update it to use the $title variable, which is the title of the entry itself ($blog_title is the title of the entire blog). In my case the updated HTML header template looks like:

<title>$blog_title &#x270F;$title</title>

I did get a report that Copia is incorrectly sending the Content-Type header `text/html;charset=ISO-8859-1`, but when I check using the LiveHTTPHeaders extension for Firefox on Linux it reports the correct charset=UTF-8 from the server. If anyone else can corroborate this issue, please leave a comment with the specific URL from which you noticed the error, your platform and browser, and the HTTP sniffing tool you were using. Thanks.
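For anyone comparing notes on that header, here is a minimal Python sketch of pulling the charset parameter out of a Content-Type value with the standard library. The helper name charset_of is made up for illustration, and it only parses a header string you have already captured with your own tool; it does not fetch anything from Copia.

```python
from email.message import Message


def charset_of(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None if absent
    return msg.get_content_charset()


# The reported (wrong) header and the one I see from the server
print(charset_of("text/html;charset=ISO-8859-1"))
print(charset_of("text/html; charset=UTF-8"))
```

Comparing the two results side by side is usually enough to tell whether a proxy or tool in between is rewriting the header.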

[Uche Ogbuji]

via Copia

Ouch. I feel your pain, Sam

Seems Sam Ruby's presentation suffered a bit in the spectacle. Unfortunately, I'm no stranger to presentation set-up problems. I've also been lucky enough to have patient audiences. Maybe conference organizers will someday factor Linux A/V support into consideration when choosing venues (I can dream, eh?). I almost always can use projectors and stuff with no problem in usual business scenarios, and I can only guess that conference venues tend to have archaic A/V technology that doesn't like Linux.

As for the presentation itself, based on the slides much of it is an accumulation of issues probably well known to, say, a long-time XML-DEV reader, but useful to collect in one place. It looks like a much-needed presentation, and I hope Sam gets to present it again, with better luck with the facilities. Here follow a few reactions I had to stuff in the slides.

expat only understands utf-8

This hasn't been true for ages. Expat currently understands UTF-8, UTF-16, ASCII, ISO-8859-1, out of the box, and the user can add to this list by registering an "unknown encoding" event handler.
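A quick way to see this for yourself is through Python's standard binding to Expat. This is only a sketch: the parse_text helper and the sample documents are made up for illustration, and it exercises the built-in encodings rather than the "unknown encoding" handler.

```python
import xml.parsers.expat


def parse_text(data):
    """Parse XML bytes with Expat and return the concatenated character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return "".join(chunks)


# The same tiny document serialized in two non-UTF-8 encodings
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>café</p>'.encode("iso-8859-1")
utf16_doc = '<?xml version="1.0" encoding="UTF-16"?><p>café</p>'.encode("utf-16")

print(parse_text(latin1_doc))
print(parse_text(utf16_doc))
```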

Encoding was routinely ignored by most of the initial RSS parsers and even the initial UserLand RSS validator. “Aggregators” did the equivalent of strcat from various sources and left the results for the browser

Yuck. Unfortunately, I worry that Mark Pilgrim's Universal Feed Parser might not help the situation with its current practice of returning some character data as strings without even guessed encoding information (that I could find, anyway). I found it very hard to build a character-correct aggregator around the Feed Parser 4.0 alpha version. Then again, I understand it's a hard problem with all the character soup ("char soup"?) Web feeds out there.

[Buried] in a non-normative appendix, there is an indication that the encoding specified in an XML document may not be authoritative.

Nope. There is no burial going on. As I thought I'd pointed out on Copia before (but I can't find the entry now), section "4.3.3 Character Encoding in Entities" of XML 1.0 says:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

So the normative part of the spec also makes it quite clear that an externally specified encoding can trump what's in the XML or text declaration.
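Here is a small sketch of what that looks like in practice, again using Python's Expat binding. ParserCreate accepts an encoding argument that overrides whatever the document declares, which is the hook for passing along a charset from the transport. The helper name, its transport_encoding parameter and the mislabeled sample document are my own inventions for illustration.

```python
import xml.parsers.expat


def parse_text(data, transport_encoding=None):
    """Parse XML bytes; an externally supplied encoding overrides the declaration."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate(transport_encoding)
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


# Bytes are really UTF-8, but the XML declaration claims ISO-8859-1
mislabeled = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xc3\xa9</p>'

declared = parse_text(mislabeled)            # trusts the declaration: mojibake
external = parse_text(mislabeled, "UTF-8")   # external charset wins: correct text
print(repr(declared), repr(external))
```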

The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata.

Very apt. I think that's why XML's attributes work as well as they do (despite the fact that they are so inexplicably maligned in some quarters).

In fact, Microsoft’s new policy is that they will always completely ignore [HTTP Content-Type] for feeds—even when the charset is explicitly present

XML of course doesn't force anyone to conform to RFC 3023, but Microsoft could prove itself a really good Web citizen by adopting it. Maybe they could lead the way to reducing the confusion I mention in this entry.

I consider Ruby's section on the WS-* mess to be an excellent indictment of the silly idea of universal and strong data typing.

In general, programming XML is hard.

Indeed it is. Some people seem to think this is a product of architecture astronautics. They are laughably mistaken. XML is hard because managing data is hard. Programmers have developed terrible habits through long years of just throwing their data over the wall at a SQL DBMS and hoping all goes OK in the end. The Web is ruthless in punishing such diffidence.

XML is the first technology that has forced mainstream programmers to truly have to think hard about data. This is a boundlessly good thing. Let the annoyances proliferate (that's your cue, Micah).

[Uche Ogbuji]

via Copia

XOXO versus Atom versus XBEL for Web feed lists?

Very useful response to my quest for XOXO understanding. Wow. In fact, I've had so much good discussion here and on the #atom IRC that I'm not sure where to start.

I'll start with what my current leaning is, after all that discussion. I think I want to use XBEL for managing my bookmarks, and I'll advocate the necessary extensions, and hope they are as well received as XBEL itself has been.

On to my general thoughts.

XOXO. I was pointed to a thread in which my use of rel="webfeed" was rightly deprecated. I was just grasping for something appropriate in XHTML space, and I don't think there really is. I got into an argument about whether type="application/atom+xml" is the way to go, and I'm still not fully convinced it is. The way it works for me to think about it is that for each item I have zero or more links I want to consider human-readable content links, and zero or more I want to consider feed links. It's my arbitrary decision how to decide which is which, but media type alone doesn't tell the tale. As an example, what if someone has a Weblog that is served as custom XML with XSLT to render it in my browser? The media type would be application/xml and yet it would be a content link. I see the weakness in my own argument: after all, a user agent can also make an Atom document nicely rendered for a human reader, so the distinction I'm trying to make might be an impossible one regardless of method. Maybe I'll have to cave in to the "use type" argument. I'll try it out in this entry.

So back to my first format example for XOXO. If I throw in examples of "folders", I come up with:

<ol class="xoxo">
  <li>
    <p>Technology</p>
    <ol>
      <li>
        <ul>
          <li>
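            <a href="http://example.com/weblog/" type="text/html">Weblog home</a>
            <a href="http://example.com/weblog/atom" type="application/atom+xml">Weblog feed</a>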
            <dl>
              <dt>description</dt>
              <dd>That good ole Weblog</dd>
            </dl>
          </li>
        </ul>
      </li>
    </ol>
  </li>
</ol>

Umm. That's a round mouthful of tags. I'm not sure I really want to have to squint at that. I also don't like how <description>...</description> becomes <dl><dt>description</dt><dd>...</dd></dl>. That's too much markup indirection for me. Even worse than the usual reductio ad absurdum of <element name="description">....

Atom. Aristotle and then Mark suggested using Atom for Web feed list exchange. It's one of those "DUH!" moments. I can't believe it didn't occur to me. The main problem I have with Atom is that it really only offers one level of hierarchy: feed/entry. My Web feed list is hierarchical. The ready solution is to use categories to simulate hierarchies.

<entry>
  <id>http://example.com/weblog/</id>
  <updated>2005-05-23T15:38:00-08:00</updated>
  <title>Example Weblog</title>
  <link rel="self" href="http://example.com/weblog/" type="text/html">Weblog home</link>
  <link rel="alternate" href="http://example.com/weblog/atom" type="application/atom+xml">Weblog feed</link>
  <summary>That good ole Weblog</summary>
  <category term="Technology"/>
</entry>

I'd also need some way to separately express the hierarchy of categories. I might also have to use the ranking extension (see for example this article) to preserve item order, if I care about it. I don't know. This looks a bit of a stretch. It certainly wouldn't be very friendly to edit by hand. The fact that the updated element is mandated, for one thing, tends to color Atom as a machine-generated-only format. This is OK for a lot of Atom's use cases, but I think it's a real bummer for my present one.
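To make the "categories as folders" idea concrete, here is a rough Python sketch of how a consumer might rebuild one level of folders from category terms. It assumes Atom 1.0's namespace, and the second entry, the "Poetry" term and the function name are invented for illustration. Note that it does nothing about nesting folders inside folders, which is exactly the part Atom leaves me to solve separately.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ATOM = "{http://www.w3.org/2005/Atom}"

# Sample feed list; the second entry is hypothetical
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://example.com/weblog/</id>
    <title>Example Weblog</title>
    <link rel="alternate" href="http://example.com/weblog/atom" type="application/atom+xml"/>
    <category term="Technology"/>
  </entry>
  <entry>
    <id>http://example.org/other/</id>
    <title>Another Weblog</title>
    <category term="Technology"/>
    <category term="Poetry"/>
  </entry>
</feed>"""


def folders_from_categories(feed_xml):
    """Group entry titles by category term, treating each term as a flat folder."""
    folders = defaultdict(list)
    root = ET.fromstring(feed_xml)
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        for category in entry.iter(ATOM + "category"):
            folders[category.get("term")].append(title)
    return dict(folders)


for folder, titles in folders_from_categories(FEED).items():
    print(folder, titles)
```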

XBEL. XBEL is just the little XML format engine that could. Despite its great age and recent neglect, it's possibly the most widely deployed of these options, because of its use in browser bookmark formats. Not that I think that's any reason for preferring XBEL. I do like how it looks, though:

<folder>
  <title>Technology</title>
  <bookmark href="http://example.com/weblog/">
    <title>Example Weblog</title>
    <info>
      <metadata owner="webfeed">
        <link href="http://example.com/weblog/atom" type="application/atom+xml"/>
        <description>That good ole Weblog</description>
      </metadata>
    </info>
  </bookmark>
</folder>

Even with the required info/metadata layer, it's much cleaner than the other two options. I think that we could at least get into XBEL 1.1 native elements for alternate links and for bookmark descriptions (which I think were already proposed by others), so we could perhaps eliminate the info/metadata layer in XBEL 1.1. Regardless, I think I'll manage things for myself in XBEL as above and just use XSLT to convert to whatever starts to emerge as a viable option for export to other tools (or as a way to export to OPML if I have to).
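I mention XSLT for the conversion, but as a rough illustration of how little work the extraction step is, here is a Python sketch instead. It walks an XBEL document shaped like the example above and flattens it into rows that any exporter could consume. The owner="webfeed" metadata convention is my own, and the function name and row keys are hypothetical.

```python
import xml.etree.ElementTree as ET

XBEL = """<xbel>
  <folder>
    <title>Technology</title>
    <bookmark href="http://example.com/weblog/">
      <title>Example Weblog</title>
      <info>
        <metadata owner="webfeed">
          <link href="http://example.com/weblog/atom" type="application/atom+xml"/>
          <description>That good ole Weblog</description>
        </metadata>
      </info>
    </bookmark>
  </folder>
</xbel>"""


def feed_rows(xbel_xml):
    """Flatten nested XBEL folders into one dict per bookmark."""
    rows = []

    def walk(node, path):
        for child in node:
            if child.tag == "folder":
                # Recurse, extending the folder path with this folder's title
                walk(child, path + [child.findtext("title", default="")])
            elif child.tag == "bookmark":
                meta = child.find("info/metadata[@owner='webfeed']")
                link = meta.find("link") if meta is not None else None
                rows.append({
                    "folder": "/".join(path),
                    "title": child.findtext("title", default=""),
                    "site": child.get("href"),
                    "feed": link.get("href") if link is not None else None,
                    "description": meta.findtext("description") if meta is not None else None,
                })

    walk(ET.fromstring(xbel_xml), [])
    return rows


for row in feed_rows(XBEL):
    print(row)
```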

[Uche Ogbuji]

via Copia

I must be missing something about XOXO (and maybe microformats in general)

As I mentioned, I started working with XBEL as a way to manage my Web feeds. It occurred to me that I should consider XOXO for this purpose, since it has more traditionally been put up in opposition to OPML.

Well, I don't get it. Sure, it's simple XHTML with some conventions for overlaid semantics, but how does that do anything for interoperability of Webfeed subscription lists? I've taken a look at attention.xml, and that seems more thoroughly specified, but it's way overkill for my needs.

Look, all I need to do is represent a categorized structure of feed items with the following information per item:

  1. Web feed URL (e.g. RSS or Atom link)
  2. title
  3. optional description or notes
  4. optional Web site URL (e.g. Link to HTML or XHTML page for Weblog)

The trouble with OPML is that there are dozens of ways to encode these four bits of information, and as I've found, tools pretty much range all across that dozen. Besides, OPML is really poor XML design. That's a practical and not just aesthetic concern, because I expect to manage this information partly by hand.

XBEL is much better markup design, but I don't know that it has a natural way to represent the distinction between feed and content URL for the same item ("bookmark").

Everything I heard about XOXO led me to believe that this is a slam dunk, but hardly. The XOXO "spec" is not all that illuminating, and from what I can gather there or elsewhere, there are also a dozen ways I could encode the above information. Perhaps:

<li>
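  <a href="http://example.com/weblog/" type="text/html">Weblog home</a>
  <a href="http://example.com/weblog/atom" type="application/atom+xml">Weblog feed</a>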
  <dl>
    <dt>description</dt>
    <dd>That good ole Weblog</dd>
  </dl>
</li>

Perhaps (so that the likely HTML rendering is not jumbled):

<li>
  <ul>
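    <li><a href="http://example.com/weblog/" type="text/html">Weblog home</a></li>
    <li><a href="http://example.com/weblog/atom" type="application/atom+xml">Weblog feed</a></li>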
  </ul>
  <dl>
    <dt>description</dt>
    <dd>That good ole Weblog</dd>
  </dl>
</li>

But since Weblog contents could be XML, is it really safe to use media type as the distinguishing mark between Web site and Web feed links? OK, so perhaps:

<li>
  <ul>
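    <li><a href="http://example.com/weblog/" rel="website">Weblog home</a></li>
    <li><a href="http://example.com/weblog/atom" rel="webfeed">Weblog feed</a></li>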
  </ul>
  <dl>
    <dt>description</dt>
    <dd>That good ole Weblog</dd>
  </dl>
</li>

But now I've invented a relationship vocabulary (I guess this is technically my own microformat) and why would I expect another XOXO tool to use rel="website" and rel="webfeed"?

I could go on with possible variations. I do like the way that I can simply refer to the XHTML Hypertext Attributes Module to get some general ideas about semantics, but that's not really good enough because I have a fairly specific need.

I imagine someone will say that XOXO is just a general outlining format, and can't specify such things because it's all about being micro. But in that case why do people put XOXO itself forward as a solution for Webfeed corpus exchange? I can't see how XOXO can do the job without overlaying yet another microformat on it. And if we need to stack microformat on microformat to address such a simple need, what's wrong with good old macroformats? You know: a real, specialized XML format.

I've really only spent an hour or two exploring XOXO (although according to the microformats hype I shouldn't expect to need more time than that), so maybe I'm missing something. If so, I'd be grateful for any enlightening comments.

[Uche Ogbuji]

via Copia