XHTML to Excel as well as OOo

In my article "Use XSLT to prepare XML for import into OpenOffice Calc" I discussed how to use XSLT to format XML for import into OpenOffice Calc.

The popular open source office suite OpenOffice.org is XML-savvy at its core. It uses XML in its file formats and offers several XML-processing plug-ins, so you might expect it to have nice tools built in for importing XML data. Unfortunately, things are not so simple, and a bit of work is required to manipulate general XML into delimited text format in order to import the data into its spreadsheet component, Calc. This article offers a quick XSLT tool for this purpose and demonstrates the Calc import of records-oriented XML. In addition to learning a practical trick for working with Calc, you might also learn a few handy XSLT techniques for using dynamic criteria to transform XML.

I do wonder about formulae, though. Via Parand Tony Darugar I found "Excel Reports with Apache Cocoon and POI", which (in the section starting with the header "Rendering Machinery") shows that you can just as easily use such a technique for MS Excel. Good to know. I've recently had reason to figure out a system for aggregating reports to and from spreadsheets and XML, and I'd possibly have to deal with simple formulae. I guess I'll cross that bridge if I ever for sure get to it, and the full OpenDocument saved-file format is always an option, but if anyone does happen to have any quick pointers, I'd be grateful.

[Uche Ogbuji]

via Copia

“Real Web 2.0: Bookmarks? Tagging? Delicious!”

“Real Web 2.0: Bookmarks? Tagging? Delicious!”

Subtitle: Learn how real-world developers and users gain value from a classic Web 2.0 site
Synopsis: In this article, you'll learn how to work with del.icio.us, one of the classic Web 2.0 sites, using Web XML feeds and JSON, in Python and ECMAScript. When you think of Web 2.0 technology, you might think of the latest Ajax tricks, but that is just a small part of the picture. More fundamental concerns are open data, simple APIs, and features that encourage users to form social networks. These are also what make Web 2.0 a compelling problem for Web architects. This column will look more than skin deep at important real-world Web 2.0 sites and demonstrate how Web architects can incorporate the best from the Web into their own Web sites.

This is the first installment of a new column, Real Web 2.0. Of course "Web 2.0" is a hype term, and as has been argued to sheer tedium, it doesn't offer anything but the most incremental advances, but in keeping with my tendency of mildness towards buzzwords I think that anything that helps focus Web developers on collaborative features of Web sites is a good thing. And that's what this column is about. It's not about the Miss AJAX pageant, but rather about open data for users and developers. From the article:

The substance of an effective Web 2.0 site, and the points of interest for Web architects (as opposed to, say, Web designers), lie in how readily real developers and users can take advantage of open data features. From widgets that users can use to customize their bits of territory on a social site to mashups that developers can use to create offspring from Web 2.0 parents, there are ways to understand what leads to success for such sites, and how you can emulate such success in your own work. This column, Real Web 2.0, will cut through the hype to focus on the most valuable features of actual sites from the perspective of the Web architect. In this first installment, I'll begin with one of the ancestors of the genre, del.icio.us.

And I still don't want that that monkey-ass Web 1.0. Anyway, as usual, there's lots of code here. Python, Amara, ECMAScript, JSON, and more. That will be the recipe (mixing up the ingredients a bit each time) as I journey along the poster child sites for open data.

[Uche Ogbuji]

via Copia

“Mix and match Web components with Python WSGI”

“Mix and match Web components with Python WSGI”

Subtitle: Learn about the Python standard for building Web applications with maximum flexibility
Synopsis: Learn to create and reuse components in your Web server using Python. The Python community created the Web Server Gateway Interface (WSGI), a standard for creating Python Web components that work across servers and frameworks. It provides a way to develop Web applications that take advantage of the many strengths of different Web tools. This article introduces WSGI and shows how to develop components that contribute to well-designed Web applications.

Despite the ripples in the Python community over Guido's endorsement of Django (more on this in a later posting), I'm not the least bit interested in any one Python Web framework any more. WSGI has set me free. WSGI is brilliant. It's certainly flawed, largely because of legacy requirements, but the fact that it's so good despite those flaws is amazing.

I wrote this article because I think too many introductions to WSGI, and especially middleware, are either too simple or too complicated. In line with my usual article writing philosophy of what could I have read when I started out to make me understand this topic more clearly, I've tried to provide sharp illustration of the WSGI model, and a few clear and practical examples. The articles I read that were too simple glossed over nuances that I think should really be grasped from the beginning (and are not that intimidating). In the too-complicated corner is primarily PEP 333 itself, which is fairly well written, but too rigorous for an intro. In addition, I think the example of WSGI middleware in the PEP is very poor. I'm quite proud of the example I crafted for this article, and I hope it helps encourage more people to create middleware.

I do want to put in a good word for Ian Bicking and Paste. He has put in tireless effort to evangelize WSGI (it was his patient discussion that won me over to WSGI). In his Paste toolkit, he's turned WSGI's theoretical strengths into readily-available code. On the first project I undertook using a Paste-based framework (Pylons), I was amazed at my productivity, even considering that I'm used to productive programming in Python. The experience certainly left me wondering why, BDFL or no BDFL, I would choose a huge mega-framework over a loosely-coupled system of rich components.

[Uche Ogbuji]

via Copia

“Get free stuff for Web design”

"Get free stuff for Web design"

Subtitle: Spice up your Web site with a variety of free resources from fellow designers
Synopsis: Web developers can find many free resources, although some are freer than others. If you design a Web site or Web application, whether static or with all the dynamic Ajax goodness you can conjure up, you might find resources to lighten your load and spice up your content. From free icons to Web layouts and templates to on-line Web page tools, this article demonstrates that a Web architect can also get help these days at little or no cost.

This was a particularly fun article to write. I'm no Web designer (as you can tell by looking at any of my Web sites), but as an architect specializing in Web-fronted systems I often have to throw a Web template together, and I've slowly put together a set of handy resources for free Web raw materials. I discuss most of those in this article. On Copia Chimezie recently mentioned OpenClipart.org, which I've been using for a while, and is mentioned in the article. I was surprised it was news to as savvy a reader as Sam Ruby. That makes me think many other references in the article will be useful.

[Uche Ogbuji]

via Copia

"Tip: Rescue terrible HTML with TagSoup"

Well, since I've so emphatically broken my Weblogging pause for The Cup, I'd better post some professional items.

"Tip: Rescue terrible HTML with TagSoup"

Subtitle: Turn poorly formed HTML into valid XHTML
Synopsis: XHTML is a friendly enough format for parsing and screen-scraping, but the Web still has a lot of messy HTML out there. In this tip Uche Ogbuji demonstrates the use of TagSoup to turn just about any HTML into neat XHTML.

TagSoup is very handy. EVen though it's a Java project I put it to use from Python code fairly often. It also recently went full 1.0.

[Uche Ogbuji]

via Copia

Four Mozilla/XML bugs to vote on (or to help with)

In a recent conversation with colleagues some of the limitations of XML processing in Mozilla came up. I think some of these are really holding Mozilla and Firefox back from being a great platform for XML processing, and so I wanted to highlight them here. Remember that the key to bringing attention to an important bug/request is to vote for it in the tracker, so please consider doing so. I already have done.

18333: "XML Content Sink should be incremental". The description says it all:

Large XML documents, such as the W3C's XSLT spec, take an incredibly long time to load into view source. The browser freezes/blocks (is "not responding" according to Windows) while it processes, and finally unlocks after the entire source of the document is ready for display.

Firefox will never really be a friendly platform for XML processing until this is addressed. There is not really a problem addressing this using the Mozilla's underlying parser, Expat. Worst case one could use that parser's suspend/resume facility (we recently took advantage of this to allow Python-generator-based access to 4Suite Saxlette parsing). The real issue is the amount of work that would need to be done across the Mozilla code base. Unfortunately, Mozilla insiders have been predicting a fix for this problem for a while, and unless there's a sudden boost in votes or better yet resources to help fix the problem, I'm not feeling very optimistic.

69799: "External entities are not included in XML document". Using Betty Harvey's example,

<!ENTITY extFile SYSTEM "extFile.xml">

Is rendered as if Mozilla read


Of course you have to watch out for XSS type attacks, but I imagine Mozilla could handle this the same way it does loaded stylesheets: by restricting to same host domain as the document entity.

193678: "support exslt:common". The node-set extension function is pretty much required for complex XSLT processing, so support from Mozilla would really help open up the landscape of what you can do with XSLT in the browser.

98413: "Implement XML Catalogs". A request to implement OASIS Open XML Catalogs. This could do a lot to encourage support for external entities because some performance problems could be reduced by using a catalog to load a local version of the resource.

A few on my personal would-be-nice-to-fix-but-not-essential list are:

See also:

[Uche Ogbuji]

via Copia

Processing "Web 2.0" using XSLT document() variants? No thanks.

Mark Nottingham has written an intriguing piece "XSLT for the Rest of the Web". It's drummed up some interest, some of which has even leaked into the 4Suite mailing list thanks to the energetic Sylvain Hellegouarch. Mark says:

I’ve raved before about how useful the XSLT document() function is, once you get used to it. However, the stars have to be aligned just so to use it; the Web site can’t use cookies for anything important, and the content you’re interested in has to be available in well-formed XML.

He goes on to present a set of extension functions he's created for libxslt. They are basically smarter document() functions that can do fancy Web things, including HTTP POST, and using HTML Tidy to grab tag soup HTML as XHTML.

As I read through it, I must say my strong impression was "been there, done that, probably never looking back". Certainly no diss of Mark intended there. He's one of the sharper hackers I know. I guess we're just at different points in our thinking of where XSLT fits into the Web-savvy apps toolkit.

First of all, I think the Web has more dragons than you could easily tame with even the mightiest XSLT extension hackery. I think you need general-purpose programming language to wrangle "Web 2.0" without drowning in tears.

More importantly, if I ever needed XSLT's document() function to process anything more than it's spec'ed to, I would consider that a pretty strong indicator that it's time to rethink part of my application architecture.

You see, I used to be a devotee of XSLT all over the place, and XSLT extensions for just about every limitation of the language. Heck, I wrote a whole framework of such things into 4Suite Repository. I've since reformed. These days I take the pipeline approach to such processing, and I keep XSLT firmly in the narrow niche for which it was designed. I have more on this evolution of thinking in "Lifting XSLT into application domain with extension functions?".

But back to Mark's idea. I actually implemented 4Suite XSLT extensions to use HTTP POST and to tidy tag soup HTML into XHTML, but I wouldn't dream of using these extensions any more. Nowadays, I use Python to gather and prepare data into a model representation that I then hand over to XSLT for pure presentation processing. Complex logical tasks such as accessing Web data beyond trivially fetched XML are matters for the model layer, and not the presentation logic. For example, if I need to tidy something, I tidy it at the Python level and put what I need of the resulting XHTML into the model XML before passing it to XSLT. I use Amara XML Toolkit with John Cowan's TagSoup for my tidying needs. I prefer TagSoup rather than tidy because I find it's faster and more robust.

Even if you use the libxml2 family of tools, I still think it's better to use libxml, and perhaps the libxml HTML parser to do the model processing and hand over resulting XML to libxslt in a separate step.

XSLT is pretty cool, but these days rather than reproduce all of Python's dozens of Web processing libraries therein, I plump for Python itself.

[Uche Ogbuji]

via Copia

Nofollow-free Copia

I read this interesting post by Darryl VanDorp:

Hey people new to this whole blog thing. If you want me to actually leave a comment on that new blog of yours. Turn off that damn default nofollow in your shiny wordpress installation ... If I take the time to leave a comment give me some juice!

I quite agree. Yes, I know nofollow came about as a way to discourage Weblog spammers, but I don't believe it will ever work. It is just too easy to write robots that leave comment spam, and unless 100% of systems use nofollow, which is never likely, it will always be worth the spammers' while to keep trying on the off chance that they find sites without nofollow. Also, spammers are probably happy to have their messages on Weblogs, google juice or no.

Look, we don't want comment spam on Copia, period. It's not enough to deprive them of google juice. We work to keep the site spam-free. All comments go in an approval queue, and we have a lot of handy little tools to help eliminate comment spam, so we can batch approve the remaining goodness. I think this takes a lot less effort than the actual work of composing entries on the blog (and I'm a fast writer). What's more relevant, it's less effort than it takes for visitors to compose their comments. As a result of this effort there is no need to use nofollow on Copia, and we don't do so. I don't know whether commenters get significant juice from Copia, but what juice we do have to give we shall not stint our correspondents. (Not even if they are here to engage in the heresy of disputing our positions.)

I can't help thinking of the gardening analogy. If you want a nice garden, you have to roll up your sleeves and pull up your weeds. There is no quick fix to preventing weeds: the conditions that make your soil attractive for germination of desirable flowers also make it attractive to Uncle Darwin's rogues.

[Uche Ogbuji]

via Copia

Another small 4Suite MarkupWriter example: XHTML 1.1

I was writing code to emit XHTML 1.1 using 4Suite and just to double-check the doc types I looked at the spec. I thought it might be useful to write up a small MarkupWriter example for emitting the example in the spec.

from Ft.Xml.MarkupWriter import MarkupWriter

XHTML11_SYSID = u"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"
XHTML11_PUBID = u"-//W3C//DTD XHTML 1.1//EN"

writer = MarkupWriter(indent=u"yes", doctypeSystem=XHTML11_SYSID,
writer.startElement(u'html', XHTML_NS, attributes={(u'xml:lang', XML_NS): u'en'})
writer.startElement(u'head', XHTML_NS)
writer.simpleElement(u'title', XHTML_NS, content=u'Virtual Library')
writer.endElement(u'head', XHTML_NS)
writer.startElement(u'body', XHTML_NS)
writer.startElement(u'p', XHTML_NS)
writer.text(u'Moved to ')
writer.simpleElement(u'a', XHTML_NS,
                     attributes={u'href': u'http://vlib.org/'},
writer.endElement(u'p', XHTML_NS)
writer.endElement(u'body', XHTML_NS)
writer.endElement(u'html', XHTML_NS)

It's worth mentioning that this example would be even simpler with template output facilities I've proposed for Amara.

[Uche Ogbuji]

via Copia