Cop it while it's hot: 4Suite XML 1.0b2

Updated with working link for the manual

We've announced 4Suite XML 1.0b2. It's a big step towards a 1.0 release, even bigger than most of our releases, because with it we've trimmed the overall 4Suite package to a sensible size and scope. This release contains only the XML processing core and some support libraries. It does not contain the RDF libraries or the repository. This does not mean those components are stranded (see, for example, the rdflib/4RDF merger effort for a sense of the new juice being fed into 4Suite/RDF). It's just that the core XML libraries are so much more mature than the other components, and so much more widely used, that it made no sense not to set them free and let them march quickly to 1.0 on their own. This release features some serious performance improvements, some simplified APIs for a gentler user learning curve, and a lot of fixes and other improvements (see the announcement for the full, long list).

In fact, the code we released is just about 1.0 in itself, as far as the XML component goes. A code freeze is in place, and we'll focus on fixing bugs and wrapping up the user manual effort. (BTW, if you'd like to help chip in on the manual, please say so on the 4Suite-dev mailing list; there is a lot of material in place, and what we need is mostly in the way of editing and improving details.) Our plan is to get the XML core to 1.0 more quickly than we would have been able to before breaking 4Suite into components, and then we can focus on RDF and the repository. 4Suite/RDF will probably disappear into the new rdflib, and the repository will probably go through heavy refactoring and simplification.

Today, after some day-job tasks, my priority will be getting Amara 1.1.5 out. It's been largely ready and waiting for the 4Suite release. Some really sweet improvements and additions in this Amara release (though I do say so myself). More on that later.

[Uche Ogbuji]

via Copia

The Buzzword Firing Squad

If buzzword proliferation were a punishable crime, the penitentiaries would be full of software developers and blog authors.

Below is my list of buzzwords, catch-phrases, and technologies that need to be summarily executed without a lengthy trial:

  • Web 2.0 (do I even need a reason?)
  • AJAX (ahem... the idea of asynchronous HTTP requests for XML content is core to XForms, and better architected in that context)
  • SOA / Web Services (90% of the time people use this term they refer specifically to SOAP-based remote procedure invocation)
  • RDF/XML Syntax (This one has done more damage to RDF advocacy than any other)
  • Semantic (This term is so thoroughly abused that it would be par for the course to read of a cron job referred to as a 'semantic' process.)
  • "Yes, that's very nice and all, but does it scale?" !*&%#@@$!!!
  • Ontology: I've found it easiest to think of an ontology as a taxonomy (or polyhierarchy) with at least a minimal set of logical constraints. Without logical constraints, there is nothing 'ontological' about a model that could just as easily be represented in an E-R diagram.

I'm certain this is only about 2% of the full list; I'll be sure to add to it as more come to mind.

[Uche Ogbuji]

via Copia

Processing "Web 2.0" using XSLT document() variants? No thanks.

Mark Nottingham has written an intriguing piece "XSLT for the Rest of the Web". It's drummed up some interest, some of which has even leaked into the 4Suite mailing list thanks to the energetic Sylvain Hellegouarch. Mark says:

I’ve raved before about how useful the XSLT document() function is, once you get used to it. However, the stars have to be aligned just so to use it; the Web site can’t use cookies for anything important, and the content you’re interested in has to be available in well-formed XML.

He goes on to present a set of extension functions he's created for libxslt. They are basically smarter document() functions that can do fancy Web things, including HTTP POST, and using HTML Tidy to grab tag soup HTML as XHTML.

As I read through it, I must say my strong impression was "been there, done that, probably never looking back". Certainly no diss of Mark intended there. He's one of the sharper hackers I know. I guess we're just at different points in our thinking of where XSLT fits into the Web-savvy apps toolkit.

First of all, I think the Web has more dragons than you could easily tame with even the mightiest XSLT extension hackery. I think you need a general-purpose programming language to wrangle "Web 2.0" without drowning in tears.

More importantly, if I ever needed XSLT's document() function to process anything more than it's spec'ed to, I would consider that a pretty strong indicator that it's time to rethink part of my application architecture.

You see, I used to be a devotee of XSLT all over the place, and XSLT extensions for just about every limitation of the language. Heck, I wrote a whole framework of such things into 4Suite Repository. I've since reformed. These days I take the pipeline approach to such processing, and I keep XSLT firmly in the narrow niche for which it was designed. I have more on this evolution of thinking in "Lifting XSLT into application domain with extension functions?".

But back to Mark's idea. I actually implemented 4Suite XSLT extensions to use HTTP POST and to tidy tag soup HTML into XHTML, but I wouldn't dream of using these extensions any more. Nowadays, I use Python to gather and prepare data into a model representation that I then hand over to XSLT for pure presentation processing. Complex logical tasks such as accessing Web data beyond trivially fetched XML are matters for the model layer, not the presentation logic. For example, if I need to tidy something, I tidy it at the Python level and put what I need of the resulting XHTML into the model XML before passing it to XSLT. I use Amara XML Toolkit with John Cowan's TagSoup for my tidying needs. I prefer TagSoup to Tidy because I find it faster and more robust.
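For illustration, here's a minimal sketch of that division of labor, under stated assumptions: the URL and stylesheet name are made up, and cgi.escape stands in for a real tidying step such as TagSoup. The point is just that Python prepares the model and XSLT only presents it.

import cgi
import urllib2
from Ft.Lib import Uri
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor

#Gather the raw data at the Python (model) level; hypothetical URL
raw = urllib2.urlopen('http://example.com/soup.html').read()

#Stand-in for the tidying step: real code would run the page through
#TagSoup or the like and embed the resulting XHTML in the model
model = '<model><page>%s</page></model>' % cgi.escape(raw)

#XSLT stays in its narrow niche: pure presentation of the prepared model
processor = Processor.Processor()
processor.appendStylesheet(
    InputSource.DefaultFactory.fromUri(Uri.OsPathToUri('present.xslt')))
print processor.run(
    InputSource.DefaultFactory.fromString(model, 'urn:hypothetical-model'))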

Even if you use the libxml2 family of tools, I still think it's better to use libxml2 itself, and perhaps its HTML parser, for the model processing, then hand the resulting XML to libxslt in a separate step.

XSLT is pretty cool, but these days rather than reproduce all of Python's dozens of Web processing libraries therein, I plump for Python itself.

[Uche Ogbuji]

via Copia

More on the PyBlosxom del.icio.us plug-in, and introducing task_control.py, a pseudo-cron plug-in for PyBlosxom

Micah put my del.icio.us daily links tool to immediate use on his blog. He uncovered a bug in the character handling, which is now fixed in the posted amara_delicious.py file.

I usually invoke the script from cron, but Micah asked if there was an alternative. I've been meaning to hack up a poor man's cron for PyBlosxom and this gave me an additional push. The result is task_control.py.

A sort of poor man's cron for PyBlosxom, this plug-in allows you to specify tasks (as Python scripts) to be run only at certain intervals. Each time the plug-in is invoked, it checks the set of tasks and the last time each was run, and runs only those that have not been run within their specified interval.
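The core idea is simple enough to sketch (an illustrative approximation, not the actual plug-in source; the function and variable names here are mine):

import time

def run_due_tasks(tasks, last_run, request=None, config=None):
    #tasks: dict mapping script path -> interval in seconds
    #last_run: dict mapping script path -> last run time (epoch seconds),
    #updated in place (the real plug-in persists this in task_control_file)
    now = time.time()
    for script, interval in tasks.items():
        if now - last_run.get(script, 0) >= interval:
            #Scripts see the request and config variables, as described below
            execfile(script, {'request': request, 'config': config})
            last_run[script] = now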

To run the Amara del.icio.us daily links script once a day, you would add the following to your config file:

py["tasks"] = {"/usr/local/bin/amara_delicious.py": 24*60*60}
py["task_control_file"] = py['datadir'] + "/task_control.dat"

You could of course have multiple script/interval mappings in the "tasks" dict. The scripts are run with the variables request and config set, so, for example, when running from task_control.py, you could change the BASEDIR line in amara_delicious.py from

BASEDIR = '/srv/www/ogbuji.net/copia/pyblosxom/datadir'

to

BASEDIR = config['datadir']

[Uche Ogbuji]

via Copia

del.icio.us daily links, using Amara

I added a new feature on Copia: every day there will be an automated posting with my and Chime's del.icio.us links from the previous day. You can see an example of the results in the entry just previous to this one.

What I think is coolest is how easy it was to write, and how easy the resulting code is to understand. It's just 35 lines (including 7 lines of imports), and in that it packs some useful features I haven't found in other such scripts, including:

  • Full Unicode safety (naturally, I wouldn't have it any other way)
  • support for multiple del.icio.us feeds, with each link attributed to its author
  • tagging the PyBlosxom entry with the aggregated/unique tags from the del.icio.us entries

Here's the code. The only external requirement is Amara:

import os
import sets
import time
import codecs
import itertools
from datetime import date, timedelta

from amara import binderytools

TAGBASE = 'http://del.icio.us/tag/'

#Change BASEDIR and FEEDS to customize
BASEDIR = '/srv/www/ogbuji.net/copia/pyblosxom/datadir'
FEEDS = ['http://del.icio.us/rss/uche', 'http://del.icio.us/rss/chimezie']

now = time.gmtime()
timestamp = unicode(time.strftime('%Y-%m-%dT%H:%M:%SZ', now))
targetdate = (date(*now[:3]) - timedelta(1)).isoformat()

#Using Amara.  Easy to just grab the RSS feed
docs = map(binderytools.bind_uri, FEEDS)
items = itertools.chain(*[ doc.RDF.item for doc in docs ])
current_items = [ item for item in items
                       if unicode(item.date).startswith(targetdate) ]
if current_items:
    # Create a Markdown page with the daily bookmarks.
    entrydir = '%s/%s' % (BASEDIR, targetdate)
    if not os.path.isdir(entrydir):
        os.makedirs(entrydir)
    f = codecs.open('%s/del.icio.us.links.txt' % entrydir, 'w', 'utf-8')

    # Pyblosxom Title
    f.write(u'del.icio.us bookmarks for %s\n' % targetdate)

    tags = sets.Set()
    for item in current_items:
        tags.update([ li.resource[len(TAGBASE):] for li in item.topics.Bag.li ])
    f.write(u'#post_time %s\n'%(timestamp))
    f.write(u'<!--keywords: del.icio.us,%s -->\n'%(u','.join(tags)))

    for item in current_items:
        # List of links in Markdown.
        title = getattr(item, 'title', u'')
        href = getattr(item, 'link', u'')
        desc = getattr(item, 'description', u'')
        creator = getattr(item, 'creator', u'')
        f.write(u'* "[%s](%s)": %s *(from %s)*\n' % (title, href, desc, creator))

    f.close()

Or download amara_delicious.py.

You can see how easily you can process RSS 1.0 in Amara. I don't think actual RDF parsing/processing is one bit necessary. That extra layer is the first thing that decided me against Matt Biddulph's module, in addition to his use of libxml for XML processing, which Roberto De Almeida's module also uses.

[Uche Ogbuji]

via Copia

"Tip: Computing word count in XML documents" pubbed

"Tip: Computing word count in XML documents"

XML is text and yet more than just text -- sometimes you want to work with just the content rather than the tags and other markup. In this tip, Uche Ogbuji demonstrates simple techniques for counting the words in XML content using XSLT with or without additional tools.
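The tip works through XSLT-based approaches; just to give the flavor here in a few lines of Python (a hedged sketch with a made-up input document, not the article's actual listing), the gist is to strip away the markup and count what's left:

import xml.sax

class TextGrabber(xml.sax.ContentHandler):
    #Accumulate character data, ignoring all markup
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []
    def characters(self, data):
        self.chunks.append(data)

handler = TextGrabber()
xml.sax.parseString('<doc><p>Hello world</p><p>and three more</p></doc>',
                    handler)
print len(u''.join(handler.chunks).split())  #prints 5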

It was just a few weeks after I sent the manuscript to the editor that this thread started up on XML-DEV. Spooky timing.

[Uche Ogbuji]

via Copia

Today's XML WTF: Internal entities in browsers

This unnecessary screw-up comes from the Mozilla project, of all places. Mozilla's XML support is improving all the time, as I discuss in my article on XML in Firefox, but the developer resources seem to lag the implementation, and this often leads to needless confusion. One that I ran into recently could perhaps be given the summary: "not everything in the Mozilla FAQ is accurate". From the Mozilla FAQ:

In older versions of Mozilla as well as in old Mozilla-based products, there is no pseudo-DTD catalog and the use of entities (other than the five pre-defined ones) leads to an XML parsing error. There are also other XHTML user agents that do not support entities (other than the five pre-defined ones). Since non-validating XML processors are not required to support entities (other than the five pre-defined ones), the use of entities (other than the five pre-defined ones) is inherently unsafe in XML documents intended for the Web. The best practice is to use straight UTF-8 instead of entities. (Numeric character references are safe, too.)

Note especially the claim that the use of entities is "inherently unsafe". Someone either didn't read the spec, or is intentionally throwing up a spec distortion field. The XML 1.0 spec provides a table in section 4.4, "XML Processor Treatment of Entities and References", which tells you how parsers are allowed to treat entities, and it flatly contradicts the bogus Mozilla FAQ statement above. In brief: internal entities declared in the internal subset must be expanded even by non-validating processors; it is only external entities that may be skipped.

The main reason for the "WTF" is the fact that the Mozilla implementation actually gets it right, as it should: it's based on Expat. AFAIK Expat has always got this right (I've been using Expat for about as long as the Mozilla project has existed), so I'm not sure what inspired the above error. Mozilla should be touting its correct and useful behavior, rather than handing bogus excuses to its competitors.

This came up last week in the IBM developerWorks forum where a user was having problems with internal entities in XHTML. It turns out that he was missing an XHTML namespace (and based on my experimentation was probably serving up XHTML as text/html which is generally a no-no). It should have been a clear case of "Mozilla gets this right, and can we please get other browsers to fix their bugs?" but he found that FAQ entry and we both ended up victims of the red herring for a little while.

I didn't realize that the Mozilla implementation was right until I wrote a careful test case in preparation for my next Firefox/XML article. The following CherryPy code is a test server set-up for browser rendering of XHTML.

import cherrypy

INTENTITYXHTML = '''\
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
         "http://www.w3.org/TR/xhtml/DTD/xhtml1-strict.dtd" [
<!ENTITY internal "This is text placed as internal entity">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
  <head>
    <title>Using Entity in xhtml</title>
  </head>
  <body>
    <p>This is text placed inline</p>
    <p>&internal;</p>
    <abbr title="&internal;">Titpaie</abbr>
  </body>
</html>
'''

class root:
    @cherrypy.expose
    def text_html(self):
        cherrypy.response.headerMap['Content-Type'] = "text/html; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def text_xml(self):
        cherrypy.response.headerMap['Content-Type'] = "text/xml; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def app_xml(self):
        cherrypy.response.headerMap['Content-Type'] = "application/xml; charset=utf-8"
        return INTENTITYXHTML

    @cherrypy.expose
    def app_xhtml(self):
        cherrypy.response.headerMap['Content-Type'] = "application/xhtml+xml; charset=utf-8"
        return INTENTITYXHTML

cherrypy.root = root()
cherrypy.config.update({'server.socketPort': 9999})
cherrypy.config.update({'logDebugInfoFilter.on': False})
cherrypy.server.start()

As an example, this code serves up a content type text/html when accessed through a URL such as http://localhost:9999/text_html. You should be able to work out the other URL to content type mappings from the code, even if you're not familiar with CherryPy or Python.

Firefox 1.0.7 handles all this very nicely. For text_xml, app_xml and app_xhtml you get just the XHTML rendering you'd expect, including the correct text in the attribute value with the mouse hovered over "Titpaie".

IE6 (Windows) and Safari 1.3.1 (OS X Panther) both have a lot of trouble with this.

IE6 in the text_xml and app_xml cases complains that it can't find http://www.w3.org/TR/xhtml/DTD/xhtml1-strict.dtd. In the app_xhtml case it treats the page as a download, which is reasonable, if not convenient.

Safari in the text_xml, app_xml and app_xhtml cases complains that the entity internal is undefined (??!!).

IE6, Safari and Mozilla in the text_html case all show the same output (looking, as it should, like busted HTML). That's just what you'd expect from a tag soup mode, and emphasizes that you should leave text_html out of your XHTML vocabulary.

All this confusion and implementation difference illustrates the difficulty for folks trying to deploy XHTML, and why it's probably not yet realistic to deploy XHTML without some sort of browser sniffing (perhaps by checking the Accept header, though it's well known that browsers are sometimes dishonest with this header). I understand that the MSIE7 team hopes to address such problems. I don't know whether to expect the same from Safari. My focus in research and experimentation has been on Firefox.
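For what it's worth, here is roughly what such sniffing might look like, as an extra method on the root class of the test server above (a sketch only: it assumes CherryPy 2.x exposes request headers via headerMap just as it does response headers, it reuses INTENTITYXHTML from the earlier listing, and real-world Accept parsing should heed q values):

    @cherrypy.expose
    def sniffed(self):
        #Crude Accept check: prefer application/xhtml+xml when the
        #client claims to handle it; otherwise fall back to text/html
        accept = cherrypy.request.headerMap.get('Accept', '')
        if 'application/xhtml+xml' in accept:
            ctype = "application/xhtml+xml; charset=utf-8"
        else:
            ctype = "text/html; charset=utf-8"
        cherrypy.response.headerMap['Content-Type'] = ctype
        return INTENTITYXHTML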

One final note is that Mozilla does not support external parsed entities. This is legal (and some security experts claim even prudent). The relevant part of the XML 1.0 spec is section 4.4.3:

When an XML processor recognizes a reference to a parsed entity, in order to validate the document, the processor MUST include its replacement text. If the entity is external, and the processor is not attempting to validate the XML document, the processor MAY, but need not, include the entity's replacement text. If a non-validating processor does not include the replacement text, it MUST inform the application that it recognized, but did not read, the entity.
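For concreteness, this is the sort of reference at issue (a made-up example); a non-validating processor that chooses not to fetch chapter1.xml must still tell the application it skipped the entity:

<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY chapter SYSTEM "http://example.com/chapter1.xml">
]>
<doc>&chapter;</doc>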

I would love Mozilla to adopt the idea in the next spec paragraph:

Browsers, for example, when encountering an external parsed entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand.

That would be very useful. I wonder whether it would be possible through a Firefox plug-in (probably not: I guess it would require very tight Expat integration for plug-ins).

[Uche Ogbuji]

via Copia

RSS feeds for 4Suite (etc.) mailing lists

Jeremy Kloth set up RSS content feeds for Fourthought-hosted mailing lists, including the 4Suite and EXSLT lists (all on Mailman). The list information page for all the lists has an RSS link in the header, so it should be picked up by most news readers. For convenience, though, here are the main lists and the corresponding feeds:

[Uche Ogbuji]

via Copia

Dare's XLINQ examples in Amara

Dare's examples for XLINQ are interesting. They are certainly more streamlined than the usual C# and Java fare I see, but still a bit clunky compared to what I'm used to in Python. To be fair, a lot of that is down to the C# language, so I'd be interested in seeing what XLINQ looks like from Python.NET or Boo.

The following is my translation from Dare's fragments into corresponding Amara fragments (compatible with the Amara 1.2 branch).

'1. Creating an XML document'

import amara
#Done in 2 chunks just to show the range of options
#Another way would be to start with amara.create_document
skel = '<!--XLinq Contacts XML Example--><?MyApp 123-44-4444?><contacts/>'
doc = amara.parse(skel)
doc.contacts.xml_append_fragment("""<contact>
  <name>Patrick Hines</name>
  <phone>206-555-0144</phone>
  <address>
    <street1>123 Main St</street1>
    <city>Mercer Island</city>
    <state>WA</state>
    <postal>68042</postal>
  </address>
</contact>
""")

'2. Creating an XML element in the "http://example.com" namespace'

doc.xml_create_element(u'contacts', u'http://example.com')

'3. Loading an XML element from a file'

amara.parse_path(r'c:\myContactList.xml')

'4. Writing out an array of Person objects as an XML file'

persons = {}
persons[u'Patrick Hines'] = [u'206-555-0144', u'425-555-0145']
persons[u'Gretchen Rivas'] = [u'206-555-0163']
doc = amara.create_document(u'contacts')
for name in persons:
    doc.contacts.xml_append_fragment('<person><name>%s</name></person>'%name)
    for phone in persons[name]:
        doc.contacts.person[-1].xml_append_fragment('<phone>%s</phone>'%phone)
print doc.xml()

'5. Print out all the element nodes that are children of the <contact> element'

for c in contact.xml_child_elements():
    print c.xml()

'6. Print all the <phone> elements that are children of the <contact> element'

for c in contact.xml_xpath(u'phone'):
    print c.xml()

'7. Adding a <phone> element as a child of the <contact> element'

contact.xml_append_fragment('<phone>206-555-0168</phone>')

'8. Adding a <phone> element as a sibling of another <phone> element'

mobile = contact.xml_create_element(u'phone', content=u'206-555-0168')
first = contact.phone
contact.xml_insert_after(first, mobile)

'9. Adding an <address> element as a child of the <contact> element'

contacts.xml_append_fragment("""  <address>
    <street1>123 Main St</street1>
    <city>Mercer Island</city>
    <state>WA</state>
    <postal>68042</postal>
  </address>
""")

'10. Deleting all <phone> elements under a <contact> element'

for p in list(contact.phone): contact.xml_remove_child(p)

'11. Delete all children of the <address> element which is a child of the <contact> element'

contacts.contact.address.xml_clear()

'12. Replacing the content of the <phone> element under a <contact> element'

#Not really necessary: just showing how to clear the content
contact.phone.xml_clear()
contact.phone = u'425-555-0155'

'13. Alternate technique for replacing the content of the <phone> element under a <contact> element'

contact.phone = u'425-555-0155'

'14. Creating a contact element with attributes multiple phone number types'

#I'm sure it's clear by now how easy this would be with xml_append_fragment
#So here is the more analogous API approach
contact = contacts.xml_create_element(u'contact')
contact.xml_append(contact.xml_create_element(u'name', content=u'Patrick Hines'))
contact.xml_append(
    contact.xml_create_element(u'phone',
                               attributes={u'type': u'home'},
                               content=u'206-555-0144'))
contact.xml_append(
    contact.xml_create_element(u'phone',
                               attributes={u'type': u'work'},
                               content=u'425-555-0145'))

'15. Printing the value of the <phone> element whose type attribute has the value "home"'

print u'Home phone is:', unicode(contact.xml_xpath(u'phone[@type="home"]')[0])

'16. Deleting the type attribute of the first <phone> element under the <contact> element'

del contact.phone.type

'17. Transforming our original <contacts> element to a new <contacts> element containing a list of <contact> elements whose children are <name> and <phoneNumbers>'

new_contacts = doc.xml_create_element(u'contacts')
for c in doc.contacts.contact:
    new_contacts.xml_append_fragment('''<contact>
    <name>%s</name>
    <phoneNumbers/>
    </contact>'''%c.name)
    for p in c.phone:
        new_contacts.contact[-1].phoneNumbers.xml_append(p)

'18. Retrieving the names of all the contacts from Washington, sorted alphabetically '

wash_contacts = contacts.xml_xpath(u'contact[address/state="WA"]')
names = [ unicode(c.name) for c in wash_contacts ]
names.sort()

[Uche Ogbuji]

via Copia