Does XML give away the keys to the data warehouse?

"Does XML give away the keys to the warehouse?"

In this ADT article I reflect on some of the implications of XPath injection attacks, and to what extent XML and open data are a danger to developers.

While I don’t claim to have foreseen XPath injection attacks, it does strike me that this security problem is made possible by practices that I and others have always discouraged. One problem is the phenomenon of production XML as database dump. Developers love to create titanic XMLfiles, often as monolithic dumps from databases. Sometimes they deploy such monsters to servers susceptible to the cleverness of attackers.

If someone does compromise the server, they can pilfer one file and have your information warehouse at their hands.

I wrote this article a long time ago, and I actually didn't know if it would be published, because of editorial changes at ADT. I just discovered it by accident yesterday. I'm glad to see it "in print".

[Uche Ogbuji]

via Copia

Domlette and Saxlette: huge performance boosts for 4Suite (and derived code)

For a while 4Suite has had an 80/20 DOM implementation completely in C: Domlette (formerly named cDomlette). Jeremy has been making a lot of performance tweaks to the C code, and current CVS is already 3-4 times faster than Domlette in 4Suite 1.0a4.

In addition, Jeremy stealthily introduced a new feature to 4Suite, Saxlette. Saxlette uses the same Expat C code Domlette uses, but exposes it as SAX. So we get SAX implemented completely in C. It follows the Python/SAX API normally, so for example the following code uses Saxlette to count the elements:

from xml import sax

furi = "file:ot.xml"

class element_counter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0

    def startElementNS(self, name, qname, attribs):
        self.ecount += 1

parser = sax.make_parser(['Ft.Xml.Sax'])
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(furi)
print "Elements counted:", handler.ecount

If you don't care about PySax compatibility, you can use the more specialized API, which involves the following lines in place of the equivalents above:

from Ft.Xml import Sax
...
class element_counter():
....
parser = Sax.CreateParser()

The code changes needed from the first listing above to regular PySax are minimal. As Jeremy puts it:

Unlike the distributed PySax drivers, Saxlette follows the SAX2 spec and defaults feature_namespaces to True and feature_namespace_prefixes to False both of which are not allowed to be changed (which is exactly what SAX2 says is required). Python/SAX defaults to SAX1 behavior and Saxlette defaults to SAX2 behavior.

The following is a PySax example:

from xml import sax

furi = "file:ot.xml"

#Handler has to derive from sax.ContentHandler,'
#or, in practice, implement all interfaces
class element_counter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0

    #SAX1 startElement by default, rather than SAX2 startElementNS
    def startElement(self, name, attribs):
        self.ecount += 1

parser = sax.make_parser()
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(furi)
print "Elements counted:", handler.ecount

The speed difference is huge. Jeremy did some testing with timeit.py (using more involved test code than the above), and in those limited tests Saxlette showed up as fast as, and in some cases a bit faster than cElementTree and libxml/Python (much, much faster than xml.sax in all cases). Interestingly, Domlette is now within 30%-40% of Saxlette in raw speed, which is impressive considering that it is building a fully functional DOM. As I've said in the past, I'm done with the silly benchmarks game, so someone else will have to pursue matters to further detail if they really can't do without their hot dog eating contests.

In another exciting development Saxlette has gained a generator mode using Expat's suspend/resume capability. This means you can have a Saxlette handler yield results from the SAX callbacks. It will allow me, for example, to have Amara's pushdom and pushbind work without threads, eliminating a huge drag on their performance (context switching is basically punishment). I'm working this capability into the code in the Amara 1.2 branch. So far the effects are dramatic.

[Uche Ogbuji]

via Copia

Xampl, re: "XML data bindings, static languages, dynamic languages"

In response to XML data bindings, static languages, dynamic languages Bob Hutchison posted some thoughts. As I used Amara as the kernel of my demonstrations, Bob used his project xampl as the kernel of his. He introduces xampl in another entry which was inspired by my own article on EaseXML.

Xampl is a an XML data binding. As Bob writes:

Secondly, there are versions of xampl for Java and Common Lisp. I’ve got an old (summer 2002) version for Ruby that needs updating (I wrote the xampl-pp pull parser to support this experiment).

Bob says that Xampl also deals with things that Elliotte Harold mentions as usual scourges for Java data bindings: mixed content, repeated elements, omitted elements, and element order. Of course these things should be food and drink to any XML tool, and I'm glad folks are finally plugging such gaping holes. Eric van der Vlist is also in the game with TreeBind, and it seems some Java tools try to wriggle out of the pinch by using XQuery.

Based on Bob's snippets, Xampl looks handy. Rather verbose, but no more so than Java pretty much requires. One thing that strikes me in Bob's examples is that Xampl appears to require and create a bogon namespace (http://www.xampl.com/labelsExample). It seems maybe it has something to do with Java packaging or something, but regardless of the role of this fake namespace, the XML represented by Xampl in Bob's snippets is not the same as the XML in the original source examples. An unprefixed element in a namespace is of course not the same thing as an element in the null namespace. I would not accept any tool that involves such a mix-up. It's quite possible that Xampl does not do so, and I'm just misunderstanding Bob's examples.

Bob provides Xampl code to match the EaseXML snippets in my article. Similarly to how EaseXML requires Python framework code, Xampl requires XML framework code. Since "XML situps" have been on the wires lately, they come to mind for a moment, but hey, if you're already processing XML with Xampl, I suppose you might not flinch at one more XML. I will point out that Amara does not require any framework code whatsoever besides the XML itself, not even an XML schema. It effectively provides dynamic code generation.

Xampl turns XML constructs into Java getters, e.g. html.getHead(). Amara uses the Python convention of properties rather than getters and setters, so you have html.head, and you can even assign to this property in order to mutate the XML. Xampl looks neat. The things that turn me off are largely things that are pretty much inevitable in Java, not least the very large amount of code generated by the binding. It supports XPath, as Amara does, and provides a "rhino" option to expose XML objects through Javascript, which offers you a bit more of the flexibility of Python (I don't know how much overhead to expect from Javascript through Java through XML, but it's a question I'd be quick to ask as a user).

It's good to have projects such as Xampl and Treebind and Nux. I'd rather use Python tools such as Amara, Gnosis and GenerateDS, but Java has the visibility and it's good for people to be aware that XML does not necessarily require greater imprisonment of expression than what comes with the application language. You don't need to accept crazy idioms and stifling limitations in matters as fundamental as mixed content and element ordering. XML and sanity can coexist.

[Uche Ogbuji]

via Copia

Python + XML = wary coexistence

There has been quite a bit of discussion triggered by my article "Python and XML: Should Python and XML Coexist?". This sort of thing always surprises me more than it should. I like to post code-heavy articles and leave the philosophy to the occasional entry, or to this very Weblog, but it seems that people respond more vocally to philosophy than to code. Perhaps I'll discuss with Kendall, my editor, what this suggests in terms of future directions for my Python/XML column.

Anyway, first I point to PJE's response. I used quotes from his Weblog as jumping-off points for my article.

Uche Ogbuji liberally quotes from and analyzes two of my XML-v.-Python rants, and actually gets it completely right. Since at least one of those rants has been cited as meaning I think XML is the spawn of Satan, I'm glad Uche read closely enough to get the context and nuance, without projecting things into it that I didn't say. Kudos!

I don't claim to know whom PJE speaks of when he refers to other commentary on his rant, but Martijn Faassen indicated his own response. I do think that Martijn missed some of PJE's intended nuance, but to be fair, it took me more than one reading to catch that nuance. I think that PJE could have saved himself a lot of misunderstanding, but hell, I've had my turn at thickly nuanced rant myself, so I see both sides. Looking more broadly at the landscape, Martijn puts succinctly what I've said in the past.

This disdain for XML technologies is very common among Python programmers.

But maybe that means something greater than petty rivalry. Mike Champion brought up my article on XML-DEV:

For some time now we've seen the JSON "fat-free alternative to XML" direction that some in the AJAX world are taking to address both XML's inefficiency and the mismatch with programming languages. Now I see that many in the Python community have a similar attitude toward XML and encourage its use only when necessary to exchange data with non-Python apps.

He followed with a list of thoughts, touching on the likely roles of JSON, Python, XML, and more, and I responded. To much to quote from the exchange. Read the originals yourself, if you like. I will mention the final thought in my response:

In many ways I think a vicious backlash from programming languages against XML is just what XML needs right now.

In saying that, I had in mind some of my other prosaic articles about the direction of XML, including:

I think that many XML folks have been working to encroach on the territory of languages such as Python, even if Python folks aren't always clear on this fact while complaining about XML. We'll just have to see how it all shakes out. I know what pattern of tool usage I'll stick to for now. Speaking of omni-tools, Dimitre Novatchev put in a plug for XSLT as general-purpose programming language, which he's also done here in Copia comments. I still think it's a bad idea to treat XSLT as anything other than a template language. XSLT in its place, Python (or Javascript, Ruby, or whatever) in its place.

In the comments on my article there are some interesting bits, including one correspondent's mention of the importance of open file formats, and the XML's role in this, followed bewilderingly by:

C++ is so powerful that with the right classes, many of the advantages of a scripting language are attainable.

Sounds like someone who badly needs to actually try Python.

[Uche Ogbuji]

via Copia

Windows prebuilt binary package for Amara

Several Amara users have mentioned trying to build for Windows but running into problems, or a requirement for .NET. Sylvain Hellegouarch comes to the rescue with a binary package he built using Amara 1.0, latest 4Suite CVS, Python 2.4.1 and .NET 1.1 (latest patches). Windows users can install this package without having to worry about compilers or any other such hassles. Sylvain posted it under the default name generated by distutils, but I renamed it as appropriate and I'm hosting it on the Amara home page, and in the Amara contrib FTP area. Please let Sylvain and me know (by posting to the 4Suite mailing list with "[Amara]" in the subject) if you have any trouble with this package.

Meanwhile work on the 1.2 branch continues. I've got up to 30% speed and memory usage improvement over Amara Bindery 1.0, in large part by moving from PySax to 4Suite's undocumented Saxlette library (all in highly optimized C). Jeremy is also working on suspend/resume for 4Suite's parser, which would allow for a huge performance boost for pushbind. I'll try to start working on RELAX NG features if I get a chance this week.

[Uche Ogbuji]

via Copia

Python/XML column #36: Should Python and XML Coexist?

"Python and XML: Should Python and XML Coexist?"

In his latest Python and XML column, Uche Ogbuji claims that the costs of using XML as a little language in a Python application may outweigh the benefits of doing so. [Aug. 25, 2005]

In this article I discuss some of the recent round of complaints about XML in the Python community, trying to give perspective that Python and XML should serve very different domains. Treating them in competition for any particular task is often a more general problem of misunderstanding the basic nature of one technology or the other, and it often leads to overstated complaints.

A correspondent already asked one good follow-up question:

My question is simply: do you have a recommendation for an alternative language (or other protocol) that is more suitable for expressing data structures, preferably one that can be coded for reasonably quickly in Python?

YAML seems to be the front-runner for a cross-language data structure format, although JSON is hot these days, courtesy the AJAX hype. I tend to point to Paul Tchistopolskii's, "Alternatives to XML" for a more comprehensive list.

[Uche Ogbuji]

via Copia

2006 conferences, part 2

"Save the date: XTech 2006"

The foundations are in place for my favorite XML conference. Have I mentioned how hard XTech 2005 rocked? Hmm? Haven't I?. Ah. Indeed I have. The reviews were pretty much overwhelmingly bright.

And perhaps the best news of all is that the conference is still in Amsterdam, but it's moving from the sterile RAI to the Krasnapolsky Hotel on Dam Square. More time to frolic in that fly city center.

As for XML 2005, Atlanta, I'd thought briefly about going (submitting a late-breaking talk and all that), but I'm through my conference quota for the year, and looking through the program I honestly didn't find enough to inspire me to make the stretch. I do hope they follow Edd's lead and bring in some fresh blood (topics wise) for future conferences.

[Uche Ogbuji]

via Copia

Amara goes 1.0, gets simpler to install

I released Amara 1.0 today. It's been properly cooked for a while, but life always has its small interruptions. The big change in 1.0 is that I now offer it in a couple of package options.

For those who just need some XML processing, and don't really care about RDF, all you need is Python and Amara-1.0-allinone (grab it from the FTP site). It has a trimmed down subset of 4Suite bundled for one-step install.

For those who want the full complement of what 4Suite has to offer, or for those who've already installed 4Suite anyway, there is the stand alone Amara-1.0 package.

Here's the thing, though. Right now it feels to me that I should be pushing Amara-allinone and not Amara+4Suite. Amara-allinone contains all the 1.0 quality components of 4Suite right now. 4Suite is a combination of a rock solid core XML library, a somewhat out of date RDF library, and a quite rickety server framework. This has been bad for 4Suite. People miss out on the power of its core XML facilities because of the size and uneven quality of the rest of the package. In fact, 4Suite has been stuck in the 1.0 alpha and beta stages for ever, not because the core libraries aren't 1.0 quality (heck, they're 4.x quality), but because we keep hoping to bring the rest of the package up to scratch. It's been on my mind for a while that we should just split the package up to solve this problem. This is what I've done, in effect, with this Amara release.

As for the rest of 4Suite, the RDF engine just needs a parser update for the most recent RDF specifications. The XML/RDF repository probably doesn't need all that much more work before it's ready for 1.0 release. As for the protocol server, as I've said several times before, I think we should just chuck it. Better to use another server package, such as CherryPy.

As for Amara, I'll continue bug fixes on the 1.x branch, but the real fun will be on the 2.x branch where I'll be refactoring a few things.

[Uche Ogbuji]

via Copia

XSLT 2.0 and push/pull

I just finally got a chance to read Bob DuCharme's article "Push, Pull, Next!", which starts by referring to my "Push vs pull XSLT". It shows how one might use XSLT 2.0's xsl:next-match to stay with push in some instances where pull becomes attractive. This instruction is similar in idea to XSLT 1.0's xsl:apply-imports, except that it doesn't require you to organize templates into separate, imported files. It also supports xsl:with-param, which is also available in XSLT 2.0's version of xsl:apply-imports. Bob wasn't clear enough in his article that XSLT 1.0 also has xsl:apply-imports, but that's clarified int he comments. One important aspect of the use of these instructions in XSLT 2.0 is that xsl:with-param becomes so much more useful in the new version now that default template rules no longer discard parameters. XSLT 2.0 did manage here to squash one of the bigger gotchas in XSLT 1.0.

[Uche Ogbuji]

via Copia

Another small 4Suite MarkupWriter example: XHTML 1.1

I was writing code to emit XHTML 1.1 using 4Suite and just to double-check the doc types I looked at the spec. I thought it might be useful to write up a small MarkupWriter example for emitting the example in the spec.

from Ft.Xml.MarkupWriter import MarkupWriter
from xml.dom import XHTML_NAMESPACE, XML_NAMESPACE

XHTML_NS = unicode(XHTML_NAMESPACE)
XML_NS = unicode(XML_NAMESPACE)
XHTML11_SYSID = u"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"
XHTML11_PUBID = u"-//W3C//DTD XHTML 1.1//EN"

writer = MarkupWriter(indent=u"yes", doctypeSystem=XHTML11_SYSID,
                      doctypePublic=XHTML11_PUBID)
writer.startDocument()
writer.startElement(u'html', XHTML_NS, attributes={(u'xml:lang', XML_NS): u'en'})
writer.startElement(u'head', XHTML_NS)
writer.simpleElement(u'title', XHTML_NS, content=u'Virtual Library')
writer.endElement(u'head', XHTML_NS)
writer.startElement(u'body', XHTML_NS)
writer.startElement(u'p', XHTML_NS)
writer.text(u'Moved to ')
writer.simpleElement(u'a', XHTML_NS,
                     attributes={u'href': u'http://vlib.org/'},
                     content=u'vlib.org')
writer.text(u'.')
writer.endElement(u'p', XHTML_NS)
writer.endElement(u'body', XHTML_NS)
writer.endElement(u'html', XHTML_NS)
writer.endDocument()

It's worth mentioning that this example would be even simpler with template output facilities I've proposed for Amara.

[Uche Ogbuji]

via Copia