For a while 4Suite has had an 80/20 DOM implementation completely in C: Domlette (formerly named cDomlette). Jeremy has been making a lot of performance tweaks to the C code, and current CVS is already 3-4 times faster than Domlette in 4Suite 1.0a4.
In addition, Jeremy stealthily introduced a new feature to 4Suite, Saxlette. Saxlette uses the same Expat C code Domlette uses, but exposes it as SAX. So we get SAX implemented completely in C. It follows the Python/SAX API normally, so for example the following code uses Saxlette to count the elements:
    from xml import sax

    furi = "file:ot.xml"

    class element_counter(sax.ContentHandler):
        def startDocument(self):
            self.ecount = 0

        def startElementNS(self, name, qname, attribs):
            self.ecount += 1

    parser = sax.make_parser(['Ft.Xml.Sax'])
    handler = element_counter()
    parser.setContentHandler(handler)
    parser.parse(furi)
    print "Elements counted:", handler.ecount
If you don't care about PySax compatibility, you can use the more specialized API, which involves the following lines in place of the equivalents above:
    from Ft.Xml import Sax
    ...
    class element_counter():
        ....
    parser = Sax.CreateParser()
The code changes needed from the first listing above to regular PySax are minimal. As Jeremy puts it:
Unlike the distributed PySax drivers, Saxlette follows the SAX2 spec and defaults feature_namespaces to True and feature_namespace_prefixes to False, both of which are not allowed to be changed (which is exactly what SAX2 says is required). Python/SAX defaults to SAX1 behavior and Saxlette defaults to SAX2 behavior.
The following is a PySax example:
    from xml import sax

    furi = "file:ot.xml"

    #Handler has to derive from sax.ContentHandler,
    #or, in practice, implement all interfaces
    class element_counter(sax.ContentHandler):
        def startDocument(self):
            self.ecount = 0

        #SAX1 startElement by default, rather than SAX2 startElementNS
        def startElement(self, name, attribs):
            self.ecount += 1

    parser = sax.make_parser()
    handler = element_counter()
    parser.setContentHandler(handler)
    parser.parse(furi)
    print "Elements counted:", handler.ecount
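The SAX1 versus SAX2 defaults can also be seen with the standard library alone, by toggling feature_namespaces on the stock PySax parser. The following is a minimal sketch, not Saxlette code: the inline sample document, the ElementCounter class and the count_elements helper are all made up for illustration. With the feature off the parser calls startElement; with it on it calls startElementNS, which is the only mode Saxlette offers.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Sample document invented for this sketch: four elements in one namespace
XML = b'<doc xmlns="http://example.com/ns"><a/><b><c/></b></doc>'

class ElementCounter(ContentHandler):
    def startDocument(self):
        self.sax1_count = 0
        self.sax2_count = 0

    # Called when feature_namespaces is off (the PySax default)
    def startElement(self, name, attribs):
        self.sax1_count += 1

    # Called when feature_namespaces is on (the SAX2 default)
    def startElementNS(self, name, qname, attribs):
        self.sax2_count += 1

def count_elements(namespaces):
    """Return (startElement calls, startElementNS calls) for XML."""
    parser = sax.make_parser()
    parser.setFeature(feature_namespaces, namespaces)
    handler = ElementCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(XML))
    return handler.sax1_count, handler.sax2_count
```

Calling count_elements(False) routes every element through startElement, and count_elements(True) routes every element through startElementNS, so a handler written for one mode sees nothing in the other unless it implements both callbacks.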
The speed difference is huge. Jeremy did some testing with timeit.py (using more involved test code than the above), and in those limited tests Saxlette showed up as fast as, and in some cases a bit faster than, cElementTree and libxml/Python (and much, much faster than xml.sax in all cases). Interestingly, Domlette is now within 30%-40% of Saxlette in raw speed, which is impressive considering that it is building a fully functional DOM. As I've said in the past, I'm done with the silly benchmarks game, so someone else will have to pursue matters in further detail if they really can't do without their hot dog eating contests.
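For anyone who does want to pursue it, a timeit harness for the element-counting task might look something like the sketch below. This is not Jeremy's test code, and it does not include Saxlette. It only compares the standard library's xml.sax with ElementTree (which uses its C accelerator when available on current Pythons), and the synthetic document and the helper names are assumptions made for the example.

```python
import timeit
from xml import sax
import xml.etree.ElementTree as ET

# Synthetic input invented for this sketch: one root plus 1000 <item><name/></item> pairs
XML = b"<doc>" + b"<item><name>x</name></item>" * 1000 + b"</doc>"

class ElementCounter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0

    def startElement(self, name, attribs):
        self.ecount += 1

def count_with_sax():
    # Stream the document through xml.sax and count start tags
    handler = ElementCounter()
    sax.parseString(XML, handler)
    return handler.ecount

def count_with_elementtree():
    # Build the whole tree, then walk it (root included)
    return sum(1 for _ in ET.fromstring(XML).iter())

def seconds_per_run(func, number=5):
    """Average wall-clock seconds for one call of func."""
    return timeit.timeit(func, number=number) / number

if __name__ == "__main__":
    for func in (count_with_sax, count_with_elementtree):
        print(func.__name__, seconds_per_run(func))
```

Both functions should return the same count, which is worth checking before trusting any timing. Numbers from a toy document like this say little about real workloads.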
In another exciting development Saxlette has gained a generator mode using Expat's suspend/resume capability. This means you can have a Saxlette handler yield results from the SAX callbacks. It will allow me, for example, to have Amara's pushdom and pushbind work without threads, eliminating a huge drag on their performance (context switching is basically punishment). I'm working this capability into the code in the Amara 1.2 branch. So far the effects are dramatic.
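To give a feel for what yielding results from SAX callbacks without threads looks like, here is a rough approximation using only the standard library. It is not Saxlette's generator mode and does not use Expat's suspend/resume, which the stock Python bindings don't expose. Instead it swaps in incremental parsing: the caller feeds the parser one chunk at a time and the generator drains whatever the callbacks queued up in between. The QueueingHandler class and iter_element_names function are names made up for this sketch.

```python
from collections import deque
from xml import sax
from xml.sax.handler import ContentHandler

class QueueingHandler(ContentHandler):
    """Collects element names so a generator can hand them out later."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.pending = deque()

    def startElement(self, name, attribs):
        self.pending.append(name)

def iter_element_names(chunks):
    """Yield element names as the parser reports them, with no threads."""
    parser = sax.make_parser()
    handler = QueueingHandler()
    parser.setContentHandler(handler)
    for chunk in chunks:
        # feed() runs the SAX callbacks for whatever this chunk completes
        parser.feed(chunk)
        while handler.pending:
            yield handler.pending.popleft()
    # close() flushes anything the parser was still holding back
    parser.close()
    while handler.pending:
        yield handler.pending.popleft()

names = list(iter_element_names([b"<doc><a/>", b"<b><c/></b>", b"</doc>"]))
```

The granularity here is one chunk rather than one callback, so it is coarser than true suspend/resume, but the consumer still pulls results lazily from a single thread.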