alan little’s weblog

xml races (3)

22nd November 2005

XML Races, Part Three. In Parts One and Two, we discovered that, for parsing moderately large Apple plist files, Fredrik Lundh’s cElementTree is very fast and memory-efficient, whereas python’s standard xml.dom.minidom and Ruby’s REXML are very slow and memory-inefficient.
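For anyone who wants to repeat this kind of comparison, here is a minimal sketch rather than the benchmark actually used in Parts One and Two. It assumes a modern Python 3, where cElementTree's C accelerator is picked up automatically by `xml.etree.ElementTree`, and it parses a small synthetic plist (the `make_plist` and `time_parse` helpers and the track data are made up for illustration) instead of a real iTunes library.

```python
import io
import time
import xml.dom.minidom
import xml.etree.ElementTree as ET


def make_plist(n_tracks):
    """Build a synthetic, iTunes-like plist as bytes (made-up data)."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<plist version="1.0"><dict><key>Tracks</key><dict>',
    ]
    for i in range(n_tracks):
        parts.append(
            "<key>%d</key><dict>"
            "<key>Track ID</key><integer>%d</integer>"
            "<key>Name</key><string>Song %d</string>"
            "</dict>" % (i, i, i)
        )
    parts.append("</dict></dict></plist>")
    return "".join(parts).encode("utf-8")


def time_parse(parse, data):
    """Return (seconds, result) for one call of parse on a fresh stream."""
    start = time.perf_counter()
    result = parse(io.BytesIO(data))
    return time.perf_counter() - start, result


data = make_plist(2000)
et_seconds, tree = time_parse(ET.parse, data)
dom_seconds, dom = time_parse(xml.dom.minidom.parse, data)
print("ElementTree: %.3fs, minidom: %.3fs" % (et_seconds, dom_seconds))
```

A single run on a small file is noisy. For numbers worth quoting, raise `n_tracks` until the file is a realistic size and repeat each parse several times.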

I was curious to try out another good C implementation: libxml2 with one of its python bindings. I know I have had libxml2 working with python before, but I now suspect that was on a Windows machine at work, because when I try it on a Mac I find myself firmly back in Open Source Dependency Hell. (No matter how many smooth and positive experiences you have with open source installations, you always know Open Source Dependency Hell could be lurking behind the next download.)

Mac OS X comes with the libxml2 C libraries installed by default, but can Mac OS X’s default python installation see them? It cannot. Can I find anywhere how to cause it to do so? I cannot. I try installing lxml, which is supposed to provide a nice ElementTree-style interface in place of libxml’s default low-level and rather fiddly C-style interface. lxml needs pyrex. I install pyrex. lxml collapses in a heap anyway when I try to compile it.
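A quick way to ask "can this python see them?" is to check which modules the interpreter can locate. The sketch below is only an illustration and assumes a modern Python 3. The names in `CANDIDATES` are the usual import names for the classic libxml2 binding, for lxml, and for the standard-library fallback, and `available_parsers` is a hypothetical helper. Locating a module does not guarantee that it will import cleanly, since a binding can still fail to link against its C library.

```python
import importlib.util

# Candidate modules: the classic libxml2 binding, lxml, and the stdlib fallback.
CANDIDATES = ("libxml2", "lxml", "xml.etree.ElementTree")


def available_parsers(names=CANDIDATES):
    """Map each module name to True if this interpreter can locate it."""
    found = {}
    for name in names:
        try:
            found[name] = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package of a dotted name is missing.
            found[name] = False
    return found


for name, ok in available_parsers().items():
    print("%-24s %s" % (name, "found" if ok else "missing"))
```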

Oh well. According to Martijn Faassen, libxml might not be that fast with python anyway.

I give up on libxml for the time being, and think instead of Chris Petrilli’s comment that ruby (and python) performance is “not quite in the league of Smalltalk (or Lisp, likely), which have extremely mature VMs with on-the-fly compilation and optimization”. Is Smalltalk then much faster than python or ruby, or comparable with C, for the task of parsing moderately large XML files?

No. Time to load and parse my iTunes library file, an 11 MB Apple plist, on a 1 GHz G4 Powerbook with VisualWorks Non-Commercial 7.3.1: about three minutes.

Much faster than REXML, a little faster than python’s default parser. A little slower than a good fast python implementation. Not even vaguely in contention with a good C implementation.

On the subject of unrealistic XML benchmarks, Uche Ogbuji rightly points out that “Nobody reads in a 3MB XML document just to throw all the data away”. True. But in the eight to ten minutes you might otherwise spend waiting for REXML to creep through the document, you can get an awful lot done with your data that you already parsed in four seconds with cElementTree.
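As an illustration of getting something done with the parsed data, here is a minimal sketch that turns a plist `dict` element into ordinary python dictionaries and adds up play counts. The `SAMPLE` document and the `plist_value` helper are made up for the example, and only the `dict`, `integer` and `string` element types are handled. A real iTunes library uses more types, and the standard library's `plistlib` module is the more robust route.

```python
import xml.etree.ElementTree as ET

# Tiny made-up plist in the shape of an iTunes library.
SAMPLE = """<plist version="1.0"><dict>
  <key>Tracks</key><dict>
    <key>1</key><dict>
      <key>Name</key><string>Song A</string>
      <key>Play Count</key><integer>3</integer>
    </dict>
    <key>2</key><dict>
      <key>Name</key><string>Song B</string>
      <key>Play Count</key><integer>7</integer>
    </dict>
  </dict>
</dict></plist>"""


def plist_value(elem):
    """Convert a plist element to a Python value (dict, int, or string)."""
    if elem.tag == "dict":
        children = list(elem)
        # Children alternate: <key>, value, <key>, value, ...
        return {
            key.text: plist_value(value)
            for key, value in zip(children[0::2], children[1::2])
        }
    if elem.tag == "integer":
        return int(elem.text)
    return elem.text  # treat everything else as a plain string


# The root is <plist>; its first child is the top-level <dict>.
library = plist_value(ET.fromstring(SAMPLE)[0])
total_plays = sum(track["Play Count"] for track in library["Tracks"].values())
print(total_plays)
```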

UPDATE

James Robertson is surprised, and finds that on his Mac Mini my file loads and parses in 61.7 seconds. The machines are roughly comparable, 1.25 GHz versus my Powerbook’s 1 GHz (and a faster frontside bus), but 256 MB versus my 512 MB, so clearly something is wrong with my test setup. His result is faster than elementtree, though still much slower than cElementTree.


all text and images © 2003–2008