Parsing XML

1 October 2007

Originally published on macresearch.org, around 2007. Reproduced from the author's archive; some links may no longer resolve.

Using XML in Your Research

The eXtensible Markup Language has really taken off over the last few years, to the point that you can’t escape it in any visit to the computer book store. But while XML has thoroughly permeated the world of IT, its presence in the scientific world is far less pervasive, which is a shame, because it offers a number of advantages over do-it-yourself formats. For example, because XML is standardized, it is supported by many tools and libraries, making parsing and probing XML documents a breeze. In this short tutorial, I want to show you how you can use XML to develop interfaces to legacy applications.

The format of XML is actually pretty straightforward. If you have ever seen HTML code, you are already acquainted with an XML-like format. XML basically allows you to define elements, such as the <p> tag that defines a paragraph element in HTML. Each element can have attributes, like the href attribute in the HTML anchor tag (e.g. <a href="...">), and embedded sub-elements and text.
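
To make the terms element, attribute, sub-element and text concrete, here is a minimal sketch using Python's standard xml.dom.minidom module. The tiny document and its names (note, link, lang) are made up purely for illustration; they do not come from any real format.

```python
from xml.dom.minidom import parseString

# A made-up document: a 'note' element with one attribute,
# one nested sub-element, and some text.
doc = parseString(
    "<note lang='en'>See <link href='page.html'>this page</link> for details.</note>"
)

note = doc.documentElement
print(note.tagName)                   # name of the outermost element
print(note.getAttribute('lang'))      # value of one of its attributes

link = note.getElementsByTagName('link')[0]
print(link.getAttribute('href'))      # attribute on the nested sub-element
print(link.firstChild.data)           # text embedded in the sub-element
```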

To demonstrate, I am going to walk through a real example. I recently needed to use output from a Fortran program in some Python scripts. The Fortran program wrote its output in a form that was not very easy to parse, making it difficult to use in other programs and scripts. So I added a second mode of output that generated XML. The XML printed looked like this:

<kffile>
   <section id='General'>
      <variable id='file-ident' length='6' type='3' />
      <variable id='jobid' length='160' type='3' />
      <variable id='title' length='160' type='3' />
      <variable id='Molecular_Weight' length='1' type='2' />
      <variable id='runtype' length='160' type='3' />
      <variable id='nspin' length='1' type='1' />
      <variable id='nspinf' length='1' type='1' />
      <variable id='ldapot' length='1' type='1' />
      <variable id='xcparv' length='1' type='2' />
      <variable id='ldaen' length='1' type='1' />
      <variable id='xcpare' length='1' type='2' />
      <variable id='ggapot' length='160' type='3' />
      <variable id='ggaen' length='160' type='3' />
      <variable id='lhybrid' length='1' type='4' />
      <variable id='hybrid' length='160' type='3' />
      <variable id='iopcor' length='1' type='1' />
      <variable id='ioprel' length='1' type='1' />
      <variable id='electrons' length='1' type='2' />
      <variable id='unit of length' length='1' type='2' />
      <variable id='unit of angle' length='1' type='2' />
      <variable id='lfrozend' length='1' type='4' />
      <variable id='scfmod' length='160' type='3' />
   </section>
   <section id='Geometry'>
      <variable id='grouplabel' length='160' type='3' />
      <variable id='Geometric Symmetry' length='160' type='3' />
      <variable id='symmetry tolerance' length='1' type='2' />
      <variable id='orient' length='12' type='2' />
...

This is a fairly basic XML document. There is an all-enclosing kffile element, which contains nested section elements. Each section element in turn nests a number of variable elements. The section and variable elements each have one or more attributes. For example, a variable has an identifier (id), data length (length), and data type (type). (Note that the data itself is not included in this particular scheme, but could easily be added.)

The advantage of using an XML format to dump structured data is that reusing that data then becomes a breeze. For example, here is a Python program to read in the data above, and print out all of the section and variable names:

#!/usr/bin/env python3

from xml.dom.minidom import parse

# Read the whole document into a DOM tree.
dom = parse('dump.xml')

# Print each section id, followed by the ids of its variables.
for section in dom.getElementsByTagName('section'):
    print(section.getAttribute('id'))
    for variable in section.getElementsByTagName('variable'):
        print('  ', variable.getAttribute('id'))

That’s all. Tiny. The parsing itself is a single line of code. And don’t think that this is only possible in Python: libraries to parse XML are commonplace in nearly all languages, and typically just as easy to use.
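
As a slightly larger sketch along the same lines, the length and type attributes can be collected into a dictionary for later use. The function name read_layout and the abridged, closed-off copy of the dump are mine, for illustration only. The meaning of the numeric type codes is not given above, so they are kept as plain integers rather than interpreted.

```python
from xml.dom.minidom import parseString

# Abridged copy of the dump shown earlier, with closing tags added
# so that it is a complete document.
xml_text = """<kffile>
   <section id='General'>
      <variable id='title' length='160' type='3' />
      <variable id='nspin' length='1' type='1' />
   </section>
   <section id='Geometry'>
      <variable id='orient' length='12' type='2' />
   </section>
</kffile>"""

def read_layout(text):
    """Return {section id: {variable id: (length, type)}}."""
    dom = parseString(text)
    layout = {}
    for section in dom.getElementsByTagName('section'):
        variables = {}
        for variable in section.getElementsByTagName('variable'):
            # Attribute values always arrive as strings; convert them here.
            variables[variable.getAttribute('id')] = (
                int(variable.getAttribute('length')),
                int(variable.getAttribute('type')),
            )
        layout[section.getAttribute('id')] = variables
    return layout

layout = read_layout(xml_text)
print(layout['Geometry']['orient'])   # prints (12, 2)
```

To read from a file instead of a string, parse('dump.xml') can be swapped in for parseString(text).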

You can parse XML in a number of ways. One option is a so-called SAX parser, which reads through the document from start to finish and calls a function for each element that it encounters. The advantage of this approach is that you don’t need to read the whole document into memory.
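
For comparison, here is a rough sketch of the SAX style using Python's standard xml.sax module. The handler class name SectionPrinter and the abridged document are assumptions made for the example. The parser calls startElement once for each opening tag, so only the current element is held at any moment.

```python
import xml.sax

class SectionPrinter(xml.sax.ContentHandler):
    """Collects section and variable ids as the parser streams past them."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag it encounters.
        if name == 'section':
            self.lines.append(attrs['id'])
        elif name == 'variable':
            self.lines.append('   ' + attrs['id'])

# Abridged, closed-off copy of the dump shown earlier.
xml_bytes = b"""<kffile>
   <section id='General'>
      <variable id='title' length='160' type='3' />
      <variable id='nspin' length='1' type='1' />
   </section>
   <section id='Geometry'>
      <variable id='orient' length='12' type='2' />
   </section>
</kffile>"""

handler = SectionPrinter()
xml.sax.parseString(xml_bytes, handler)
print('\n'.join(handler.lines))
```

For a large file on disk, xml.sax.parse('dump.xml', handler) does the same thing while reading the file incrementally.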

The parser used here is a so-called Document Object Model (DOM) parser. It reads in the whole document, and represents it internally as a tree-like structure called the DOM tree. Once you have this tree, you can do anything you like with it. You can traverse it, as we have done here, and you can even modify it and print it back out again. The advantage of a DOM parser is that it is typically much easier to perform operations on the data. A disadvantage is that you have to load the whole document, which could be an issue if you have lots of data.
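
Modifying the tree and writing it back out might look something like the sketch below, again with xml.dom.minidom. The one-variable document is an abridged stand-in for the real dump, and the added variable named comment is hypothetical.

```python
from xml.dom.minidom import parseString

# Abridged stand-in for the real dump.
dom = parseString(
    "<kffile><section id='General'>"
    "<variable id='nspin' length='1' type='1' />"
    "</section></kffile>"
)

# Change an attribute on an existing element.
nspin = dom.getElementsByTagName('variable')[0]
nspin.setAttribute('length', '2')

# Create a new variable element (hypothetical name) and attach it
# to the first section.
extra = dom.createElement('variable')
extra.setAttribute('id', 'comment')
extra.setAttribute('length', '160')
extra.setAttribute('type', '3')
dom.getElementsByTagName('section')[0].appendChild(extra)

# Serialize the modified tree back to XML text.
output = dom.documentElement.toxml()
print(output)
```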

That’s it for this brief introduction to XML. In conclusion, adding an XML output option to your legacy C or Fortran application can be a simple way to make it much more useful, by providing better integration with other applications and scripting languages.