The task is to parse a simple XML document, and analyze the contents by line number.
The right Python package seems to be xml.sax. But how do I use it?
After some digging in the documentation, I found:
- The xmlreader.Locator interface has the information: getLineNumber().
- The handler.ContentHandler interface has setDocumentLocator().
The first thought would be to create a Locator, pass this to the ContentHandler, and read the information off the Locator during calls to its characters() method, etc.
BUT, xmlreader.Locator is only a skeleton interface: its getLineNumber() and getColumnNumber() methods can only ever return -1.
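To see the problem concretely, here is a minimal sketch that instantiates the base class directly and asks it for a position. The variable name `bare` is just an illustration.

```python
from xml.sax import xmlreader

# The base Locator is only an interface: it is not attached to any parser,
# so it has no position to report and falls back to placeholder values.
bare = xmlreader.Locator()

print(bare.getLineNumber())    # -1
print(bare.getColumnNumber())  # -1
```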
So as a poor user, WHAT am I to do, short of writing a whole Parser and Locator of my own??
I'll answer my own question presently.
(Well I would have, except for the arbitrary, annoying rule that says I can't.)
I was unable to figure this out using the existing documentation (or by web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).
The xml.sax function make_parser() by default creates a real Parser, but what kind of thing is that?
In the source code one finds that it is an ExpatParser, defined in expatreader.py.
And...it has its own Locator, an ExpatLocator. But, there is no access to this thing.
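A quick way to check this without opening the source is to inspect the object that make_parser() hands back. This is only a sketch: it assumes a stock CPython install where the expat driver is available, and the variable names are mine.

```python
import xml.sax
import xml.sax.expatreader
from xml.sax import xmlreader

parser = xml.sax.make_parser()

# On a standard install the default driver is the expat-based reader.
print(type(parser).__name__)
print(isinstance(parser, xml.sax.expatreader.ExpatParser))

# ExpatLocator is a real Locator implementation; its constructor takes
# the parser whose position it should report.
locator = xml.sax.expatreader.ExpatLocator(parser)
print(isinstance(locator, xmlreader.Locator))
```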
Much head-scratching came between this and a solution.
- write your own ContentHandler, which knows about a Locator, and uses it to determine line numbers
- create an ExpatParser with xml.sax.make_parser()
- create an ExpatLocator, passing it the ExpatParser instance
- make the ContentHandler, giving it this ExpatLocator
- pass the ContentHandler to the parser's setContentHandler()
- call parse() on the Parser
For example:
    from __future__ import print_function  # print() also works on Python 2.7

    import sys
    import xml.sax
    import xml.sax.expatreader  # imported explicitly: ExpatLocator lives here

    class EltHandler( xml.sax.handler.ContentHandler ):
        def __init__( self, locator ):
            xml.sax.handler.ContentHandler.__init__( self )
            self.loc = locator
            self.setDocumentLocator( self.loc )

        def startElement( self, name, attrs ): pass

        def endElement( self, name ): pass

        def characters( self, data ):
            # Ask the locator where the parser currently is.
            lineNo = self.loc.getLineNumber()
            print( "LINE", lineNo, data )

    def spit_lines( filepath ):
        try:
            parser = xml.sax.make_parser()
            locator = xml.sax.expatreader.ExpatLocator( parser )
            handler = EltHandler( locator )
            parser.setContentHandler( handler )
            parser.parse( filepath )
        except IOError as e:
            print( e, file=sys.stderr )

    if len( sys.argv ) > 1:
        filepath = sys.argv[1]
        spit_lines( filepath )
    else:
        print( "Try providing a path to an XML file.", file=sys.stderr )
Martijn Pieters points out below another approach with some advantages.
If the superclass initializer of the ContentHandler is properly called,
then it turns out a private-looking, undocumented member ._locator is
set, which ought to contain a proper Locator.
Advantage: you don't have to create your own Locator (or find out how to create it).
Disadvantage: it's not documented anywhere, and using an undocumented private variable is sloppy.
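A minimal sketch of that second approach might look like the following. It relies on the undocumented ._locator member described above, so treat it as an illustration rather than a supported API; the class name LineHandler, the `seen` list and the SAMPLE document are made up for the example.

```python
import xml.sax

class LineHandler(xml.sax.handler.ContentHandler):
    """Records (line number, element name) for every start tag."""

    def __init__(self):
        # The superclass initializer creates self._locator; the parser
        # fills it in via setDocumentLocator() when parsing starts.
        xml.sax.handler.ContentHandler.__init__(self)
        self.seen = []

    def startElement(self, name, attrs):
        # Undocumented, private-looking member: may change between versions.
        self.seen.append((self._locator.getLineNumber(), name))

# Hypothetical three-element document, one start tag per line.
SAMPLE = b"<root>\n  <a/>\n  <b/>\n</root>\n"

handler = LineHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.seen)  # [(1, 'root'), (2, 'a'), (3, 'b')]
```

Note that no Locator is created by hand here: the handler only reads the one the parser supplies.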
Thanks Martijn!