
XMLTreeWalker

Mario Theodoridis

Revision History
Revision 1.0    12-29-2006
Initial Draft

Purpose

This document describes how to use XMLTreeWalker. XMLTreeWalker works in conjunction with expat. It filters and modifies the content of XML documents by means of Operators, which are activated by specified Filter rules. Uses include content extraction, suppression and rewriting.
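
As a rough illustration of the idea of activating on a filter rule, the sketch below uses only Python's standard xml.parsers.expat module to collect the text found inside the elements selected by a path of [tag, attributes] steps. PathCollector and step_matches are hypothetical names invented for this example. They are not part of XMLTreeWalker, and the real Filter and Operator classes may work differently.

```python
import xml.parsers.expat


def step_matches(step, name, attrs):
    """True if one path step, [tag] or [tag, {attr: value}], matches an element."""
    if step[0] != name:
        return False
    if len(step) > 1:
        for key, value in step[1].items():
            if attrs.get(key) != value:
                return False
    return True


class PathCollector(object):
    """Collects character data found below the element selected by a path."""

    def __init__(self, path):
        self.path = path
        self.stack = []       # one flag per open element: did it match its step?
        self.collected = []

    def _inside(self):
        # Selected when every step of the path matched the open elements
        depth = len(self.path)
        return len(self.stack) >= depth and False not in self.stack[:depth]

    def start(self, name, attrs):
        depth = len(self.stack)
        if depth < len(self.path):
            self.stack.append(step_matches(self.path[depth], name, attrs))
        else:
            self.stack.append(True)   # descendant of the path's last step

    def end(self, name):
        self.stack.pop()

    def chars(self, data):
        if self._inside() and data.strip():
            self.collected.append(data.strip())

    def parse(self, text):
        parser = xml.parsers.expat.ParserCreate()
        parser.buffer_text = True     # deliver text in one piece where possible
        parser.StartElementHandler = self.start
        parser.EndElementHandler = self.end
        parser.CharacterDataHandler = self.chars
        parser.Parse(text, True)
        return self.collected


doc = ('<html><body>'
       '<div class="article"><div class="toc"><dl><dt>kept</dt></dl></div></div>'
       '<div class="other"><dl><dt>skipped</dt></dl></div>'
       '</body></html>')
path = [['html'], ['body'],
        ['div', {'class': 'article'}],
        ['div', {'class': 'toc'}],
        ['dl']]
result = PathCollector(path).parse(doc)
```

Here result holds only the text under the matching dl element. The text under the second div is ignored because its class attribute does not match the path.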

Status

This is an initial release. It is also my first distribution ever, so it's likely to have various issues.

Requirements

  • Python 2.4
  • expat
  • testutils

Usage

The following takes an XHTML document and parses its table of contents. The PrintOperator and the DocbookTocOperator both filter the same part of the tree.

<html>
    <body>
        <div class="article">
            <div class="toc">
                <dl>
                    parsed content including the dl tag
                </dl>
            </div>
        </div>
    </body>
</html>            
        

The PrintOperator reprints everything except the parsed content. The DocbookTocOperator deals only with the parsed content. It extracts a definition list of the following type:

            
<dl>
    <dt>
        <a href='url'>
            name
        </a>
    </dt>
    <dd>
        <dl>
            <dt>
                ....

It returns a list of TocItems representing the parsed TOC data.

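import codecs
import sys
from StringIO import StringIO

# Filter, DocbookTocOperator, PrintOperator and XMLTreeWalker come from the
# XMLTreeWalker distribution; import them as appropriate for your installation.
# path and outPath are the names of the input and output files.

# Read the source document as UTF-8
src = codecs.open(path, 'rb', 'utf-8')
orig = src.read()
src.close()

# Path down to the dl element inside the TOC div
flt = Filter.getInstance([['html'], ['body'],
    ['div', {'class': 'article'}],
    ['div', {'class': 'toc'}],
    ['dl'],
])

# Path that stops at the TOC div
flt2 = Filter.getInstance([['html'], ['body'],
    ['div', {'class': 'article'}],
    ['div', {'class': 'toc'}],
])

# Walk the document with both operators attached
data = StringIO()
dto = DocbookTocOperator(False, [flt])
ops = [dto, PrintOperator(True, [flt2], data)]
wlk = XMLTreeWalker(ops)
wlk.parse(orig)

# Fetch what the PrintOperator wrote
got = data.getvalue()
data.close()

# Write the reprinted document
output = codecs.open(outPath, 'w', 'utf-8')
output.write(got)
output.close()

# Show the extracted TOC
print >>sys.stdout, dto.rootItem.prettyPrint()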
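
For readers curious what extracting such a nested definition list involves, here is a minimal sketch that uses only the standard xml.parsers.expat module. TocNode, TocBuilder and flatten are hypothetical names made up for this example. They are not the library's TocItem or DocbookTocOperator, whose behaviour may differ.

```python
import xml.parsers.expat


class TocNode(object):
    """Hypothetical TOC entry: a name, a URL and nested child entries."""

    def __init__(self, name='', url=''):
        self.name = name
        self.url = url
        self.children = []


class TocBuilder(object):
    """Builds a TocNode tree from nested dl/dt/dd markup."""

    def __init__(self):
        self.root = TocNode('root')
        self.levels = []      # parent node for each open dl element
        self.current = None   # entry whose link text is being read

    def start(self, name, attrs):
        if name == 'dl':
            if not self.levels:
                parent = self.root
            elif self.levels[-1].children:
                # A nested dl belongs to the entry that precedes it
                parent = self.levels[-1].children[-1]
            else:
                parent = self.levels[-1]
            self.levels.append(parent)
        elif name == 'a' and self.levels:
            self.current = TocNode(url=attrs.get('href', ''))
            self.levels[-1].children.append(self.current)

    def end(self, name):
        if name == 'dl':
            self.levels.pop()
        elif name == 'a' and self.current is not None:
            self.current.name = self.current.name.strip()
            self.current = None

    def chars(self, data):
        if self.current is not None:
            self.current.name += data

    def parse(self, text):
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = self.start
        parser.EndElementHandler = self.end
        parser.CharacterDataHandler = self.chars
        parser.Parse(text, True)
        return self.root


def flatten(node, depth=0):
    """Returns (depth, name, url) tuples for all entries below node."""
    rows = []
    for child in node.children:
        rows.append((depth, child.name, child.url))
        rows.extend(flatten(child, depth + 1))
    return rows


sample = ('<dl>'
          '<dt><a href="ch1.html">Chapter 1</a></dt>'
          '<dd><dl><dt><a href="ch1.html#s1">Section 1.1</a></dt></dl></dd>'
          '<dt><a href="ch2.html">Chapter 2</a></dt>'
          '</dl>')
rows = flatten(TocBuilder().parse(sample))
```

Each tuple in rows carries the nesting depth of an entry, so the section nested under the first chapter comes out at depth 1 and the chapters at depth 0.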
            
        