Revision History | |
---|---|
Revision 1.0 | 12-29-2006 |
Initial Draft | |
This document describes how to use XmlTreeWalker, which works in conjunction with the expat XML parser. It filters and modifies XML documents by applying Operators that activate according to specified Filter rules. Uses include content extraction, suppression, and rewriting.
This is an initial release. It is also my first distribution ever, so it's likely to have various issues.
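To make the Filter-rule idea described above more concrete, here is a minimal sketch of path matching written directly against Python's `xml.parsers.expat`. It is not XmlTreeWalker's implementation. The rule format only mirrors the lists passed to `Filter.getInstance` later in this document, and the names `RULE`, `step_matches` and `find_matches` are hypothetical helpers introduced for illustration.

```python
import xml.parsers.expat

# Assumed rule format: each step is [tag] or [tag, {attribute: value}],
# matched against the element path starting at the document root.
RULE = [['html'], ['body'],
        ['div', {'class': 'article'}],
        ['div', {'class': 'toc'}]]


def step_matches(step, tag, attrs):
    """Return True if one rule step matches an element's tag and attributes."""
    if step[0] != tag:
        return False
    wanted = step[1] if len(step) > 1 else {}
    return all(attrs.get(key) == value for key, value in wanted.items())


def find_matches(xml_text, rule):
    """Collect the text inside every element whose root path matches rule."""
    stack = []               # (tag, attrs) for each currently open element
    matches = []             # text collected for each matching subtree
    state = {'depth': None}  # stack depth at which the current match began

    def path_matches():
        return len(stack) == len(rule) and all(
            step_matches(step, tag, attrs)
            for step, (tag, attrs) in zip(rule, stack))

    def start(tag, attrs):
        stack.append((tag, attrs))
        if state['depth'] is None and path_matches():
            state['depth'] = len(stack)
            matches.append('')

    def end(tag):
        # Leaving the element that started the match ends the match.
        if state['depth'] == len(stack):
            state['depth'] = None
        stack.pop()

    def chars(data):
        if state['depth'] is not None:
            matches[-1] += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return matches


doc = ('<html><body><div class="article">'
       '<div class="toc"><dl><dt>Chapter 1</dt></dl></div>'
       '<div class="body">text</div>'
       '</div></body></html>')
print(find_matches(doc, RULE))
```

In this sketch only the `div` with `class="toc"` satisfies every step of the rule, so the text of the sibling `div` is ignored. An Operator in the real product would presumably act on the matched subtree instead of just collecting its text.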
The following example takes an XHTML document and parses its table of contents. The PrintOperator and the DocbookTocOperator both filter the same part of the tree:
    <html>
      <body>
        <div class="article">
          <div class="toc">
            <dl>
              parsed content including the dl tag
            </dl>
          </div>
        </div>
      </body>
    </html>
The PrintOperator reprints everything except the parsed content. The DocbookTocOperator deals only with the parsed content. It extracts a nested definition list of the form

    <dl>
      <dt> <a href='url'> name </a> </dt>
      <dd>
        <dl>
          <dt> ....

and returns a list of TocItems representing the parsed TOC data.
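The nested definition-list shape described above can be turned into a tree of entries with a few expat handlers. The sketch below is an illustration only and does not reproduce DocbookTocOperator. `SketchTocItem`, `parse_toc` and `pretty` are hypothetical names, and the sketch assumes that a `<dl>` nested inside a `<dd>` holds the children of the entry that precedes it.

```python
import xml.parsers.expat


class SketchTocItem(object):
    """Hypothetical stand-in for a TOC entry; not the library's TocItem."""

    def __init__(self, url=None):
        self.name = ''
        self.url = url
        self.children = []


def parse_toc(xml_text):
    """Build a list of top-level SketchTocItem objects from nested <dl> markup."""
    top = []
    lists = []               # stack of child lists, one per open <dl>
    state = {'link': None}   # the <a> element currently being read, if any

    def start(tag, attrs):
        if tag == 'dl':
            if not lists:
                lists.append(top)
            elif lists[-1]:
                # A nested <dl> holds the children of the previous entry.
                lists.append(lists[-1][-1].children)
            else:
                lists.append(lists[-1])
        elif tag == 'a' and lists:
            item = SketchTocItem(url=attrs.get('href'))
            lists[-1].append(item)
            state['link'] = item

    def end(tag):
        if tag == 'dl':
            lists.pop()
        elif tag == 'a' and state['link'] is not None:
            state['link'].name = state['link'].name.strip()
            state['link'] = None

    def chars(data):
        if state['link'] is not None:
            state['link'].name += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return top


def pretty(items, indent=0):
    """Return the tree as a list of indented 'name -> url' lines."""
    lines = []
    for item in items:
        lines.append('  ' * indent + '%s -> %s' % (item.name, item.url))
        lines.extend(pretty(item.children, indent + 1))
    return lines


toc_xml = ('<dl>'
           '<dt><a href="ch1.html"> Chapter 1 </a></dt>'
           '<dd><dl><dt><a href="ch1.html#s1">Section 1.1</a></dt></dl></dd>'
           '<dt><a href="ch2.html">Chapter 2</a></dt>'
           '</dl>')
print('\n'.join(pretty(parse_toc(toc_xml))))
```

Keeping a stack of child lists means each closing `</dl>` returns to the parent level, so the nesting depth of the markup maps directly onto the depth of the resulting tree.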
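    import codecs
    from io import StringIO  # Python 2 before 2.6: from StringIO import StringIO

    # Filter, DocbookTocOperator, PrintOperator and XMLTreeWalker are provided
    # by XmlTreeWalker; import them from wherever the package is installed.
    # `path` and `outPath` are the input and output file names.

    # Read the source document as UTF-8 ("in" is a Python keyword, so use "src").
    src = codecs.open(path, 'rb', 'utf-8')
    orig = src.read()
    src.close()

    # flt selects the <dl> inside the TOC div; flt2 selects the TOC div itself.
    flt = Filter.getInstance([['html'], ['body'],
                              ['div', {'class': 'article'}],
                              ['div', {'class': 'toc'}],
                              ['dl'],
                              ])
    flt2 = Filter.getInstance([['html'], ['body'],
                               ['div', {'class': 'article'}],
                               ['div', {'class': 'toc'}],
                               ])

    data = StringIO()
    dto = DocbookTocOperator(False, [flt])
    ops = [dto, PrintOperator(True, [flt2], data)]
    wlk = XMLTreeWalker(ops)
    wlk.parse(orig)

    # Write the reprinted document to the output file.
    got = data.getvalue()
    data.close()
    output = codecs.open(outPath, 'w', 'utf-8')
    output.write(got)
    output.close()

    # Print the extracted TOC tree.
    print(dto.rootItem.prettyPrint())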