To: xml-sig@python.org
Cc: string-sig@python.org
Subject: Shallow Parsing with Regular Expressions

Hello,

I ran across an interesting article titled "REX: XML Shallow Parsing
with Regular Expressions" by Robert D. Cameron that I thought others
following the XML-SIG and String-SIG might find intriguing. From the
article's abstract:

  "The syntax of XML is simple enough that it is possible to parse an
  XML document into a list of its markup and text items using a single
  regular expression. Such a shallow parse of an XML document can be
  very useful for the construction of a variety of lightweight XML
  processing tools. However, complex regular expressions can be
  difficult to construct and even more difficult to read. Using a form
  of literate programming for regular expressions, this paper documents
  a set of XML shallow parsing expressions that can be used as a basis
  for simple, correct, efficient, robust and language-independent XML
  shallow parsing. Complete shallow parser implementations of less than
  50 lines each in Perl, JavaScript and Lex/Flex are also given."

The paper was just published in Markup Languages: Theory and Practice,
Volume 1, Number 3, Summer 1999, pp. 61-88 (MIT Press) but is available
online as a technical report at

  ftp://fas.sfu.ca/pub/cs/TR/1998/CMPT1998-17.html

The regex he creates ends up being about 1K long (DFA and efficient),
and can match just about any part of an XML document you're interested
in. What you do next (entity reference extraction, attribute
processing, error detection, etc.) is left to the reader. ;-)

The Perl version of the regex works as-is with Python (of course). I've
adjusted for Python syntax and attached 'REX.py' to this message for
anyone interested.

The article is clearly written, and the method he uses to build up the
regex out of small, well-reasoned pieces is very instructive (although
his retangle and reweave literate programming tools aren't detailed).
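To give a feel for the shallow-parsing idea without the full 1K
expression, here is a minimal sketch. It is not Cameron's REX pattern:
the pattern and the names SHALLOW_RE and shallow_parse are made up for
illustration, and the pattern is far less careful than REX (for
instance, it mishandles a '>' inside an attribute value and silently
drops a stray '<' that never closes).

```python
import re

# Simplified stand-in for the REX expression (an assumption, NOT the
# regex from the paper). Alternatives are tried left to right.
SHALLOW_RE = re.compile(
    r"<!--.*?-->"            # comment
    r"|<!\[CDATA\[.*?\]\]>"  # CDATA section
    r"|<\?.*?\?>"            # processing instruction
    r"|<[^>]*>"              # any other tag or declaration
    r"|[^<]+",               # run of character data
    re.DOTALL,
)


def shallow_parse(text):
    """Split text into a list of markup and text items, in document order."""
    return SHALLOW_RE.findall(text)


if __name__ == "__main__":
    print(shallow_parse("<p>This is <b>shallow parsing</b>.</p>"))
```

For well-formed input, joining the returned items gives back the
original string, which is what makes this kind of parse a convenient
base for small filtering and rewriting tools.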

As a simple example,

  >>> import REX
  >>> print REX.ShallowParse('This is shallow parsing.')
  ['', 'This is ', '', 'shallow parsing', '', '.', '']

As a further example of what can be done, I made a variation on the
regex that adds named groups, e.g., (?P<name>...), to each of the
components (see ftp://starship.python.net/pub/crew/dni/REX/REX_detail.py ).
The groups whose values are not None show exactly which parts of the
regex contributed to the match, effectively categorizing each part of
the document. Prettyprinted output for the string
'my &first; shallow parse' looks like

  D:\python> REX_detail.py
  MarkupSPE: ''
  ElemTagCE: 'tag1 att="123" att2="456">'
  Name: 'att2'
  NameStrt: 'a'
  NameChar: '2'
  AttValSE: '"456"'
  TextSE: 'my &first; '
  MarkupSPE: ''
  ElemTagCE: 'i>'
  TextSE: 'shallow parse'
  MarkupSPE: ''
  EndTagCE: 'i>'
  MarkupSPE: ''
  EndTagCE: 'tag1>'

That's more detail than you need, but you can see where further
processing (for attributes, entities) would start.

--David Niergarth
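
P.S. A minimal sketch of the named-group idea, assuming a much smaller
pattern than REX. The group names (comment, end_tag, start_tag, text)
and the function classify are hypothetical and are not the names used
in REX_detail.py. Each top-level alternative is wrapped in one named
group, so the match object's lastgroup attribute reports which
alternative matched.

```python
import re

# Hypothetical group names for illustration; REX_detail.py uses its own
# names (MarkupSPE, ElemTagCE, ...) and a far more detailed pattern.
CLASSIFY_RE = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<end_tag></[^>]*>)"
    r"|(?P<start_tag><[^>]*>)"
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)


def classify(text):
    """Return (group_name, matched_text) pairs in document order."""
    return [(m.lastgroup, m.group()) for m in CLASSIFY_RE.finditer(text)]


if __name__ == "__main__":
    for name, item in classify("<i>shallow</i> parse"):
        print("%s: %r" % (name, item))
```

With nested named groups, as in the detailed variant described above,
one would inspect m.groupdict() for every name that is not None rather
than rely on lastgroup alone.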