lxml 

Snippets

Getting all text from inside an element

From: ElementTree: Bits and Pieces

The text attribute contains the text immediately inside an element, but it does not include text inside subelements.  To get all text, you can use something like:

def gettext(elem):
    text = elem.text or ""
    for e in elem:
        text += gettext(e)
        if e.tail:
            text += e.tail
    return text
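
As a quick check, the sketch below runs the function on a small made-up document and compares the result with the itertext() method that ElementTree 1.3 and later provide. It assumes the standard library's xml.etree.ElementTree module; the sample markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

def gettext(elem):
    # Text directly inside elem, then each child's full text and its tail.
    text = elem.text or ""
    for e in elem:
        text += gettext(e)
        if e.tail:
            text += e.tail
    return text

# Hypothetical sample document with nested inline elements.
root = ET.fromstring("<p>Hello <b>bold <i>and italic</i></b> world</p>")

direct = root.text                    # only the text before <b>
everything = gettext(root)            # all text, in document order
builtin = "".join(root.itertext())    # same result via the built-in iterator

print(repr(direct))
print(repr(everything))
print(repr(builtin))
```

If itertext() is available, joining its output is usually the shorter way to get the same string.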

Removing elements

From: ElementTree: Bits and Pieces

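If you're using ElementTree 1.3 or later (the version bundled with Python 2.7 and 3.2 onward), the serialization code leaves out the tags for elements that have their tag attribute set to None. The element's text, subelements, and tail are still written.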
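
A minimal sketch of this behaviour, assuming the standard library's xml.etree.ElementTree module (the sample markup is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
root = ET.fromstring("<p>Hello <b>bold</b> world</p>")

# Mark the <b> element so that the serializer skips its tags.
root.find("b").tag = None

result = ET.tostring(root, encoding="unicode")
print(result)  # <p>Hello bold world</p>
```

The element is still present in the tree at this point; only the serialized output omits its start and end tags.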

To remove an element from a tree while keeping what is inside it, you have to replace the element with its contents.  This includes not only the subelements, but also the text and tail attributes.

The following function takes a tree and a filter function, and removes all subelements for which the filter returns false.

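def cleanup(elem, filter):
    # Rebuild elem's child list, splicing in the contents of rejected children.
    out = []
    for e in elem:
        cleanup(e, filter)
        if not filter(e):
            # text and tail may be None, so fall back to "" before appending.
            if e.text:
                if out:
                    out[-1].tail = (out[-1].tail or "") + e.text
                else:
                    elem.text = (elem.text or "") + e.text
            # Hoist the rejected element's children into its place.
            out.extend(e)
            if e.tail:
                if out:
                    out[-1].tail = (out[-1].tail or "") + e.tail
                else:
                    elem.text = (elem.text or "") + e.tail
        else:
            out.append(e)
    elem[:] = out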

Note that the top element itself isn’t checked; if you need to remove that, you have to do that at the application level.

Instead of writing a filter function, you can iterate over the tree and set the tag to None for the elements you want to remove.  When you’ve checked all elements, call the cleanup function as follows:

cleanup(elem, lambda e: e.tag)
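
The whole workflow can be sketched as follows, assuming the standard library's xml.etree.ElementTree module. The sample markup and the choice of span and i as the tags to drop are made up for illustration. The cleanup function is repeated, with None-safe handling of text and tail, so that the example runs on its own.

```python
import xml.etree.ElementTree as ET

def cleanup(elem, filter):
    # Replace every child rejected by the filter with its own contents.
    out = []
    for e in elem:
        cleanup(e, filter)
        if not filter(e):
            if e.text:
                if out:
                    out[-1].tail = (out[-1].tail or "") + e.text
                else:
                    elem.text = (elem.text or "") + e.text
            out.extend(e)
            if e.tail:
                if out:
                    out[-1].tail = (out[-1].tail or "") + e.tail
                else:
                    elem.text = (elem.text or "") + e.tail
        else:
            out.append(e)
    elem[:] = out

# Hypothetical sample document.
root = ET.fromstring(
    "<div>One <span>two <b>three</b></span> four <i>five</i></div>"
)

# Mark the unwanted elements by clearing their tags.
for e in root.iter():
    if e.tag in ("span", "i"):
        e.tag = None

# Elements with a None tag are falsy under the filter and get spliced out.
cleanup(root, lambda e: e.tag)

result = ET.tostring(root, encoding="unicode")
print(result)  # <div>One two <b>three</b> four five</div>
```

The b element survives and moves up one level, while the text and tail of the removed elements are merged into the neighbouring text.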