Parse nested HTML lists using lxml in Python

Question

I am trying to parse the elements of an html list which looks like this:

<ol>
    <li>r1</li>
    <li>r2
        <ul>
            <li>n1</li>
            <li>n2</li>
        </ul>
    </li>
    <li>r3
        <ul>
            <li>d1
                <ol>
                    <li>e1</li>
                    <li>e2</li>
                </ol>
            </li>
            <li>d2</li>
        </ul>
    </li>
    <li>r4</li>
</ol>

I can parse most of this, but my biggest problem is getting each element's own text back. Unfortunately, lxml's node.text_content() returns the text of the complete tree under the node. Can I obtain the text content of just that element using lxml, or would I need string manipulation or a regex for that?

For example, the node with d1 returns "d1e1e2", whereas I want it to return just d1.

Answer 1

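Each node has an attribute called text. That is what you are looking for. It holds only the text between the element's opening tag and its first child element, so for the d1 node it gives d1 (plus any surrounding whitespace) instead of d1e1e2. It is None when the element has no leading text.

For example: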

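# root is the already parsed tree, e.g. the result of lxml.html.fromstring(...)
for node in root.iter("*"):
    print(node.text)  # text before the first child element, or None
    # print(node.tail)  # text after the closing tag, e.g.: <div> <span> abc </span> def </div> => abc def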
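
To tie this back to the list in the question, here is a minimal sketch that collects only each li element's own text. The helper name own_text and the SNIPPET variable are made up for illustration and are not part of lxml. It assumes lxml is installed and that the "own text" you want is node.text plus the tail of each direct child, with surrounding whitespace stripped.

```python
from lxml import html

# The list from the question, used here as sample input.
SNIPPET = """
<ol>
    <li>r1</li>
    <li>r2
        <ul>
            <li>n1</li>
            <li>n2</li>
        </ul>
    </li>
    <li>r3
        <ul>
            <li>d1
                <ol>
                    <li>e1</li>
                    <li>e2</li>
                </ol>
            </li>
            <li>d2</li>
        </ul>
    </li>
    <li>r4</li>
</ol>
"""


def own_text(node):
    """Hypothetical helper: text directly inside node, without descendants' text."""
    # node.text is the text before the first child (None if there is none).
    parts = [node.text or ""]
    # Each child's tail is text that follows the child but still belongs to node.
    for child in node:
        parts.append(child.tail or "")
    return "".join(parts).strip()


root = html.fromstring(SNIPPET)
# iter("li") walks all li elements in document order, at any nesting depth.
labels = [own_text(li) for li in root.iter("li")]
print(labels)
```

With this approach the d1 item yields just d1, because the text of the nested ol lives in the child elements and is never read. If you prefer XPath, node.xpath("text()") returns the same direct text pieces as a list, which you would still need to join and strip yourself.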

Answer 1: 2 ACCEPTED 2012-11-08 00:57:07
