
Parse nested HTML lists using lxml in Python

I am trying to parse the elements of an HTML list that looks like this:

<ol>
    <li>r1</li>
    <li>r2
        <ul>
            <li>n1</li>
            <li>n2</li>
        </ul>
    </li>
    <li>r3
        <ul>
            <li>d1
                <ol>
                    <li>e1</li>
                    <li>e2</li>
                </ol>
            </li>
            <li>d2</li>
        </ul>
    </li>
    <li>r4</li>
</ol>

I am fine with parsing this for the most part, but my biggest problem is getting each element's own text back. Unfortunately, lxml's node.text_content() returns the text of the complete subtree under the node. Can I obtain the text content of just that element using lxml, or do I need string manipulation or a regex for that?

For example, the node with d1 returns "d1e1e2", whereas I want it to return just d1.

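Each node has an attribute called text, which holds only the text that appears before the node's first child element. That is what you are looking for. Note that it can include surrounding whitespace, and it is None when the element has no leading text.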

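For example:

# root is the parsed tree, e.g. root = lxml.html.fromstring(markup)
for node in root.iter("*"):
    print(node.text)
    # node.tail is the text that follows the element's closing tag,
    # e.g. in <div><span>abc</span> def</div>, span.text is "abc" and span.tail is " def"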
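To make this concrete for the list in the question, here is a minimal sketch, assuming the markup is parsed with lxml.html.fromstring. The names SAMPLE and own_text are hypothetical helpers introduced only for this illustration; they are not part of lxml.

```python
from lxml import html

# The list from the question, used as sample input (SAMPLE is an illustrative name).
SAMPLE = """
<ol>
    <li>r1</li>
    <li>r2
        <ul>
            <li>n1</li>
            <li>n2</li>
        </ul>
    </li>
    <li>r3
        <ul>
            <li>d1
                <ol>
                    <li>e1</li>
                    <li>e2</li>
                </ol>
            </li>
            <li>d2</li>
        </ul>
    </li>
    <li>r4</li>
</ol>
"""


def own_text(node):
    """Return the text before the node's first child, without surrounding whitespace.

    node.text is None when there is no such text, so fall back to "".
    """
    return (node.text or "").strip()


root = html.fromstring(SAMPLE)

# iter("li") walks every <li> in document order, nested ones included.
labels = [own_text(li) for li in root.iter("li")]
print(labels)
# ['r1', 'r2', 'n1', 'n2', 'r3', 'd1', 'e1', 'e2', 'd2', 'r4']
```

The strip() call is there because node.text for an item such as d1 also carries the newline and indentation that precede the nested list. Text that follows a nested list inside the same item would live in the nested element's tail, not in the item's text, so this sketch would not pick it up.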
