Get sub-elements with XPath in lxml.html (Python)

I am trying to get sub-elements with lxml.html; the code is below.

import lxml.html as LH

html = """
<ul class="news-list2">
            <li>
            <div class="txt-box">
            <p class="info">Number:<label>cewoilgas</label></p>
            </div>
            </li>

            <li>
            <div class="txt-box">
            <p class="info">Number:<label>NHYQZX</label>
            </p>
            </div>
            </li>

        <li>
            <div class="txt-box">
            <p class="info">Number:<label>energyinfo</label>
            </p>
            </div>
            </li>

        <li>
            <div class="txt-box">
            <p class="info">Number:<label>calgary_information</label>
            </p>
            </div>
            </li>

        <li>
            <div class="txt-box">
            <p class="info">Number:<label>oilgas_pro</label>
            </p>
            </div>
            </li>

</ul>
"""

To get the sub-element in each li:

htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    # the path starts with "//", so it is evaluated from the document root
    print(li.xpath("//p/label/text()"))

I am curious why the output is:

['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']

I also found that this works:

htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    # the leading dot makes the path relative to the current li
    print(li.xpath(".//p/label/text()"))

The result is:

['cewoilgas']
['NHYQZX']
['energyinfo']
['calgary_information']
['oilgas_pro']

Should this be regarded as a bug in lxml? Why does the XPath still match across the whole document (from the root ul) when it is called on a sub-element (li)?

No, this is not a bug; it is intended behavior. If an expression starts with //, it does not matter whether you call it on the root of the tree or on any other element: the expression is absolute and is evaluated from the document root.

Just remember: if you call xpath() on an element and want the expression evaluated relative to that element, always start it with a dot, which refers to the current node.
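To make the difference concrete, here is a minimal sketch that runs both forms against each li and collects the results side by side. The two-item markup and the variable names (doc, items, absolute, relative) are made up for illustration and are not part of the question.

```python
import lxml.html as LH

# Hypothetical, shortened version of the markup from the question.
doc = LH.fromstring("""
<ul>
  <li><p>Number:<label>first</label></p></li>
  <li><p>Number:<label>second</label></p></li>
</ul>
""")

items = doc.xpath("//ul/li")

# "//..." is absolute: every li sees all labels in the document.
absolute = [li.xpath("//p/label/text()") for li in items]

# ".//..." is relative: each li sees only its own descendants.
relative = [li.xpath(".//p/label/text()") for li in items]

print(absolute)  # [['first', 'second'], ['first', 'second']]
print(relative)  # [['first'], ['second']]
```

The only difference between the two list comprehensions is the leading dot.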

By the way, absolutely (pun intended) the same thing happens in selenium with its find_element(s)_by_xpath() methods.

//para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node

//olist/item selects all the item elements in the same document as the context node that have an olist parent

. selects the context node

.//para selects the para element descendants of the context node
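The four expressions above can be tried directly in lxml. This is only a sketch: the doc/chapter/para/olist/item document is invented for the example, and the first chapter element plays the role of the context node.

```python
from lxml import etree

# Hypothetical document using the element names from the examples above.
root = etree.fromstring(
    "<doc>"
    "<chapter><para>a</para><olist><item>i1</item></olist></chapter>"
    "<chapter><para>b</para><para>c</para></chapter>"
    "</doc>"
)
context_node = root[0]  # the first <chapter>

# //para: all para elements in the document, regardless of the context node
all_paras = context_node.xpath("//para/text()")

# //olist/item: all item elements in the document that have an olist parent
all_items = context_node.xpath("//olist/item/text()")

# . : the context node itself
self_nodes = context_node.xpath(".")

# .//para: only the para descendants of the context node
own_paras = context_node.xpath(".//para/text()")

print(all_paras)   # ['a', 'b', 'c']
print(all_items)   # ['i1']
print([n.tag for n in self_nodes])  # ['chapter']
print(own_paras)   # ['a']
```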

You can find more examples in the XML Path Language (XPath) specification.
