Extract specific element from nested elements using lxml html

Hi all, I am having some problems that I think come down to XPath. I am using the html module from the lxml package to try and get at some data. Below is the most simplified version of the situation, but keep in mind that the HTML I am working with is much uglier.

<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>

What I really want is the most deeply nested table, because it has the header text "Header1". I am trying it like so:

from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))

but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on, but I am having a hard time figuring out how to do this short of breaking out some nasty regex. Any thoughts?

Use:

//td[. = 'Header1']/ancestor::table[1]

Find the header you are interested in and then pull out its table.

//u[b = 'Header1']/ancestor::table[1]

or

//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]
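As a quick check, the two expressions above can be run against the sample markup from the question. This is only a sketch: the names PAGE, expr_u, expr_td, tables_u and tables_td are made up for illustration and are not part of the original answer.

```python
from lxml import html

# The sample markup from the question.
PAGE = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

# Both expressions locate the header first, then climb to the nearest table.
expr_u = "//u[b = 'Header1']/ancestor::table[1]"
expr_td = "//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]"

tables_u = tree.xpath(expr_u)
tables_td = tree.xpath(expr_td)

for tables in (tables_u, tables_td):
    # Print the number of matches and the number of tables nested in the match.
    print(len(tables), len(tables[0].xpath(".//table")))
```

On this markup each query should return exactly one table, and that table has no further tables nested inside it.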

Note that a path beginning with // always starts at the document root (!), even inside a predicate. You can't do:

//table[//*[contains(text(), "Header1")]]

and expect the inner predicate ( //*… ) to magically start at the right context. Use .// to start at the context node. Even then, this:

//table[.//*[contains(text(), "Header1")]]

won't work, since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not() like I did to make sure no other tables are nested.
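The difference between the three predicates can be seen by counting matches on the question's sample markup. This is a sketch for illustration only; the variable names (PAGE, absolute, relative, innermost) are assumptions of this example, and the third query simply adds a not(.//table) test to the second one.

```python
from lxml import html

# The sample markup from the question: three tables nested inside each other.
PAGE = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

# Inner // restarts at the document root, so the predicate is true for every table.
absolute = tree.xpath('//table[//*[contains(text(), "Header1")]]')

# Inner .// is relative, but every table still has the header somewhere below it.
relative = tree.xpath('//table[.//*[contains(text(), "Header1")]]')

# Excluding tables that contain other tables leaves only the innermost one.
innermost = tree.xpath(
    '//table[not(.//table) and .//*[contains(text(), "Header1")]]'
)

print(len(absolute), len(relative), len(innermost))  # 3 3 1
```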

Also, don't test the condition on every node ( .//* ), since it can't be true for every node to begin with. It's more efficient to be specific.

Perhaps this would work for you:

tree.xpath("//table[not(descendant::table) and contains(., 'Header1')]")

The not(descendant::table) bit ensures that you're getting the innermost table.
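To see why the text test is still needed next to not(descendant::table), here is a small sketch on hypothetical markup with two innermost tables. The markup and the names PAGE, leaf_tables and wanted are invented for this example. The text test is placed in the same predicate as the not(), so the table itself is returned rather than one of its rows.

```python
from lxml import html

# Hypothetical markup (not from the question): two leaf tables in one outer table.
PAGE = """
<table>
  <tr><td>
    <table><tr><td><b>Header1</b></td></tr><tr><td>Data1</td></tr></table>
  </td></tr>
  <tr><td>
    <table><tr><td><b>Header2</b></td></tr><tr><td>Data2</td></tr></table>
  </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

# Every table that has no table nested inside it.
leaf_tables = tree.xpath("//table[not(descendant::table)]")

# Of those, only the one whose string value contains the header text.
wanted = tree.xpath(
    "//table[not(descendant::table) and contains(., 'Header1')]"
)

print(len(leaf_tables), len(wanted))  # 2 1
print(html.tostring(wanted[0], encoding="unicode"))
```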

table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
  • //*[.="Header1"] selects elements anywhere in the document whose string value is Header1 .
  • ancestor::table[1] selects the nearest ancestor of that element that is a table .
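Comparing against . (the string value) and against text() (the direct text nodes) selects different elements, even though both lead to the same table here. The sketch below lists what each form matches on the question's markup; the names PAGE, by_string_value, by_text_node and tables are assumptions of this example.

```python
from lxml import html

# The sample markup from the question.
PAGE = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

# String value: every element whose concatenated descendant text is exactly Header1.
by_string_value = [el.tag for el in tree.xpath('//*[.="Header1"]')]

# Text nodes: only elements that have Header1 as a direct text child.
by_text_node = [el.tag for el in tree.xpath('//*[text()="Header1"]')]

print(by_string_value)
print(by_text_node)

# All matches share one nearest table, and XPath node-sets hold no duplicates,
# so the single-element unpacking "table, = ..." succeeds.
tables = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(len(tables))
```

The string-value form matches the wrapping row, cell, u and b elements, while the text() form matches only b.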

Complete example

#!/usr/bin/env python
from lxml import html

page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(html.tostring(table, encoding="unicode"))
