Find text using lxml etree

I'm trying to get the text from one tag using lxml etree.

<div class="litem__type">
            <div>
                Robbp
            </div>


                    <div>Estimation</div>

                 +487 (0)639 14485653


                                •
                                <a href="mailto:herbrich@gmail.com">
                                    Email Address
                                </a>



                    •
                    <a class="external" href="http://www.google.com">
                        Homepage
                    </a>


        </div>

The problem is that I can't locate it reliably, because snippets of this kind differ a lot from one another. In some cases the first and second div are not there at all. As you can see, the telephone number is not in its own div.

I suppose it would be possible to extract the telephone number using BeautifulSoup's contents, but I'm trying to use the lxml module's xpath.

Do you have any ideas? (The email is sometimes not there.)

EDIT: The best idea is probably to use regex, but I don't know how to tell it to extract just the text between two <div></div>.

You should avoid using regex to parse XML/HTML wherever possible, because regular expressions are fragile against nested or irregular markup and are not as reliable as using an element tree.

The text after element A's closing tag, but before element B's opening tag, is called element A's tail text. To select this tail text using lxml etree you could do the following:

from lxml import etree

content = '''
<div class="litem__type">
    <div>Robbp</div>
    <div>Estimation</div>
    +487 (0)639 14485653
    <a href="mailto:herbrich@gmail.com">Email Address</a>
    <a class="external" href="http://www.google.com">Homepage</a>
</div>'''

# etree.XML returns the root element, here the outer <div>
root = etree.XML(content)
# The phone number is the tail text of the second child <div>
phone_number = root.xpath('div[2]')[0].tail.strip()
print(phone_number)

Output

+487 (0)639 14485653

The strip() function is used here to remove whitespace on either side of the tail text.
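The question mentions that the first and second div are sometimes missing, in which case there is no div[2] whose tail could be read. One possible way around this, shown here only as a sketch, is to ask XPath for the container's direct text nodes with text(). The trimmed-down content below is a hypothetical variant of the snippet, not taken from the real page.

```python
from lxml import etree

# Hypothetical variant of the snippet: the two leading <div> elements are absent.
content = '''
<div class="litem__type">
    +487 (0)639 14485653
    <a href="mailto:herbrich@gmail.com">Email Address</a>
    <a class="external" href="http://www.google.com">Homepage</a>
</div>'''

root = etree.XML(content)

# text() selects the text nodes that are direct children of the context node:
# the container's own .text plus the .tail of each child element.
direct_text = [t.strip() for t in root.xpath('text()') if t.strip()]
print(direct_text)  # ['+487 (0)639 14485653']
```

On the full snippet from the question the list would also contain the "•" separators, so some filtering would still be needed.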

You can iterate over the div elements and read the text that follows each one.

from lxml import etree

tree = etree.parse("filename.xml")
for node in tree.xpath('//div'):
    # tail is None when no text follows the element
    if node.tail and node.tail.strip():
        # you can check here if it is a phone number
        print(node.tail.strip())
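The "check here if it is a phone number" step could look roughly like the sketch below. Note two assumptions: PHONE_RE is a hypothetical pattern guessed from the single number in the question, and find_phone_numbers is an illustrative helper that walks every text node rather than only the div tails, so that it also covers snippets where the leading divs are missing.

```python
import re
from lxml import etree

# Hypothetical pattern: an optional "+", a digit, at least six characters drawn
# from digits, whitespace, parentheses and hyphens, then a closing digit.
PHONE_RE = re.compile(r'\+?\d[\d\s()\-]{6,}\d')

def find_phone_numbers(root):
    """Return the phone-like substrings found in the text nodes under root."""
    found = []
    # .//text() yields every .text and .tail string below root
    for text in root.xpath('.//text()'):
        match = PHONE_RE.search(text)
        if match:
            found.append(match.group())
    return found

content = '''
<div class="litem__type">
    <div>Robbp</div>
    <div>Estimation</div>
    +487 (0)639 14485653
    •
    <a href="mailto:herbrich@gmail.com">Email Address</a>
</div>'''

print(find_phone_numbers(etree.XML(content)))  # ['+487 (0)639 14485653']
```

Using search() rather than a full match matters here, because the number and the "•" separator sit in the same tail text node.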
