Find text using lxml etree
I'm trying to get the text from one tag using lxml etree.
<div class="litem__type">
<div>
Robbp
</div>
<div>Estimation</div>
+487 (0)639 14485653
•
<a href="mailto:herbrich@gmail.com">
Email Address
</a>
•
<a class="external" href="http://www.google.com">
Homepage
</a>
</div>
The problem is that I can't locate it, because snippets of this kind differ a lot from one another. In some cases the first and second div are not there at all. As you can see, the telephone number is not in its own div.
I suppose it would be possible to extract the telephone number using BeautifulSoup's contents, but I'm trying to use the lxml module's xpath.
Do you have any ideas? (The email is sometimes not there.)
EDIT: The best idea is probably to use regex, but I don't know how to tell it to extract just the text between two <div></div> elements.
You should avoid using regex to parse XML/HTML wherever possible, because it is less robust and not as efficient as working with element trees.
The text after element A's closing tag, but before element B's opening tag, is called element A's tail text. To select this tail text using lxml etree you could do the following:
content = '''
<div class="litem__type">
<div>Robbp</div>
<div>Estimation</div>
+487 (0)639 14485653
<a href="mailto:herbrich@gmail.com">Email Address</a>
<a class="external" href="http://www.google.com">Homepage</a>
</div>'''
from lxml import etree
tree = etree.XML(content)
phone_number = tree.xpath('div[2]')[0].tail.strip()
print(phone_number)
Output
+487 (0)639 14485653
The strip() method is used here to remove whitespace on either side of the tail text.
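To make the difference between an element's text and its tail concrete, here is a minimal sketch. The one-line markup is a shortened, hypothetical variant of the snippet from the question, used only for illustration.

```python
from lxml import etree

# Shortened, hypothetical variant of the question's markup (illustration only).
snippet = (
    '<div class="litem__type">'
    '<div>Robbp</div>'
    '+487 (0)639 14485653'
    '<a class="external" href="http://www.google.com">Homepage</a>'
    '</div>'
)

root = etree.XML(snippet)
first_div = root[0]      # the inner <div>Robbp</div>

print(first_div.text)    # text inside the element: Robbp
print(first_div.tail)    # text after its closing tag: +487 (0)639 14485653
print(root.text)         # None, because nothing precedes the first child
```

Note that tail can also be None when no text follows an element, so it is worth checking before calling strip() on it.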
You can iterate over the div elements and get the text after each div tag.
from lxml import etree
tree = etree.parse("filename.xml")
items = tree.xpath('//div')
for node in items:
    # tail is None when no text follows the element
    if node.tail and node.tail.strip():
        # you can check here if it is a phone number
        print(node.tail.strip())
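Since the question mentions that the inner div elements are sometimes missing, one possible way to handle both layouts is to look at every direct text node of the outer div and keep the one that looks like a phone number. The sketch below is only an illustration under assumptions: find_phone is a hypothetical helper name, and the PHONE_RE pattern is a guess at the phone format based on the single example in the question.

```python
import re
from lxml import etree

# Assumed phone shape: optional "+", then digits mixed with spaces,
# parentheses, or hyphens, starting and ending with a digit.
PHONE_RE = re.compile(r'\+?\d[\d ()\-]{6,}\d')

def find_phone(markup):
    """Return the first phone-like string among the outer element's text nodes."""
    root = etree.XML(markup)
    # text() selects the text nodes that are direct children of the root,
    # i.e. root.text plus the tail of each child element.
    for chunk in root.xpath('text()'):
        match = PHONE_RE.search(chunk)
        if match:
            return match.group(0)
    return None

with_divs = '''<div class="litem__type">
<div>Robbp</div>
<div>Estimation</div>
+487 (0)639 14485653
•
<a href="mailto:herbrich@gmail.com">Email Address</a>
</div>'''

without_divs = '''<div class="litem__type">
+487 (0)639 14485653
<a class="external" href="http://www.google.com">Homepage</a>
</div>'''

print(find_phone(with_divs))     # +487 (0)639 14485653
print(find_phone(without_divs))  # +487 (0)639 14485653
```

The regex here is applied only to text that the parser has already isolated, not to the raw markup, so it stays in line with the advice to avoid parsing HTML with regex.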