Parsing text after bold tag using XPath
I am using XPath in Python to extract text. The text is structured as follows:
<b>Field1:</b>" Value1" <br>
<b>Field2:</b>" Value2" <br><br>
<b>Field3:</b>" Value3" <br><br>
<b>Field4:</b>" Value4" <br>
<b>Field5:</b>" Value5" <br><br>
Note that the number of line breaks (br tags) can be inconsistent.
I want to extract:
Field 1: Value 1
Field 2: Value 2
Field 3: Value 3
Field 4: Value 4
Field 5: Value 5
Currently, my XPath //b/text() extracts the field names, not the values.
Please help.
You can solve it with the BeautifulSoup HTML parser and its .next_sibling:
from bs4 import BeautifulSoup
data = """
<div>
<b>Field1:</b>" Value1" <br>
<b>Field2:</b>" Value2" <br><br>
<b>Field3:</b>" Value3" <br><br>
<b>Field4:</b>" Value4" <br>
<b>Field5:</b>" Value5" <br><br>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
for b in soup.find_all("b"):
    label = b.get_text(strip=True)
    # The text node right after <b>; also trim the literal quotes around the value
    value = b.next_sibling.strip('" \n')
    print(label, value)
Alternatively, use lxml.html and the following-sibling axis:
from lxml.html import fromstring
data = """
<div>
<b>Field1:</b>" Value1" <br>
<b>Field2:</b>" Value2" <br><br>
<b>Field3:</b>" Value3" <br><br>
<b>Field4:</b>" Value4" <br>
<b>Field5:</b>" Value5" <br><br>
</div>
"""
root = fromstring(data)
for b in root.xpath("//b"):
    label = b.text_content()
    # First text node among the following siblings; trim the literal quotes too
    value = b.xpath("following-sibling::text()")[0].strip('" \n')
    print(label, value)
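Since the question asks for an XPath, the same idea can probably be written as a single expression by adding a positional predicate, so that only the first text node after each b is selected. This is a minimal sketch, assuming lxml is installed and that every b is followed by some text; the names labels, values and pairs are only for illustration.

```python
from lxml.html import fromstring

data = """
<div>
<b>Field1:</b>" Value1" <br>
<b>Field2:</b>" Value2" <br><br>
<b>Field3:</b>" Value3" <br><br>
</div>
"""

root = fromstring(data)

# Field names: the text inside each <b>.
labels = root.xpath("//b/text()")
# Values: for each <b>, only the first text node that follows it as a sibling.
values = root.xpath("//b/following-sibling::text()[1]")

# Pair them up, trimming whitespace and the literal quotes around each value.
pairs = [(label.strip(), value.strip('" \n')) for label, value in zip(labels, values)]
for label, value in pairs:
    print(label, value)
```

Note that zip pairs the two lists purely by position, so if some b had no text after it, the pairs would go out of step; looping over each b, as in the answer above, avoids that.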
Assuming you are using lxml, you can use the tail attribute to get the text that follows an element:
>>> import lxml.html
>>>
>>> root = lxml.html.fromstring('''
... <html>
... <body>
... <b>Field1:</b>" Value1" <br>
... <b>Field2:</b>" Value2" <br><br>
... <b>Field3:</b>" Value3" <br><br>
... <b>Field4:</b>" Value4" <br>
... <b>Field5:</b>" Value5" <br><br>
... </body>
... </html>
... ''')
>>> for b in root.xpath('//b'):
...     print('{} {}'.format(b.text, b.tail.strip('" ')))  # <---
...
Field1: Value1
Field2: Value2
Field3: Value3
Field4: Value4
Field5: Value5
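One thing to watch with tail: it is None when no text follows the element, so calling strip on it directly would raise an AttributeError. A small sketch of a guard, assuming you want a dict of field names to values; extract_fields is a hypothetical helper name, not part of lxml.

```python
import lxml.html


def extract_fields(html):
    """Map each <b> label to the text that directly follows it (hypothetical helper)."""
    root = lxml.html.fromstring(html)
    fields = {}
    for b in root.xpath("//b"):
        # Drop surrounding whitespace and the trailing colon from the label.
        label = (b.text or "").strip().rstrip(":")
        # .tail is None when no text follows the element, so fall back to "".
        value = (b.tail or "").strip('" \n')
        fields[label] = value
    return fields


# The second field deliberately has no text after its <b> tag.
result = extract_fields('<div><b>Field1:</b>" Value1" <br><b>Field2:</b><br></div>')
print(result)
```

Fields with nothing after them come back as empty strings here; whether to skip them instead is a design choice.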