Python BeautifulSoup extract text from SPAN and A tags
I want to extract the text from the SPAN and A tags and put it into lists following this pattern:
['Farina', '500 g']['Uova', '1']['Sale', '100 g']
This is my scraping code with BeautifulSoup:
from bs4 import BeautifulSoup
import re
import string
markup = """
<dd class="ingredient">
<a href="#">Farina</a>
<span>500 g</span>
</dd>
<dd class="ingredient">
<a href="#">Uova</a>
<span>1</span>
</dd>
<dd class="ingredient">
<a href="#">Sale</a>
<span>100 g</span>
</dd>
"""
soup = BeautifulSoup(markup, 'html.parser')
allIngredients = []
for tag in soup.find_all(attrs={'class' : 'ingredient'}):
    #[tag.text for tag in tags]
    link = tag.a.get('href')
    nameIngredient = tag.a.string
    contents = tag.span.text
    quantityIngredient = re.sub(r"\s+", " ", contents).strip()
    allIngredients.append([nameIngredient, quantityIngredient])
print(allIngredients)
Sometimes the SPAN can be empty or missing entirely.
Here is a solution that uses lxml (instead of bs4):
from lxml import html
markup = """
<dd class="ingredient">
<a href="#">Farina</a>
<span>500 g</span>
</dd>
<dd class="ingredient">
<a href="#">Uova</a>
<span>1</span>
</dd>
<dd class="ingredient">
<a href="#">Sale</a>
<span>100 g</span>
</dd>
<dd class="ingredient">
<a href="#">Vino</a>
</dd>
"""
root = html.fromstring(markup)
result = []
for node in root.xpath(".//dd"):
    a = node.xpath(".//a")
    span = node.xpath(".//span")
    result.append((
        a[0].text_content() if a else None,
        span[0].text_content() if span else None
    ))
print(result)
# [('Farina', '500 g'), ('Uova', '1'), ('Sale', '100 g'), ('Vino', None)]
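If you would rather stay with BeautifulSoup, the same guard against an empty or missing SPAN can probably be written as below. This is only a minimal sketch: the helper name extract_ingredients and the shortened sample markup (with one empty SPAN and one missing SPAN) are my own assumptions for illustration, not part of the question. It treats both cases the same way and stores None for the quantity.

```python
from bs4 import BeautifulSoup

# Sample markup (assumed for illustration): one normal entry,
# one with an empty <span>, one with no <span> at all.
markup = """
<dd class="ingredient">
<a href="#">Farina</a>
<span>500 g</span>
</dd>
<dd class="ingredient">
<a href="#">Uova</a>
<span></span>
</dd>
<dd class="ingredient">
<a href="#">Vino</a>
</dd>
"""

def extract_ingredients(html_text):
    """Return [name, quantity] pairs; quantity is None when <span> is empty or absent."""
    soup = BeautifulSoup(html_text, "html.parser")
    pairs = []
    for dd in soup.find_all("dd", class_="ingredient"):
        link = dd.find("a")      # None if the tag is missing
        span = dd.find("span")   # None if the tag is missing
        name = link.get_text(strip=True) if link else None
        # Collapse runs of whitespace; a missing span yields an empty string here
        quantity = " ".join(span.get_text().split()) if span else ""
        # Map the empty string to None so both cases look the same to the caller
        pairs.append([name, quantity or None])
    return pairs

ingredients = extract_ingredients(markup)
print(ingredients)
# [['Farina', '500 g'], ['Uova', None], ['Vino', None]]
```

Using find() and checking the result for None avoids the AttributeError that tag.span.text raises when the SPAN does not exist.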