Extracting a field from HTML with BeautifulSoup4
I'm using BeautifulSoup4 for the first time and am stuck on what must be a simple problem. I have an element tag that looks like this:
<td class="stage" data-value="phase3">\n
\n Phase 3\n<svg height="5" viewbox="1 1 95 5" width="95"
xmlns="http://www.w3.org/2000/svg">\n<g fill="none" transform="translate(1 1
)">\n<rect fill="#911C36" height="5" rx="2" width="15"></rect>\n<rect
fill="#D6A960" height="5" rx="2" width="15" x="16"></rect>\n<rect
fill="#E7DE6F" height="5" rx="2" width="15" x="32"></rect>\n<rect fill="#ddd"
height="5" rx="2" width="15" x="48"></rect>\n<rect fill="#ddd" height="5"
rx="2" width="15" x="64"></rect>\n<rect fill="#ddd" height="5" rx="2"
width="15" x="80"></rect>\n</g>\n</svg> </td>
I want to extract the value "phase3" from the "data-value" attribute, plus the list of fill colors, e.g.
[ "#911C36", "#D6A960", ... ]
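What is the right query for this?
The BeautifulSoup documentation states that passing True as an attribute filter matches every tag that has that attribute, whatever its value. Something like this should work: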
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<td class="stage" data-value="phase3">\n
\n Phase 3\n<svg height="5" viewbox="1 1 95 5" width="95"
xmlns="http://www.w3.org/2000/svg">\n<g fill="none" transform="translate(1 1
)">\n<rect fill="#911C36" height="5" rx="2" width="15"></rect>\n<rect
fill="#D6A960" height="5" rx="2" width="15" x="16"></rect>\n<rect
fill="#E7DE6F" height="5" rx="2" width="15" x="32"></rect>\n<rect fill="#ddd"
height="5" rx="2" width="15" x="48"></rect>\n<rect fill="#ddd" height="5"
rx="2" width="15" x="64"></rect>\n<rect fill="#ddd" height="5" rx="2"
width="15" x="80"></rect>\n</g>\n</svg> </td>
""", "html.parser")
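# True as the attribute value matches any tag that carries that attribute.
# find_all is the current name; findAll is the older alias for the same method.
colors = [x["fill"] for x in soup.find_all("rect", {"fill": True})]
data_vals = [x["data-value"] for x in soup.find_all("td", {"data-value": True})]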
print(colors)
print(data_vals)
Output:
['#911C36', '#D6A960', '#E7DE6F', '#ddd', '#ddd', '#ddd']
['phase3']
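If only a single cell is expected, the same lookup can probably also be written with CSS attribute selectors. The following is a minimal sketch rather than part of the original answer: it assumes a BeautifulSoup version whose select() and select_one() support attribute selectors such as [fill] (the soupsieve-backed ones do), and it uses a shortened copy of the markup with only three rect elements. The variable names html, cell, and stage are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the cell from the question (three rects kept).
html = """
<td class="stage" data-value="phase3">
 Phase 3
<svg height="5" width="95" xmlns="http://www.w3.org/2000/svg">
<g fill="none" transform="translate(1 1)">
<rect fill="#911C36" height="5" rx="2" width="15"></rect>
<rect fill="#D6A960" height="5" rx="2" width="15" x="16"></rect>
<rect fill="#ddd" height="5" rx="2" width="15" x="32"></rect>
</g>
</svg> </td>
"""

soup = BeautifulSoup(html, "html.parser")

# "td[data-value]" matches a td that has the attribute, whatever its value.
cell = soup.select_one("td[data-value]")
stage = cell["data-value"] if cell is not None else None

# Restrict the search to rect elements nested inside that cell.
colors = [rect["fill"] for rect in soup.select("td[data-value] rect[fill]")]

print(stage)
print(colors)
```

Guarding against cell being None avoids a TypeError on pages where the cell is missing, which the list-comprehension version handles implicitly by returning an empty list.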