Parsing the href attribute value from an <a> element using BeautifulSoup and Mechanize

Question

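Can anyone help me traverse an HTML tree with Beautiful Soup?

I'm trying to parse HTML output and, after collecting each value, insert it into a table named Tld using Python/Django.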

<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>

I want to parse only the value of the <a> element's href attribute, so just this part:

https://billing.anapp.com/

of:

<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>

I currently have:

for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3',attrs={'class': 'r'})

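The problem is the find_all above: it stops at the <h3> and doesn't go far enough to reach the <a> element.

Any help is much appreciated. Thank you.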

Answer 1

from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
"""

bs = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])

Prints:

https://billing.anapp.com/

h3.r a is a CSS selector.

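You can use CSS selectors (I prefer them), XPath (through a library such as lxml, since Beautiful Soup itself does not support XPath), or searching within elements. The selector h3.r a finds every h3 with class r and gets the a elements inside them. It can be a more complex example, such as #an_id table tr.the_tr_class td.the_td_class, which finds a td of the given class, inside a tr of the given class, inside a table, inside the element with the given id.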
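To make the longer selector concrete, here is a minimal sketch run against a made-up fragment. The markup is invented purely for illustration, and the id and class names are just the placeholders from the selector above, not anything from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented only to exercise the selector.
html = """
<div id="an_id">
  <table>
    <tr class="the_tr_class">
      <td class="the_td_class">wanted</td>
      <td class="other">skipped: wrong td class</td>
    </tr>
    <tr class="other_row">
      <td class="the_td_class">skipped: wrong tr class</td>
    </tr>
  </table>
</div>
<table>
  <tr class="the_tr_class"><td class="the_td_class">skipped: outside #an_id</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Each space in the selector means "somewhere inside", so every step
# narrows the search: id -> table -> tr with class -> td with class.
cells = soup.select("#an_id table tr.the_tr_class td.the_td_class")
texts = [td.get_text() for td in cells]
print(texts)
```

Only the first cell satisfies all four steps of the selector, so it is the only one returned.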

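The following gives you the same result. find_all returns a list of bs4.element.Tag objects. find_all also has a recursive parameter, but I'm not sure whether you can do this in one line. Personally I prefer CSS selectors because they are simple and clean.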

for elm in bs.find_all('h3', attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
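On the one-line question: the nested loops can probably be folded into a single list comprehension. This is only a sketch. The second h3 in the sample markup is an invented case (an anchor with no href) added to show why filtering with href=True may be safer than indexing attrs["href"] directly, which would raise KeyError on such a tag.

```python
from bs4 import BeautifulSoup

# First <h3> is trimmed from the question; the second is a hypothetical
# anchor without an href, added only to show the filtering.
html = """
<h3 class="r"><a href="https://billing.anapp.com/">Billing: Portal Home</a></h3>
<h3 class="r"><a name="no-link">anchor without href</a></h3>
"""

bs = BeautifulSoup(html, "html.parser")

hrefs = [
    a["href"]
    for h3 in bs.find_all("h3", class_="r")
    for a in h3.find_all("a", href=True)  # href=True keeps only tags that have the attribute
]
print(hrefs)
```

The result is a plain list of strings, which is probably easier to hand to the database-insert step than printed output.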

Answer 2

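I think it's worth mentioning what happens when there are similarly named class attributes that contain spaces, that is, elements carrying several classes. Taking the code provided by user @Foo Bar and changing it slightly: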

from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r s">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
<h3 class='r s sth s'>
<a href="https://link_you_dont_want.com/">Don't grab this</a>
</h3>
"""

bs = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning

When we try to get the links whose class equals "r s" via a CSS selector:

elms = bs.select("h3.r.s a")
for i in elms:
    print(i.attrs["href"])

it prints

https://billing.anapp.com/
https://link_you_dont_want.com/

whereas using

for elm in bs.find_all('h3', attrs={'class': 'r s'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])

gives the desired result

https://billing.anapp.com/

This is just something I ran into in my own work. If there's a way to overcome this with CSS selectors, please let me know!
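One possible way to do this with a CSS selector is an attribute selector, which compares the whole class attribute string instead of testing for individual classes. The sketch below assumes a Beautiful Soup version whose select() handles [attr="value"] this way (recent releases delegate selectors to the soupsieve package). The markup is trimmed from the example above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two <h3> elements from the example above.
html = """
<h3 class="r s"><a href="https://billing.anapp.com/">Billing: Portal Home</a></h3>
<h3 class="r s sth s"><a href="https://link_you_dont_want.com/">Don't grab this</a></h3>
"""

bs = BeautifulSoup(html, "html.parser")

# h3.r.s means "has class r AND class s", so it matches both headings.
both = [a["href"] for a in bs.select("h3.r.s a")]

# h3[class="r s"] compares the full attribute value, so only the
# heading whose class attribute is exactly "r s" matches.
exact = [a["href"] for a in bs.select('h3[class="r s"] a')]

print(both)
print(exact)
```

Like the find_all version with 'r s', this is an exact string comparison, so it depends on the order of the classes: an element written as class="s r" would not match.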

Answer 1: 6, accepted, 2013-11-14 16:43:19

Answer 2: 0, 2022-03-14 09:10:49