Is a regex needed, or can BeautifulSoup refine the output?

Question

If I use the following function, I can grab the text and link I need from a website:

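import requests
from bs4 import BeautifulSoup

def get_url_text(url):
    # Fetch the page and parse it with the standard-library HTML parser.
    source = requests.get(url)
    plain_text = source.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    # Each podcast entry is an <li class="ptb2"> element.
    for item_name in soup.find_all('li', {'class': 'ptb2'}):
        print(item_name.string)  # the entry's text
        print(item_name.a)       # the whole <a> tag, not just its href

get_url_text('https://www.residentadvisor.net/podcast.aspx')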

It returns:

RA.532 Marquis Hawkes
<a href="/podcast-episode.aspx?id=532"><h1>RA.532 Marquis Hawkes</h1></a>
RA.531 Evan Baggs
<a href="/podcast-episode.aspx?id=531"><h1>RA.531 Evan Baggs</h1></a>
RA.530 MCDE vs Jeremy Underground

If I only want the href link instead of the tags etc. surrounding it, do I need to use a regex, or is there another method within BeautifulSoup?

Desired output is:

RA.532 Marquis Hawkes
https://www.residentadvisor.net/podcast-episode.aspx?id=532

for each similar element.

Answer 1

You can use print(item_name.a['href']) and, if needed, prefix it with https://www.residentadvisor.net (because the links in the page are written without an explicit scheme and netloc part, e.g. /podcast-episode.aspx?id=528).
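As a rough sketch of that suggestion, the snippet below reads the href attribute from each link and resolves it against the page URL. It swaps manual string prefixing for urljoin from the standard library, which is one possible way to add the missing scheme and host. The SAMPLE_HTML string and the extract_links helper are hypothetical, written only to mirror the markup shown in the question so the example runs without a network request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Page the relative links are resolved against (taken from the question).
BASE_URL = "https://www.residentadvisor.net/podcast.aspx"

# Hypothetical inline sample that mirrors the markup shown in the question.
SAMPLE_HTML = """
<ul>
  <li class="ptb2"><a href="/podcast-episode.aspx?id=532"><h1>RA.532 Marquis Hawkes</h1></a></li>
  <li class="ptb2"><a href="/podcast-episode.aspx?id=531"><h1>RA.531 Evan Baggs</h1></a></li>
</ul>
"""


def extract_links(html, base_url):
    """Return (title, absolute_url) pairs for every <li class="ptb2"> entry."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all("li", {"class": "ptb2"}):
        link = item.a
        # Skip entries that have no <a> tag or no href attribute.
        if link is None or not link.has_attr("href"):
            continue
        title = item.get_text(strip=True)
        # urljoin adds the scheme and host to a root-relative href.
        results.append((title, urljoin(base_url, link["href"])))
    return results


for title, url in extract_links(SAMPLE_HTML, BASE_URL):
    print(title)
    print(url)
```

In the original function, the same idea would amount to replacing print(item_name.a) with print(urljoin(url, item_name.a['href'])). No regex is involved in either form.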

Answer 1: 3 votes, accepted, 2016-09-07 21:07:06
