Is a regex required, or can BeautifulSoup refine the output?
If I use the following function, I can grab the text and link I need from a website:
import requests
from bs4 import BeautifulSoup

def get_url_text(url):
    source = requests.get(url)
    plain_text = source.text
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(plain_text, 'html.parser')
    for item_name in soup.find_all('li', {'class': 'ptb2'}):
        print(item_name.string)
        print(item_name.a)

get_url_text('https://www.residentadvisor.net/podcast.aspx')
This prints:
RA.532 Marquis Hawkes
<a href="/podcast-episode.aspx?id=532"><h1>RA.532 Marquis Hawkes</h1></a>
RA.531 Evan Baggs
<a href="/podcast-episode.aspx?id=531"><h1>RA.531 Evan Baggs</h1></a>
RA.530 MCDE vs Jeremy Underground
If I only want the href link, without the surrounding tags, do I need a regex, or is there another method within BeautifulSoup?
Desired output is:
RA.532 Marquis Hawkes
https://www.residentadvisor.net/podcast-episode.aspx?id=532
for each similar element.
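You can use print(item_name.a['href']) and, if needed, prefix the result with https://www.residentadvisor.net, because the links in the page are written without an explicit scheme and netloc part (for example, /podcast-episode.aspx?id=528). No regex is required.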
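As a rough sketch of the above, the snippet below reads the href attribute and resolves it against the page URL with urljoin from the standard library instead of concatenating strings by hand. The names BASE_URL, absolute_url and extract_links are made up for this illustration, and the 'ptb2' class is taken from the question, so the page may no longer match it.

```python
from urllib.parse import urljoin

# Assumption: the page URL from the question serves as the base for relative links.
BASE_URL = "https://www.residentadvisor.net/podcast.aspx"


def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


def extract_links(html, base=BASE_URL):
    """Return (text, absolute URL) pairs for each <li class="ptb2"> item.

    Hypothetical helper for illustration; bs4 is imported here so that
    absolute_url stays usable without the third-party package installed.
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all("li", {"class": "ptb2"}):
        link = item.a
        # Skip items that have no <a> tag or no href attribute.
        if link is None or not link.get("href"):
            continue
        results.append((link.get_text(strip=True), absolute_url(link["href"], base)))
    return results
```

Passing requests.get(url).text to extract_links and printing each pair on two lines should give output in the shape the question asks for. urljoin leaves hrefs that are already absolute unchanged, which a plain string prefix would not.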