Scraping links using bs4 and Python

I want to parse the links out of a web page using bs4, and I would like to avoid regex.

import requests
from bs4 import BeautifulSoup

def generate_url(day, year, month):
    # Fetch the box score index for one date and return the parsed page
    url = f"http://hockey-reference.com/boxscores/?year={year}&month={month}&day={day}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    return soup

soup = generate_url(13, 2021, 1)
html_links = soup.find_all('td', class_='right gamelink')

My result is a list of td elements with the HTML still embedded:

[<td class="right gamelink">
<a href="/boxscores/202101130COL.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130EDM.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130PHI.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130TBL.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130TOR.html">F<span class="no_mobile">inal</span></a>
</td>]

What is the best way to extract these links?

Extend your code by iterating over html_links and reading the href of the anchor inside each cell:

url = 'http://hockey-reference.com'
for html_link in html_links:
    # Each <td> holds a single <a>; read its href attribute
    link = html_link.find('a')['href']
    print(url + link)
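
As a variation on the loop above, here is a minimal offline sketch that uses a CSS selector and urllib.parse.urljoin instead of string concatenation. SAMPLE_HTML and the extract_game_links helper are hypothetical names made up for illustration; the markup is copied from the question's output and wrapped in a table, and "html.parser" is used only so the sketch does not depend on lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://hockey-reference.com"

# Hypothetical sample: two cells copied from the question's output.
SAMPLE_HTML = """
<table><tr>
<td class="right gamelink">
<a href="/boxscores/202101130COL.html">F<span class="no_mobile">inal</span></a>
</td>
<td class="right gamelink">
<a href="/boxscores/202101130EDM.html">F<span class="no_mobile">inal</span></a>
</td>
</tr></table>
"""


def extract_game_links(html, base_url=BASE_URL):
    """Return absolute URLs for every anchor inside a td.gamelink cell."""
    soup = BeautifulSoup(html, "html.parser")
    # "td.gamelink a[href]" matches anchors that carry an href and sit inside
    # a <td> whose class list includes "gamelink".
    return [urljoin(base_url, a["href"]) for a in soup.select("td.gamelink a[href]")]


links = extract_game_links(SAMPLE_HTML)
for link in links:
    print(link)
```

With a live page, the same helper could be called on page.content from the question's code; urljoin also copes with hrefs that are already absolute.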

If you only want the links that contain "boxscores", use this:

from bs4 import BeautifulSoup
import requests
import re

a = requests.get("https://www.hockey-reference.com/boxscores/?year=2021&month=1&day=13")
soup = BeautifulSoup(a.text, features="html.parser")

for link in soup.find_all('a', attrs={'href': re.compile("boxscores")}):
    print(link['href'])

The output includes a lot of extra links. If you only want the ones that contain /boxscores/2021, change the pattern passed to re.compile to "boxscores/2021".

This uses the re module to find "boxscores" in each link, so be sure to import re.
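
Since the question asks to avoid regex, the same filter can probably be written with a plain function passed as the href value; Beautiful Soup calls it with each tag's attribute value, which may be None. The sketch below runs offline on made-up markup, and the is_game_link name is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the real page.
SAMPLE_HTML = """
<a href="/boxscores/">Box scores index</a>
<a href="/boxscores/202101130COL.html">Final</a>
<a href="/teams/COL/2021.html">Colorado Avalanche</a>
<a name="top">No href here</a>
"""


def is_game_link(href):
    """True for href values that point at a 2021 box score page."""
    # Guard against None, which is what an anchor without href would supply.
    return href is not None and href.startswith("/boxscores/2021")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
game_links = [a["href"] for a in soup.find_all("a", href=is_game_link)]
print(game_links)
```

Using startswith instead of a substring match is a design choice here: it keeps only paths that begin with the prefix, which is stricter than re.compile("boxscores/2021").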

Furthermore, if you want every link on the page, use this:

for link in soup.find_all('a', href=True):
    print(link['href'])
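
Passing href=True matches any anchor that has an href attribute, whatever its value, so anchors without one are skipped and link['href'] does not raise KeyError. A small offline sketch with made-up markup follows; dropping duplicates with dict.fromkeys is an addition for illustration, not part of the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the real page.
SAMPLE_HTML = """
<a href="/boxscores/">Box scores</a>
<a name="top">anchor without href</a>
<a href="/leagues/">Seasons</a>
<a href="/boxscores/">Box scores (repeated)</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# dict.fromkeys removes duplicates while preserving first-seen order.
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```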
