Scraping links with bs4 and Python

Question

I am trying to parse the links from a website using bs4, and I would like to avoid regular expressions.

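import requests
from bs4 import BeautifulSoup

def generate_url(day, year, month):
    # Fetch the box score index page for the given date and parse it
    url = f"http://hockey-reference.com/boxscores/?year={year}&month={month}&day={day}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    return soup

soup = generate_url(13, 2021, 1)
# Each game's box score link sits in a <td class="right gamelink"> cell
html_links = soup.find_all('td', class_='right gamelink')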

My result is a list with the HTML still embedded:

[<td class="right gamelink">
<a href="/boxscores/202101130COL.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130EDM.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130PHI.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130TBL.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130TOR.html">F<span class="no_mobile">inal</span></a>
</td>]

What is the best way to extract these links?

Answer 1

Add the following to your code to iterate over html_links and get the href from each entry:

url = 'http://hockey-reference.com'
for html_link in html_links:
    # Each <td> cell contains one <a> tag; read its href attribute
    link = html_link.find('a')['href']
    print(url + link)
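
String concatenation works here because every href starts with a slash. As a slightly more defensive sketch, `urljoin` from the standard library's `urllib.parse` resolves each href against the base URL. `SAMPLE_HTML`, `BASE_URL`, and the `extract_links` helper are illustrative names that do not appear in the original code; the markup is a trimmed copy of the output shown in the question so that the example runs without a network request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the markup from the question, used offline.
SAMPLE_HTML = """
<table><tr>
<td class="right gamelink">
<a href="/boxscores/202101130COL.html">F<span class="no_mobile">inal</span></a>
</td>
<td class="right gamelink">
<a href="/boxscores/202101130EDM.html">F<span class="no_mobile">inal</span></a>
</td>
</tr></table>
"""

BASE_URL = "https://www.hockey-reference.com"


def extract_links(html, base_url=BASE_URL):
    """Hypothetical helper: return an absolute URL for every game link cell."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for cell in soup.find_all("td", class_="right gamelink"):
        anchor = cell.find("a", href=True)
        if anchor is not None:  # skip cells that have no link
            links.append(urljoin(base_url, anchor["href"]))
    return links


links = extract_links(SAMPLE_HTML)
print(links)
```

With a live page, the same helper could be given `page.content` from `requests.get` instead of the sample string.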

Answer 2

If you only want the links that contain "boxscores", use:

from bs4 import BeautifulSoup
import requests
import re

response = requests.get("https://www.hockey-reference.com/boxscores/?year=2021&month=1&day=13")
soup = BeautifulSoup(response.text, features="html.parser")

# Keep only <a> tags whose href matches the pattern "boxscores"
for link in soup.find_all('a', attrs={'href': re.compile("boxscores")}):
    print(link['href'])

The output contains a lot of empty links. If you only want the links of the form /boxscores/2021..., just change the re.compile pattern to "boxscores/2021".

This uses the re module to look for "boxscores" in each link, so be sure to import re.
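
Since the question asked for a way to avoid regular expressions, one possible alternative is a CSS attribute selector passed to `select`, which matches a substring of the href without the re module. This is only a sketch: the `SAMPLE_HTML` markup below is made up for illustration and is not the real page.

```python
from bs4 import BeautifulSoup

# Assumption: invented sample markup that stands in for the real page.
SAMPLE_HTML = """
<a href="/boxscores/">Scores</a>
<a href="/boxscores/202101130COL.html">Final</a>
<a href="/teams/COL/2021.html">Colorado Avalanche</a>
<a href="/boxscores/202101130EDM.html">Final</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [href*="..."] matches anchors whose href contains the given substring.
game_links = [a["href"] for a in soup.select('a[href*="boxscores/2021"]')]
print(game_links)
```

The selector plays the same role as `re.compile("boxscores/2021")` in the loop above, but as a plain substring match.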

Also, if you want to get every link on the page, use:

for link in soup.find_all('a', href=True):
    print(link['href'])
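
To illustrate what `href=True` does, the sketch below runs the same call on a small invented snippet: anchors without an href attribute are skipped, and duplicate targets can then be dropped while keeping their order. `SAMPLE_HTML` and the variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Assumption: invented sample markup, including one anchor with no href.
SAMPLE_HTML = """
<a name="top">No href here</a>
<a href="/boxscores/202101130COL.html">Final</a>
<a href="/boxscores/202101130COL.html">Box score</a>
<a href="/teams/">Teams</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# dict.fromkeys drops duplicates while preserving first-seen order.
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```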

2 answers

Answer 1
0 2021-01-23 22:33:05

Answer 2
0 2021-01-23 22:52:22
