I am looking to parse links out of a website using bs4. I was trying to avoid using regex.
import requests
from bs4 import BeautifulSoup

def generate_url(day, year, month):
    # Build the box score index URL for one date and return the parsed page
    url = f"http://hockey-reference.com/boxscores/?year={year}&month={month}&day={day}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    return soup

soup = generate_url(13, 2021, 1)
html_links = soup.find_all('td', class_='right gamelink')
My result is a list with the html embedded...
[<td class="right gamelink">
<a href="/boxscores/202101130COL.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130EDM.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130PHI.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130TBL.html">F<span class="no_mobile">inal</span></a>
</td>,
<td class="right gamelink">
<a href="/boxscores/202101130TOR.html">F<span class="no_mobile">inal</span></a>
</td>]
What are the best ways to extract these links?
Extend your code by iterating over html_links and reading the href attribute of the <a> tag inside each cell:
url = 'http://hockey-reference.com'
for html_link in html_links:
    # Each cell holds one <a> tag; its href is a site-relative path
    link = html_link.find('a')['href']
    print(url + link)
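As an alternative to the loop above, here is a minimal sketch that uses a CSS selector and urljoin instead of string concatenation. It runs against a small SAMPLE_HTML string copied from the question's output rather than the live page, and the helper name extract_game_links is hypothetical, chosen only for this illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup copied from the question's output (two of the five cells).
SAMPLE_HTML = """
<table><tr>
<td class="right gamelink">
<a href="/boxscores/202101130COL.html">F<span class="no_mobile">inal</span></a>
</td>
<td class="right gamelink">
<a href="/boxscores/202101130EDM.html">F<span class="no_mobile">inal</span></a>
</td>
</tr></table>
"""

BASE_URL = "http://hockey-reference.com"  # same base URL as in the answer


def extract_game_links(soup, base_url=BASE_URL):
    """Return absolute URLs for every <a href> inside a td.gamelink cell.

    Hypothetical helper for illustration; not part of bs4.
    """
    anchors = soup.select("td.gamelink a[href]")
    return [urljoin(base_url, a["href"]) for a in anchors]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
game_links = extract_game_links(soup)
for link in game_links:
    print(link)
```

The selector skips any cell that has no anchor, so the lookup cannot fail on an empty cell the way indexing the result of find('a') can.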
If you only want the links that contain "boxscores", use this:
from bs4 import BeautifulSoup
import requests
import re
a = requests.get("https://www.hockey-reference.com/boxscores/?year=2021&month=1&day=13")
soup = BeautifulSoup(a.text, features="html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("boxscores")}):
    print(link['href'])
This prints many links besides the individual game pages. If you only want the ones that contain /boxscores/2021, simply change the re.compile pattern to "boxscores/2021".
This uses the re module to find "boxscores" in each link, so be sure to import re.
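To see what narrowing the pattern changes, here is a small standard-library sketch. Beautiful Soup tests a compiled pattern against the attribute value with search(), so the pattern may match anywhere in the href. The hrefs list below is made up for illustration and is not the real output of the page.

```python
import re

# Hypothetical hrefs of the kind such a page might contain.
hrefs = [
    "/boxscores/",
    "/boxscores/202101130COL.html",
    "/boxscores/202101130EDM.html",
    "/teams/COL/2021.html",
]

broad = re.compile("boxscores")        # pattern from the answer
narrow = re.compile("boxscores/2021")  # the suggested narrower pattern

# search() matches anywhere in the string, like bs4's attribute filter
broad_hits = [h for h in hrefs if broad.search(h)]
narrow_hits = [h for h in hrefs if narrow.search(h)]

print(broad_hits)
print(narrow_hits)
```

With this sample, the broad pattern also keeps the bare /boxscores/ index link, while the narrow one keeps only the two game pages.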
Furthermore, if you wish to get all links from the webpage, use this:
for link in soup.find_all('a', href=True):
    print(link['href'])
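The href=True filter keeps only anchors that actually have an href attribute, which is why link['href'] is safe inside that loop. A minimal sketch on a made-up HTML snippet (an assumption for illustration, not the real page) that also drops duplicate links while keeping their order:

```python
from bs4 import BeautifulSoup

# Made-up markup: one anchor has no href, and one link appears twice.
html = (
    '<a href="/a">A</a>'
    '<a name="top">no href</a>'
    '<a href="/b">B</a>'
    '<a href="/a">A again</a>'
)

soup = BeautifulSoup(html, "html.parser")

# href=True skips the <a name="top"> anchor, so indexing never raises KeyError
all_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# dict.fromkeys preserves first-seen order while removing duplicates
unique_hrefs = list(dict.fromkeys(all_hrefs))

print(all_hrefs)
print(unique_hrefs)
```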