I would like to scrape all of the "Boxscore" hyperlinks from the webpage passed to "requests.get" below and write them to an Excel spreadsheet. However, the program below outputs all of the text found under the class "game" on that page. What needs to be changed so that it outputs only the boxscore href found within the "em" elements under the class "game"?
import requests
from bs4 import BeautifulSoup
import pandas as pd
from openpyxl import load_workbook
wb = load_workbook("tennis_input3.xlsx")
ws = wb.active
response = requests.get('https://www.baseball-reference.com/leagues/majors/2010-schedule.shtml')
webpage = response.content
soup = BeautifulSoup(response.text, "html.parser")
col1 = soup.find_all("p", class_="game")
print(pd.DataFrame({"MatchLink":col1}))
df = pd.DataFrame({"MatchLink":col1})
df.to_excel("tennis_3.xlsx", sheet_name="welcome")
Select your elements more specifically, as you described yourself:
soup.select('p.game em a')
or
soup.select('p.game a[href*=boxes]')
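To see what the two selectors match without hitting the site, here is a minimal sketch run against an inline fragment. The HTML below is a hypothetical stand-in for the schedule page's markup (team links plus one "Boxscore" link inside an "em" per game), so the real page may differ in detail:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the schedule page's markup (an assumption,
# not copied from the live site).
SAMPLE_HTML = """
<p class="game">
  <a href="/teams/NYY/2010.shtml">New York Yankees</a> (7) @
  <a href="/teams/BOS/2010.shtml">Boston Red Sox</a> (9)
  <em><a href="/boxes/BOS/BOS201004040.shtml">Boxscore</a></em>
</p>
<p class="game">
  <a href="/teams/CLE/2010.shtml">Cleveland Indians</a> (0) @
  <a href="/teams/CHW/2010.shtml">Chicago White Sox</a> (6)
  <em><a href="/boxes/CHA/CHA201004050.shtml">Boxscore</a></em>
</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Option 1: only anchors nested in an <em> inside <p class="game">.
by_em = [a["href"] for a in soup.select("p.game em a")]

# Option 2: any anchor in <p class="game"> whose href contains "boxes".
by_href = [a["href"] for a in soup.select('p.game a[href*="boxes"]')]

print(by_em)
print(by_href)
```

On this fragment both selectors skip the team links and return only the two boxscore hrefs. The first depends on the "em" wrapper, the second on the "/boxes/" path, so which one is more robust depends on which part of the markup is less likely to change.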
import requests
from bs4 import BeautifulSoup
import pandas as pd
response = requests.get('https://www.baseball-reference.com/leagues/majors/2010-schedule.shtml')
soup = BeautifulSoup(response.text, "html.parser")
pd.DataFrame(
['https://www.baseball-reference.com'+e.get('href') for e in soup.select('p.game em a')],
columns = ['url']
)#.to_excel(...)
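The snippet above builds absolute URLs by string concatenation, which works as long as every href is site-relative and starts with "/". As a hedged alternative, urljoin from the standard library also handles hrefs that are already absolute. The helper name absolutize and the sample hrefs below are hypothetical, for illustration only:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.baseball-reference.com"


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against the base URL (hypothetical helper)."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical hrefs: one site-relative, one already absolute.
urls = absolutize([
    "/boxes/BOS/BOS201004040.shtml",
    "https://www.baseball-reference.com/boxes/CHA/CHA201004050.shtml",
])
print(urls)
```

The resulting list can be passed to pd.DataFrame(..., columns=['url']) in place of the list comprehension used above.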