I am trying to scrape a table from a website using Python and BeautifulSoup4. I then want to output the table, but skip its first 5 columns. Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import pandas as pd

def scrape_data():
    url1 = "https://basketball-reference.com/leagues/NBA_2020_advanced.html"
    html1 = urlopen(url1)
    soup1 = bs(html1, 'html.parser')
    soup1.findAll('tr', limit = 2)
    # Header cells of the first row, minus the first 5 columns
    headers1 = [th.getText() for th in soup1.findAll('tr', limit = 2)[0].findAll('th')]
    headers1 = headers1[5:]
    # Every row after the header row
    rows1 = soup1.findAll('tr')[1:]
    player_stats = [[td.getText() for td in rows1[i].findAll('td')] for i in range(len(rows1))]
    stats1 = pd.DataFrame(player_stats, columns=headers1)
    return stats1
The error I get is:
ValueError: 24 columns passed, passed data had 28 columns
I know the error comes from stats1 = pd.DataFrame(player_stats, columns=headers1), but how do I fix it?
Just use iloc on the dataframe that pd.read_html gives you. Note that read_html returns a list of dataframes, although there is only one for this URL, so you need to access that single dataframe via pd.read_html(url)[0]. Then use iloc to ignore the first five columns.
import pandas as pd

url = "https://basketball-reference.com/leagues/NBA_2020_advanced.html"
df = pd.read_html(url)[0].iloc[:, 5:]
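To see what iloc[:, 5:] does without hitting the network, here is a minimal sketch on a small hand-built dataframe standing in for the one read_html returns. The column names and values are invented for illustration and are not taken from the site.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_html(url)[0]; names and values are made up.
df = pd.DataFrame(
    [[1, "Player A", "C", 26, "AAA", 63, 1680, 20.5],
     [2, "Player B", "PG", 22, "BBB", 72, 2417, 15.8]],
    columns=["Rk", "Player", "Pos", "Age", "Tm", "G", "MP", "PER"],
)

# iloc selects by position: ":" keeps every row, "5:" drops the first five columns.
trimmed = df.iloc[:, 5:]
print(list(trimmed.columns))  # ['G', 'MP', 'PER']
```

Because iloc works on positions rather than labels, this does not depend on what the first five columns are called.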
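I solved it thanks to some help from @JonClements. Going by the error message, each row has one fewer td cell than the header row has th cells (28 versus 29), so after dropping 5 headers I only need to drop the first 4 td cells from each row. My working code is: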
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import pandas as pd

def scrape_data():
    url1 = "https://basketball-reference.com/leagues/NBA_2020_advanced.html"
    html1 = urlopen(url1)
    soup1 = bs(html1, 'html.parser')
    soup1.findAll('tr', limit = 2)
    # Header cells of the first row, minus the first 5 columns
    headers1 = [th.getText() for th in soup1.findAll('tr', limit = 2)[0].findAll('th')]
    headers1 = headers1[5:]
    rows1 = soup1.findAll('tr')[1:]
    # Drop the first 4 td cells so each row lines up with the 24 remaining headers
    player_stats = [[td.getText() for td in rows1[i].findAll('td')[4:]] for i in range(len(rows1))]
    stats1 = pd.DataFrame(player_stats, columns=headers1)
    return stats1
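The off-by-one between the two slices can be checked on a small inline table. This is only a sketch, under the assumption that the first cell of each body row is marked up as th rather than td, which would explain the 28 versus 29 count in the error. The HTML below is invented and much narrower than the real page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented sample: the header row has 7 <th>; each body row has 1 <th> + 6 <td>.
html = """
<table>
  <tr><th>Rk</th><th>Player</th><th>Pos</th><th>Age</th><th>Tm</th><th>G</th><th>MP</th></tr>
  <tr><th>1</th><td>Player A</td><td>C</td><td>26</td><td>AAA</td><td>63</td><td>1680</td></tr>
  <tr><th>2</th><td>Player B</td><td>PG</td><td>22</td><td>BBB</td><td>72</td><td>2417</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Drop 5 header cells, but only 4 td cells, because the row's first cell is a th.
headers = [th.get_text() for th in rows[0].find_all("th")][5:]
data = [[td.get_text() for td in row.find_all("td")[4:]] for row in rows[1:]]

stats = pd.DataFrame(data, columns=headers)
print(stats)
```

With td cells sliced at [5:] instead, each row here would have one value for two headers and pd.DataFrame would raise the same kind of ValueError.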