The row is a duplicate of the header row. The row occurs over and over again randomly, and I do not want it in the data set (naturally). I think the HTML page has it there to remind the viewer what column attributes they are looking at as they scroll down.
Below is a sample of one of the row elements I want to delete:
<tr class="thead" data-row="25">
Here is another one:
<tr class="thead" data-row="77">
They occur randomly, but is there any way to write a loop that iterates over the rows, finds the first cell in each row, and determines whether it is a row I want to delete? The rows are identical each time: the first cell always says "Player", identifying the attribute. Below is an example of what that first cell looks like as an HTML element.
<th aria-label="Player" data-stat="player" scope="col" class=" poptip sort_default_asc center">Player</th>
Maybe I could create a loop that iterates through each row and checks whether its first cell says "Player"; if it does, delete that whole row. Is that possible?
Here is my code so far:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import string
years = list(range(2023, 2024))
alphabet = list(string.ascii_lowercase)
url_namegather = 'https://www.basketball-reference.com/players/{}'
lastname_a = 'a'
url = url_namegather.format(lastname_a)
data = requests.get(url)
with open("player_names/lastname_{}.html".format(lastname_a), "w+", encoding="utf-8") as f:
    f.write(data.text)
with open("player_names/lastname_{}.html".format(lastname_a), encoding="utf-8") as f:
    page = f.read()
soup = BeautifulSoup(page, "html.parser")
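Yes, the loop the question describes is possible. A minimal sketch, using a made-up table fragment rather than the real page (the `class="thead"` attribute and `data-stat="player"` cell follow the samples shown above): find every `<tr class="thead">`, confirm its first cell reads "Player", and remove it with `decompose()`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the page: a real header row,
# data rows, and a repeated header row with class="thead".
html = """
<table>
  <tr class="thead"><th data-stat="player">Player</th><th>Pts</th></tr>
  <tr><td data-stat="player">A. Abdelnaby</td><td>10</td></tr>
  <tr class="thead"><th data-stat="player">Player</th><th>Pts</th></tr>
  <tr><td data-stat="player">Z. Abdul-Aziz</td><td>12</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Delete every repeated header row: match on class="thead", then
# double-check that the first cell really says "Player".
for row in soup.find_all("tr", class_="thead"):
    first_cell = row.find(["th", "td"])
    if first_cell and first_cell.get_text(strip=True) == "Player":
        row.decompose()
```

After the loop, only the data rows remain in `soup`, so any table built from it afterwards is free of the duplicated headers.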
You can read the table directly using pandas. You may need to install the lxml package, though.
# read tables from the url
df_list = pd.read_html('https://www.basketball-reference.com/players/a')
# Select the first DataFrame from the list of DataFrames
df = df_list[0]
This will get data without any duplicated header rows.
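If any duplicated header rows do survive into the DataFrame, they can also be dropped afterwards in pandas by filtering on the first column. A small sketch, with a hypothetical frame standing in for the scraped table:

```python
import pandas as pd

# Hypothetical stand-in for a scraped table whose header text
# ("Player", "Pts") reappears as a data row.
df = pd.DataFrame({
    "Player": ["A. Abdelnaby", "Player", "Z. Abdul-Aziz"],
    "Pts": ["10", "Pts", "12"],
})

# Keep only rows whose first column is not the literal header text.
df = df[df["Player"] != "Player"].reset_index(drop=True)
```

This works because the stray rows repeat the header text verbatim, so comparing the "Player" column against the string "Player" identifies exactly those rows.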