
How do I decompose() a recurring row in a table on an HTML page using Python?

The row is a duplicate of the header row. It appears over and over again at random intervals, and I do not want it in the data set (naturally). I think the HTML page includes it to remind the viewer which column attributes they are looking at as they scroll down.

Below is a sample of one of the row elements I want to delete:

<tr class="thead" data-row="25">

Here is another one:

<tr class="thead" data-row="77">

They occur randomly, but is there any way to make a loop that iterates over the rows, finds the first cell in each one, and determines that it is in fact a row we want to delete? The rows are identical each time: the first cell is always "Player", identifying the attribute. Below is an example of what that looks like as an HTML element.

<th aria-label="Player" data-stat="player" scope="col" class=" poptip sort_default_asc center">Player</th>

Maybe I can create a loop that iterates through each row and checks whether the first cell says "Player", and if it does, deletes that whole row. Is that possible?

Here is my code so far:

    from bs4 import BeautifulSoup
    import pandas as pd
    import requests
    import string

    years = list(range(2023, 2024))
    alphabet = list(string.ascii_lowercase)

    # The URL needs a {} placeholder for .format() to insert the letter
    url_namegather = 'https://www.basketball-reference.com/players/{}'
    lastname_a = 'a'
    url = url_namegather.format(lastname_a)
    data = requests.get(url)

    with open("player_names/lastname_{}.html".format(lastname_a), "w+", encoding="utf-8") as f:
        f.write(data.text)

    with open("player_names/lastname_{}.html".format(lastname_a), encoding="utf-8") as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")
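The loop described in the question is indeed possible. Below is a minimal, self-contained sketch of it; the HTML string here is made up for illustration, and on the real page the same loop would run over the `soup` object built above. It checks the first cell of every row after the genuine header and calls `decompose()` on any row whose first cell reads "Player".

```python
from bs4 import BeautifulSoup

# Made-up table with one repeated header row (class="thead"),
# mirroring the structure described in the question
html = """
<table>
  <tr><th data-stat="player">Player</th><th>Pts</th></tr>
  <tr><td>A. Adams</td><td>10</td></tr>
  <tr class="thead"><th data-stat="player">Player</th><th>Pts</th></tr>
  <tr><td>B. Brown</td><td>12</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Skip the very first (genuine) header row, then remove any later
# row whose first cell is the literal text "Player"
for row in soup.find_all("tr")[1:]:
    first_cell = row.find(["th", "td"])
    if first_cell and first_cell.get_text(strip=True) == "Player":
        row.decompose()
```

Since the repeated rows in the question also carry `class="thead"`, an equivalent and simpler match would be `soup.find_all("tr", class_="thead")`, decomposing each result.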

You can read the table directly using pandas. You may need to install the lxml package, though.

# read tables from the url
df_list = pd.read_html('https://www.basketball-reference.com/players/a')
# Select the first DataFrame from the list of DataFrames
df = df_list[0]

This will get data without any duplicated header rows.
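If any repeated header rows do slip through as ordinary data rows, they can be filtered out afterwards in pandas. A hedged sketch with a made-up table (the column name "Player" matches the page described in the question):

```python
import pandas as pd
from io import StringIO

# Made-up table containing one repeated header row as a data row
html = """
<table>
  <tr><th>Player</th><th>Pts</th></tr>
  <tr><td>A. Adams</td><td>10</td></tr>
  <tr><td>Player</td><td>Pts</td></tr>
  <tr><td>B. Brown</td><td>12</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]
# Drop rows whose Player cell is just the repeated header text
df = df[df["Player"] != "Player"].reset_index(drop=True)
```

The same one-line filter works on the DataFrame read from the URL above.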
