Python: Pandas - ONLY remove NaN rows and move up data, do not move up data in rows with partial NaNs

Alright, so here is the code I'm currently drafting to pull fielding stats for all National League players. It works fine; however, I'd like to know how to drop ONLY the rows that consist entirely of NaNs from a dataframe without disturbing any of the other data:

# import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# create a url object
url = r'https://www.baseball-reference.com/leagues/NL/2022-standard-fielding.shtml'

# create list of the stats that we care about
standardFieldingStats = [
    'player',
    'team_ID',
    'G',
    'GS',
    'CG',
    'Inn_def',
    'chances',
    'PO',
    'A',
    'E_def',
    'DP_def',
    'fielding_perc',
    'tz_runs_total',
    'tz_runs_total_per_season',
    'bis_runs_total',
    'bis_runs_total_per_season',
    'bis_runs_good_plays',
    'range_factor_per_nine',
    'range_factor_per_game',
    'pos_summary'
]

# Create object page
page = requests.get(url)

# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')

# grab the NL fielding table for the current year
tableNLFielding = soup.find('table', id='players_players_standard_fielding_fielding')

# grab player UID
puidList = []
rows = tableNLFielding.select('tr')
for row in rows:
    playerUID = row.select_one('td[data-append-csv]')
    playerUID = playerUID.get('data-append-csv') if playerUID else None
    if playerUID is None:
        continue
    puidList.append(playerUID)

# grab each player's fielding stats
compList = []
for row in rows:
    thingList = []
    for stat in range(len(standardFieldingStats)):
        thing = row.find("td", attrs={"data-stat" : standardFieldingStats[stat]})
        if thing is None:
            continue
        elif row.find("td", attrs={"data-stat" : 'player'}).text == 'Team Totals':
            continue
        elif row.find("td", attrs={"data-stat" : 'player'}).text == 'Rank in 15 NL teams':
            continue
        elif row.find("td", attrs={"data-stat" : 'player'}).text == 'Rank in 15 AL teams':
            continue
        elif thing.text == '':
            continue
        elif thing.text == 'NaN':
            continue
        else:
            thingList.append(thing.text)
    compList.append(thingList)

# build a dataframe from the collected rows, using the stat names as column headers
NLFieldingDf = pd.DataFrame(data=compList, columns=standardFieldingStats)

#NLFieldingDf = NLFieldingDf.apply(lambda x: pd.Series(x.dropna().values))

#NLFieldingDf = NLFieldingDf.apply(lambda x: pd.Series(x.fillna('').values))

# make all NaNs blanks for aesthetic reasons
#NLFieldingDf = NLFieldingDf.fillna('')

#NLFieldingDf.insert(loc=0, column='pUID', value=puidList)

An example: the dataframe I want to remove NaNs from:

player             team   pos_summary
NaN                NaN    NaN
Brandon Woodruff   NaN    P   
William Woods      ATL    NaN
Kyle Wright        ATL    P

When I try, my dataframe looks like this, with the data moved out of place:

player             team   pos_summary
Brandon Woodruff   ATL    P   
William Woods      ATL    P
Kyle Wright

Ideally, I want this: no all-NaN rows, while keeping the rows with partial NaNs intact:

player             team   pos_summary
Brandon Woodruff          P   
William Woods      ATL    
Kyle Wright        ATL    P

Refer to the end of the complete code to see my attempts.

Try this to remove the rows in which every value is NaN:

df = df.dropna(how="all")

Further, if you need to replace the remaining NaN values with '', use:

df.fillna('', inplace=True)
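
As a minimal sketch of the two calls above, here they are applied to the small example frame from the question (the variable names `df` and `cleaned` are only illustrative). `dropna(how="all")` removes a row only when every column in it is NaN, so rows with partial NaNs stay untouched. The `reset_index(drop=True)` call is an addition not mentioned above; it renumbers the remaining rows so the data "moves up".

```python
import numpy as np
import pandas as pd

# Example frame from the question: one all-NaN row, two partially-NaN rows.
df = pd.DataFrame(
    {
        "player": [np.nan, "Brandon Woodruff", "William Woods", "Kyle Wright"],
        "team": [np.nan, np.nan, "ATL", "ATL"],
        "pos_summary": [np.nan, "P", np.nan, "P"],
    }
)

# Drop only the rows in which every value is NaN, then renumber the index.
cleaned = df.dropna(how="all").reset_index(drop=True)

# Replace the remaining NaNs with empty strings for display.
cleaned = cleaned.fillna("")

print(cleaned)
```

By contrast, the commented-out attempt `df.apply(lambda x: pd.Series(x.dropna().values))` drops NaNs column by column, so each column shifts up independently and values end up in the wrong rows.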

You could do it that way; however, your data isn't accurate. You shouldn't be getting nulls in player position or team.
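
One likely cause of those unexpected nulls, judging from the loop in the question, is that empty cells are skipped with `continue`, so a row's list ends up shorter than the column list and the later values slide left. A minimal sketch of an alternative, assuming each table row has already been reduced to a mapping from `data-stat` name to cell text (the `parsed_rows` data and the `row_to_record` helper are hypothetical, not part of the original code): keep one slot per column and store `None` for a missing or empty cell.

```python
import pandas as pd

columns = ["player", "team_ID", "pos_summary"]

# Hypothetical stand-in for scraped rows: data-stat name -> cell text.
parsed_rows = [
    {},  # e.g. a header row that has no <td> cells
    {"player": "Brandon Woodruff", "team_ID": "", "pos_summary": "P"},
    {"player": "William Woods", "team_ID": "ATL", "pos_summary": ""},
    {"player": "Kyle Wright", "team_ID": "ATL", "pos_summary": "P"},
]


def row_to_record(cells, columns):
    """Return one value per column, using None for missing or empty cells."""
    return [cells.get(name) or None for name in columns]


records = [row_to_record(cells, columns) for cells in parsed_rows]
df = pd.DataFrame(records, columns=columns)

# Rows that are entirely empty (such as header rows) can now be dropped safely.
df = df.dropna(how="all").reset_index(drop=True)
print(df)
```

Because every record has the same length as `columns`, a blank cell stays in its own column instead of pulling its neighbours out of place.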

Secondly, if you need to parse <table> tags (and you don't need to pull out any attributes such as an href), let pandas parse that table for you. It uses an HTML parser such as lxml or BeautifulSoup under the hood.

import pandas as pd

url = r'https://www.baseball-reference.com/leagues/NL/2022-standard-fielding.shtml'
df = pd.read_html(url)[-1]
df = df[df['Rk'].ne('Rk')]   

Output:

print(df[['Name', 'Tm', 'Pos Summary']])
                 Name   Tm Pos Summary
0         C.J. Abrams  SDP    SS-2B-OF
1    Ronald Acuna Jr.  ATL          OF
2        Willy Adames  MIL          SS
3        Austin Adams  SDP           P
4         Riley Adams  WSN        C-1B
..                ...  ...         ...
509     Miguel Yajure  PIT           P
510  Mike Yastrzemski  SFG          OF
511  Christian Yelich  MIL          OF
512        Juan Yepez  STL          OF
513      Huascar Ynoa  ATL           P

[495 rows x 3 columns]
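
The `df['Rk'].ne('Rk')` filter above presumably exists because the scraped table repeats its header row partway through the data, which would also explain why the index runs to 513 while only 495 rows remain. A small sketch of that filter on a hand-made frame (the data below is invented for illustration and is not taken from the site):

```python
import pandas as pd

# Invented frame that mimics a table whose header row is repeated in the body.
raw = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Name": ["Player A", "Player B", "Name", "Player C"],
        "Tm": ["ATL", "MIL", "Tm", "SDP"],
    }
)

# Keep only rows whose 'Rk' value is not the literal header text 'Rk'.
filtered = raw[raw["Rk"].ne("Rk")]

print(filtered)
```

The original index labels are kept, so a gap appears where the header row was; `reset_index(drop=True)` would renumber them if needed.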
