
When adding a DataFrame column default value, how do I limit it to specific rows?

I am using a combination of BeautifulSoup and pandas to collect Sports Reference data by looping through box score pages, reading the DataFrames for each team, and concatenating them all together. The way the table is formatted on each page, a divider row separates the starters from the reserves. This divider row has the value "Reserves" in the 'Starters' column (which I later rename to 'Player_Name'), and its remaining cells repeat the column headers. When the data is read into the DataFrame, the divider comes in as a normal row. I would like to add a separate column that holds a Y/N value for whether or not that player started the game, and remove all records where the 'Starters' column equals "Reserves".

I have tried adding a column, but I'm struggling to find a way to default the values to "Y" for the first x rows and "N" for the remaining rows.

Here is a brief example of the table, followed by the code I am using. Let me know if you have any thoughts!

EDIT: I may have oversimplified this. There are actually two header rows, and this appears to cause an issue when trying the solutions presented. How can I remove the first header row, which just states 'Basic Box Score Stats' and 'Advanced Box Score Stats'?

Basic Box Score Stats            Advanced Box Score Stats
Starters              MP    FG   +/-  xyz%
Player1               20:00 17   5    12
Player2               15:00 8    4    10
Player3               10:00 9    3    8
Player4               9:00  3    2    6
Player5               8:00  1    1    4
Reserves              MP    FG   +/-  xyz%
Player4               7:00  1    1    2
Player5               4:00  1    1    2
Player6               3:30 1    1    2

import pandas as pd
from bs4 import BeautifulSoup

# performed steps in bs4 to get the links to individual boxscores
full_season_stats = pd.DataFrame()  # accumulator for every game
for boxscore_link in boxscore_links:
    basketball_ref_dfs = pd.read_html(MainURL + boxscore_link)
    if len(basketball_ref_dfs) == 4:  # comparison needs ==, not =
        away_team_stats = pd.concat([basketball_ref_dfs[0], basketball_ref_dfs[1]])
        home_team_stats = pd.concat([basketball_ref_dfs[2], basketball_ref_dfs[3]])
    else:
        away_team_stats = basketball_ref_dfs[0]
        home_team_stats = basketball_ref_dfs[1]
    # new code to be added here to fix 'reserve' row header for away/home_team_stats
    full_game_stats = pd.concat([away_team_stats, home_team_stats])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    full_season_stats = pd.concat([full_season_stats, full_game_stats], ignore_index=True)
full_season_stats

#what I want:
away_team_stats['Starter'] = 'Y'  # + some condition: only set 'Y' for the first x rows, or set 'Y' until the row value equals 'Reserves', then set the remaining rows to 'N'

You can do this in three steps:

  1. Set the default value 'N' for the entire column using away_team_stats['Starter'] = 'N'
  2. Set the value for the first x rows to 'Y' using the iloc method with away_team_stats.iloc[:x, pos] = 'Y', where pos is the integer position of the 'Starter' column. A newly assigned column is appended last, so pos can be looked up with away_team_stats.columns.get_loc('Starter') rather than hard-coded.
  3. Remove the row where 'Player_Name' == 'Reserves' by using the loc method with away_team_stats = away_team_stats.loc[away_team_stats['Player_Name'] != 'Reserves', :]
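The three steps above can be sketched on a small stand-in table. The sample rows, the reserve names, and the value of x are assumptions made for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample shaped like the question's table (single header level).
away_team_stats = pd.DataFrame({
    "Player_Name": ["Player1", "Player2", "Player3", "Player4", "Player5",
                    "Reserves", "Player6", "Player7", "Player8"],
    "MP": ["20:00", "15:00", "10:00", "9:00", "8:00",
           "MP", "7:00", "4:00", "3:30"],
})

x = 5  # number of starter rows; an assumption for this sample

# Step 1: default every row to 'N'.
away_team_stats["Starter"] = "N"

# Step 2: look up the new column's position instead of hard-coding it,
# then mark the first x rows by position.
starter_pos = away_team_stats.columns.get_loc("Starter")
away_team_stats.iloc[:x, starter_pos] = "Y"

# Step 3: drop the divider row.
away_team_stats = away_team_stats.loc[
    away_team_stats["Player_Name"] != "Reserves", :
]
```

Because iloc works by position, this version does not depend on the index labels, which may repeat after pd.concat.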

The iloc method slices your DataFrame by integer position, and the loc method slices it by index/column label.

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
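For the two-level header mentioned in the question's edit, one option is to drop the outer level before applying any of the steps. The sketch below assumes the columns arrive as a pandas MultiIndex; the frame is built by hand here, so the exact level names on the real pages may differ.

```python
import pandas as pd

# Hypothetical frame with a two-level header, built by hand for illustration.
columns = pd.MultiIndex.from_tuples([
    ("Basic Box Score Stats", "Starters"),
    ("Basic Box Score Stats", "MP"),
    ("Advanced Box Score Stats", "xyz%"),
])
away_team_stats = pd.DataFrame(
    [["Player1", "20:00", 12],
     ["Reserves", "MP", "xyz%"],
     ["Player6", "3:30", 2]],
    columns=columns,
)

# Drop the outer header level so each column has a single name.
away_team_stats.columns = away_team_stats.columns.droplevel(0)
```

After this, expressions such as away_team_stats['Starters'] address a single column again.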

If you already know the index where the 'Reserves' value appears, you can use it directly. Say it appears in the 10th record, which is index label 9 with the default integer index. I initially set everything to 'N', and then set the first 10 rows to 'Y'. This includes the divider row itself, which the question removes afterwards anyway.

away_team_stats['Starter'] = 'N'
away_team_stats.loc[:9, 'Starter'] = 'Y'
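One detail worth checking here is that a loc slice includes its end label, while an iloc slice excludes its end position. A minimal sketch on a hypothetical 12-row frame with the default integer index:

```python
import pandas as pd

# Hypothetical frame: 12 rows, default RangeIndex 0..11.
df = pd.DataFrame({"Starter": ["N"] * 12})

by_label = df.copy()
by_label.loc[:9, "Starter"] = "Y"    # labels 0 through 9, end label included

by_position = df.copy()
by_position.iloc[:9, 0] = "Y"        # positions 0 through 8, end excluded

count_label = int((by_label["Starter"] == "Y").sum())
count_position = int((by_position["Starter"] == "Y").sum())
```

Here count_label is 10 and count_position is 9, so the slice bound needs adjusting when switching between the two accessors.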


Or you can do:

idx = away_team_stats.loc[away_team_stats['Starters'] == 'Reserves'].index[0]

This gives you the index label at which 'Reserves' first appears.

You can now proceed as above:

away_team_stats.loc[:idx, 'Starter'] = 'Y'
away_team_stats.loc[idx+1:, 'Starter'] = 'N'

This sets the rows up to and including the first 'Reserves' row to 'Y', and the remaining rows to 'N'.
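The label-based slices above assume a unique integer index. After pd.concat without ignore_index, labels can repeat. A position-independent alternative, which is a different technique from the one in this answer, is to flag rows with a cumulative sum over the divider mask. The sample data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; 'Starters' is the name column from the question.
away_team_stats = pd.DataFrame({
    "Starters": ["Player1", "Player2", "Reserves", "Player6", "Player7"],
    "MP": ["20:00", "15:00", "MP", "7:00", "4:00"],
})

# True only on the divider row.
is_divider = away_team_stats["Starters"].eq("Reserves")

# cumsum() stays 0 before the first divider and is >= 1 from the divider on.
away_team_stats["Starter"] = np.where(is_divider.cumsum().eq(0), "Y", "N")

# Drop the divider row and renumber the remaining rows.
away_team_stats = away_team_stats.loc[~is_divider].reset_index(drop=True)
```

This needs no row count and no index lookup, so the same lines could be applied to both away_team_stats and home_team_stats inside the loop.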

