How do I restrict a DataFrame column's default value to specific rows?

Question

I am using a combination of BeautifulSoup and pandas to collect Sports Reference data: I loop through the boxscore pages, read the dataframes for each team, and concatenate them all together. The way the table is formatted on each page, a row divider separates the starters from the reserves. This divider row has the value "Reserves" in the 'Starters' column (which I later rename to 'Player_Name'), with the column headers repeated for the rest of its values. When this data is read into the dataframe, the row divider is brought in as a normal row. I would like to add a separate column that holds a Y/N value for whether or not that player started the game, and remove all records where the 'Starters' column is equal to "Reserves".

I have tried adding a column, but I'm struggling to find a way to make the default value "Y" for the first x rows and "N" for the remaining rows.

Here is a brief example of the table, followed by the code I am using. Let me know if you have any thoughts!

EDIT: I may have oversimplified this. There are actually two header rows, and this appears to cause an issue when trying the solutions presented. How can I remove the first header row, which just states 'Basic Box Score Stats' and 'Advanced Box Score Stats'?

Basic Box Score Stats            Advanced Box Score Stats
Starters              MP    FG   +/-  xyz%
Player1               20:00 17   5    12
Player2               15:00 8    4    10
Player3               10:00 9    3    8
Player4               9:00  3    2    6
Player5               8:00  1    1    4
Reserves              MP    FG   +/-  xyz%
Player4               7:00  1    1    2
Player5               4:00  1    1    2
Player6               3:30  1    1    2

import pandas as pd
from bs4 import BeautifulSoup

# performed steps in bs4 to get the links to individual boxscores
# (boxscore_links and MainURL come from those steps)
full_season_stats = pd.DataFrame()  # assumed starting point: an empty frame to accumulate into
for boxscore_link in boxscore_links:
    basketball_ref_dfs = pd.read_html(MainURL + boxscore_link)
    if len(basketball_ref_dfs) == 4:
        away_team_stats = pd.concat([basketball_ref_dfs[0], basketball_ref_dfs[1]])
        home_team_stats = pd.concat([basketball_ref_dfs[2], basketball_ref_dfs[3]])
    else:
        away_team_stats = basketball_ref_dfs[0]
        home_team_stats = basketball_ref_dfs[1]
    # new code to be added here to fix 'reserve' row header for away/home_team_stats
    full_game_stats = pd.concat([away_team_stats, home_team_stats])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    full_season_stats = pd.concat([full_season_stats, full_game_stats], ignore_index=True)
full_season_stats

#what I want:
away_team_stats['Starter']='Y' # + some condition to only set this value for the first x occurrences or set to 'Y' until row value equals Reserve, then set remaining to 'N'

Answer 1

You can do this in three steps:

1. Set the default value 'N' for the entire column using away_team_stats['Starter']='N'.
2. Set the value for the first x rows to 'Y' using the iloc method with away_team_stats.iloc[:x, 2]='Y'. (I believe the 'Starter' column will be in position 2 if it is appended to your example data, but you may need to edit this.)
3. Remove the row with 'Player_Name' == 'Reserves' by using the loc method with away_team_stats = away_team_stats.loc[away_team_stats['Player_Name']!='Reserves', :].
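The three steps can be sketched end to end on a toy table. This is only an illustration under assumptions: the sample data and x = 5 are made up, and the column position is looked up with columns.get_loc rather than hard-coded to 2.

```python
import pandas as pd

# Toy stand-in for one team's table; column names follow the question.
away_team_stats = pd.DataFrame({
    "Player_Name": ["Player1", "Player2", "Player3", "Player4", "Player5",
                    "Reserves", "Player6", "Player7"],
    "MP": ["20:00", "15:00", "10:00", "9:00", "8:00", "MP", "7:00", "4:00"],
})

x = 5  # number of starters (assumed to be known here)

# Step 1: default every row to 'N'.
away_team_stats["Starter"] = "N"

# Step 2: flip the first x rows to 'Y'. The column position is looked up
# so the code does not depend on where 'Starter' ends up.
starter_pos = away_team_stats.columns.get_loc("Starter")
away_team_stats.iloc[:x, starter_pos] = "Y"

# Step 3: drop the divider row.
away_team_stats = away_team_stats.loc[
    away_team_stats["Player_Name"] != "Reserves", :
]

print(away_team_stats)
```

The divider row is tagged 'N' in step 2 and then removed in step 3, so it never affects the result.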

The iloc method slices your dataframe by integer position (row/column number), and the loc method slices it by index/column label.
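A small sketch of the difference, using a made-up frame whose index labels do not start at 0. One detail worth seeing in action: an iloc slice excludes its end position, while a loc slice includes its end label.

```python
import pandas as pd

# Hypothetical frame with labels 10..13 so positions and labels differ.
df = pd.DataFrame({"Starter": ["N"] * 4}, index=[10, 11, 12, 13])

by_position = df.iloc[:2]  # positions 0 and 1; the end position is excluded
by_label = df.loc[:12]     # labels 10, 11 and 12; the end label is included

print(len(by_position))
print(len(by_label))
```

On a default 0-based index the two look alike, which is why off-by-one mistakes are easy to make when switching between them.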

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
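The question's edit mentions a two-level header, which would make lookups such as away_team_stats['Player_Name'] fail. One possible way to flatten it is to drop the outer level of the column MultiIndex. The frame below is a hypothetical imitation of what read_html returns for such a table, not real scraped data.

```python
import pandas as pd

# Hypothetical frame mimicking a table with two header rows.
columns = pd.MultiIndex.from_tuples([
    ("Basic Box Score Stats", "Starters"),
    ("Basic Box Score Stats", "MP"),
    ("Advanced Box Score Stats", "xyz%"),
])
away_team_stats = pd.DataFrame(
    [["Player1", "20:00", 12],
     ["Reserves", "MP", "xyz%"],
     ["Player6", "3:30", 2]],
    columns=columns,
)

# Drop the outer header level, keeping only 'Starters', 'MP', 'xyz%'.
away_team_stats.columns = away_team_stats.columns.droplevel(0)

# Rename as described in the question.
away_team_stats = away_team_stats.rename(columns={"Starters": "Player_Name"})

print(list(away_team_stats.columns))
```

After this, the single-level column names can be used with the steps above. If the inner level contains duplicate names across the basic and advanced sections, they would need to be disambiguated separately.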

Answer 2

If you already know the index at which the 'Reserves' value appears, say it appears in the 10th record, you can initially set everything to 'N' and then turn the first 10 rows to 'Y':

away_team_stats['Starter'] = 'N'
away_team_stats.loc[:9, 'Starter'] = 'Y'

Or you can do:

idx = away_team_stats.loc[away_team_stats['Player_Name'] == 'Reserves'].index[0]

This gives you the index at which 'Reserves' appears for the first time.

You can now do it as above:

away_team_stats.loc[:idx, 'Starter'] = 'Y'
away_team_stats.loc[idx+1:, 'Starter'] = 'N'

This sets the rows up to the first appearance of 'Reserves' to 'Y', and then sets the remaining rows to 'N'.
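Label-based slicing with loc relies on the index labels being unique and ordered, which may not hold after pd.concat without ignore_index=True. As an alternative that is not part of this answer, a boolean mask built from a cumulative sum avoids index labels altogether. The sample data is made up for illustration.

```python
import pandas as pd

# Toy table; 'Reserves' marks the divider row as in the question.
away_team_stats = pd.DataFrame({
    "Player_Name": ["Player1", "Player2", "Reserves", "Player6", "Player7"],
    "MP": ["20:00", "15:00", "MP", "7:00", "4:00"],
})

# True only on the divider row.
is_divider = away_team_stats["Player_Name"].eq("Reserves")

# cumsum() is 0 before the first divider and at least 1 from it onward.
before_divider = is_divider.cumsum().eq(0)

away_team_stats["Starter"] = before_divider.map({True: "Y", False: "N"})

# Drop the divider row and renumber the remaining rows.
away_team_stats = away_team_stats[~is_divider].reset_index(drop=True)

print(away_team_stats)
```

Because the mask is positional in effect, it behaves the same whether or not the index was reset beforehand.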

Answer 1
0, accepted, 2019-07-17 20:25:08

Answer 2
0, 2019-07-17 20:42:44
