
Adding series to Pandas dataframe yields column of NaN

I am working with this dataset (some columns and hundreds of rows omitted for brevity).

    Year    Ceremony    Award          Winner   Name    
0   1927/1928   1       Best Actress    0.0     Louise Dresser  
1   1927/1928   1       Best Actress    1.0     Janet Gaynor
2   1937        10      Best Actress    0.0     Janet Gaynor
3   1927/1928   1       Best Actress    0.0     Gloria Swanson  
4   1929/1930   3       Best Actress    0.0     Gloria Swanson
5   1950        23      Best Actress    0.0     Gloria Swanson  

I used the following command:

ba_dob.loc[ba_dob.Winner == 0.0, :].groupby('Name').Winner.count()

to create the following Series:

Name
Ali MacGraw                1
Amy Adams                  1
Angela Bassett             1
Angelina Jolie             1
Anjelica Huston            1
Ann Harding                1
Ann-Margret                1
Anna Magnani               1
Anne Bancroft              4
Anne Baxter                1
Anne Hathaway              1
Annette Bening             3
Audrey Hepburn             4

I tried adding the Series to the original dataframe, like this:

ba_dob['New_Col'] = ba_dob.loc[ba_dob.Winner == 0.0, :].groupby('Name').Winner.count()

I got a column of NaN values.

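I have read other posts suggesting that some index misalignment may be at work, but I'm not sure how that would happen. More specifically, why can't Pandas line up the indexes, given that the groupby and count come from the same table? Is something else going on?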

I think you need size, not count, because count excludes NaN values.

Then map the Name column with the Series created by the groupby:

m = ba_dob.Winner == 0.0
ba_dob['new'] = ba_dob['Name'].map(ba_dob[m].groupby('Name').Winner.size())
print (ba_dob)
        Year  Ceremony         Award  Winner            Name  new
0  1927/1928         1  Best Actress     0.0  Louise Dresser    1
1  1927/1928         1  Best Actress     1.0    Janet Gaynor    1
2       1937        10  Best Actress     0.0    Janet Gaynor    1
3  1927/1928         1  Best Actress     0.0  Gloria Swanson    3
4  1929/1930         3  Best Actress     0.0  Gloria Swanson    3
5       1950        23  Best Actress     0.0  Gloria Swanson    3
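The difference between size and count mentioned above only shows up when the counted column contains NaN. The following is a minimal sketch on a made-up three-row frame (the names "A" and "B" and the NaN entry are assumptions for illustration, not part of the question's data). In the question itself the rows are first filtered with Winner == 0.0, so no NaN remains in Winner and both methods should agree there.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one Winner value for "A" is missing.
toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Winner": [0.0, np.nan, 0.0],
})

grouped = toy.groupby("Name").Winner

sizes = grouped.size()    # counts rows per group, NaN included
counts = grouped.count()  # counts only non-NaN values per group

print(sizes)
print(counts)
```

Here "A" has two rows but only one non-missing Winner value, so size and count disagree for that group.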

Another solution:

ba_dob['new'] = ba_dob['Name'].map(ba_dob.loc[m, 'Name'].value_counts())

You can join the result onto the initial dataframe:

New_col = df.loc[df.Winner == 0.0, :].groupby('Name').Winner.count().rename('New_col')
df = df.join(New_col, on='Name')

Output:

    Award           Ceremony    Name            Winner  Year New_col
0   Best Actress    1927/1928   Louise Dresser  0.0     0    1
1   Best Actress    1927/1928   Janet Gaynor    1.0     1    1
2   Best Actress    1937        Janet Gaynor    0.0     2    1
3   Best Actress    1927/1928   Gloria Swanson  0.0     3    3
4   Best Actress    1929/1930   Gloria Swanson  0.0     4    3
5   Best Actress    1950        Gloria Swanson  0.0     5    3
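For readers who want to run the join approach end to end, here is a self-contained sketch. It rebuilds only the six rows shown in the question (the full dataset is not available here), and it uses the question's name ba_dob instead of df; the variable names losses and joined are assumptions for illustration.

```python
import pandas as pd

# The six sample rows from the question.
ba_dob = pd.DataFrame({
    "Year": ["1927/1928", "1927/1928", "1937",
             "1927/1928", "1929/1930", "1950"],
    "Ceremony": [1, 1, 10, 1, 3, 23],
    "Award": ["Best Actress"] * 6,
    "Winner": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Name": ["Louise Dresser", "Janet Gaynor", "Janet Gaynor",
             "Gloria Swanson", "Gloria Swanson", "Gloria Swanson"],
})

# Per-name count of rows with Winner == 0.0; the Series is indexed by Name
# and must be named so that join can use it as a column.
losses = (
    ba_dob.loc[ba_dob.Winner == 0.0]
    .groupby("Name")
    .Winner.count()
    .rename("New_col")
)

# on="Name" matches the Name column of ba_dob against the index of losses.
joined = ba_dob.join(losses, on="Name")
print(joined)
```

The key point is the on="Name" argument: without it, join would try to match the Series index against the dataframe's integer index.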

You can also use map:

mapper = ba_dob.loc[ba_dob.Winner == 0.0, :].groupby('Name').Winner.count()
ba_dob['New_Col'] = ba_dob['Name'].map(mapper)

You get:

    Year        Ceremony    Award       Winner  Name            New_Col
0   1927/1928   1           BestActress 0.0     Louise Dresser  1
1   1927/1928   1           BestActress 1.0     Janet Gaynor    1
2   1937        10          BestActress 0.0     Janet Gaynor    1
3   1927/1928   1           BestActress 0.0     Gloria Swanson  3
4   1929/1930   3           BestActress 0.0     Gloria Swanson  3
5   1950        23          BestActress 0.0     Gloria Swanson  3
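One thing to watch with the map approach: a name that has no Winner == 0.0 rows is absent from the mapper, so map returns NaN for it. A minimal sketch, assuming you would rather see 0 in that case; the frame below and the name "Hypothetical Winner" are invented for illustration.

```python
import pandas as pd

# Hypothetical data: the last name only ever appears with Winner == 1.0.
df = pd.DataFrame({
    "Name": ["Janet Gaynor", "Gloria Swanson", "Gloria Swanson",
             "Hypothetical Winner"],
    "Winner": [0.0, 0.0, 0.0, 1.0],
})

mapper = df.loc[df.Winner == 0.0].groupby("Name").Winner.count()

# Names missing from mapper's index come back as NaN.
raw = df["Name"].map(mapper)

# Treat "no matching rows" as a count of zero and restore an integer dtype.
filled = raw.fillna(0).astype(int)

print(raw)
print(filled)
```

Whether 0 or NaN is the right value depends on how the column is used afterwards, so the fillna step is a design choice rather than a requirement.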

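You need to use reset_index(), which drops the Name index and produces a DataFrame with two fields, Name and the count. The count column keeps the name Winner unless you rename it, and the result has a fresh integer index, so merge it back on Name rather than assigning it directly. Something like:

 counts = ba_dob.loc[ba_dob.Winner == 0.0, :].groupby('Name').Winner.count().reset_index(name='New_Col')
 ba_dob = ba_dob.merge(counts, on='Name', how='left')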

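Your groupby does not cover the whole DataFrame, only the rows where Winner == 0, so the rows it does not cover get NaN. In addition, the resulting Series is indexed by Name rather than by the DataFrame's row labels, so a direct assignment finds no matching index labels at all.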
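To make the index issue concrete, the sketch below reproduces the all-NaN column on the six sample rows from the question. Assigning a Series to a dataframe column aligns on index labels; the variable name direct is an assumption for illustration.

```python
import pandas as pd

# The six sample rows from the question (only the columns that matter here).
ba_dob = pd.DataFrame({
    "Winner": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Name": ["Louise Dresser", "Janet Gaynor", "Janet Gaynor",
             "Gloria Swanson", "Gloria Swanson", "Gloria Swanson"],
})

counts = ba_dob.loc[ba_dob.Winner == 0.0].groupby("Name").Winner.count()

print(counts.index)   # labels are actress names
print(ba_dob.index)   # labels are the integers 0..5

# Column assignment aligns counts to ba_dob's index by label. None of the
# integer labels 0..5 exist in counts' index, so every row becomes NaN.
direct = ba_dob.copy()
direct["New_Col"] = counts
print(direct)
```

So the groupby result and the dataframe do come from the same table, but they no longer share the same index labels, which is what alignment relies on.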

