
python: join group size to member rows in dataframe

(Python 2.7) I want to add a column to a pandas dataframe holding the size of the group to which each row belongs (rows are indexed by ID number). Groups consist of rows with identical values in two columns, date and amount. I've tried groupby and size, which is suggested for similar problems, but I can't get the resulting sizes back into the source dataframe because of indexing problems. Should I use a dictionary to read all unique value pairings instead, and what would that look like? Or should I learn how to merge the groupby result into the original dataframe with a join operation? Note: this is a large dataset.

Sample data:

                    date    amount  address
    ID          
    176820  1/4/2008 0:00   400     13496 ST LOUIS
    176821  1/4/2008 0:00   500     13475 NEWBERN
    176822  1/4/2008 0:00   2000    8011 DAYTON
    176823  1/4/2008 0:00   4000    13406 LONGVIEW
    176824  1/4/2008 0:00   7000    19174 ARCHDALE

Here's what I thought might work:

    df['group_size'] = df.groupby(['date','amount']).size()

But I received this: TypeError: incompatible index of inserted column with frame index
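A likely cause of this error is that size() returns a Series indexed by the (date, amount) pairs rather than by ID, so pandas cannot align it with the frame's index. Below is a minimal sketch of the join approach mentioned in the question, assuming a recent pandas version. The frame is a cut-down copy of the sample data, and row 176825 (with its address) is made up so that one group has more than one member.

```python
import pandas as pd

# Cut-down sample frame; row 176825 is hypothetical, added so that
# the (date, amount=400) group contains two rows.
df = pd.DataFrame(
    {
        "date": ["1/4/2008 0:00"] * 6,
        "amount": [400, 500, 2000, 4000, 7000, 400],
        "address": [
            "13496 ST LOUIS",
            "13475 NEWBERN",
            "8011 DAYTON",
            "13406 LONGVIEW",
            "19174 ARCHDALE",
            "100 EXAMPLE ST",
        ],
    },
    index=pd.Index(range(176820, 176826), name="ID"),
)

# size() yields one value per (date, amount) pair, indexed by that pair.
sizes = df.groupby(["date", "amount"]).size().rename("group_size")

# Left-join the sizes onto the key columns; the ID index of df is kept.
joined = df.join(sizes, on=["date", "amount"])
print(joined)
```

The join matches the date and amount columns of df against the two-level index of sizes, so each row picks up the size of its own group.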

UPDATE: elyase's solution works for the original sample data I posted. My source dataframe actually has 13 columns, not 3, and elyase's solution stops working as soon as even one additional column is added to the sample frame.

                     date  amount         address    tract
    ID                                                    
    176820  1/4/2008 0:00     400  13496 ST LOUIS   510200
    176821  1/4/2008 0:00     500   13475 NEWBERN   510400
    176822  1/4/2008 0:00    2000     8011 DAYTON   526200
    176823  1/4/2008 0:00    4000  13406 LONGVIEW   504200
    176824  1/4/2008 0:00    7000  19174 ARCHDALE   540200

I get the error: Wrong number of items passed 1, indices imply 2

Have you tried:

    df.groupby(['date','amount']).transform('count')

To get the group count, I needed to count over any OTHER column in my group, i.e. one that is not a grouping key. The only issue here is that where the amount column is null, the result holds the tract value instead of a count, but this is easily dealt with.

    df['group_size'] = df.groupby(['date','amount'])['tract'].transform('count') 
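Why a single column has to be selected can be sketched as follows: called on the whole grouped frame, transform('count') returns one column per non-key column, which cannot be assigned to a single new column. Note also that count skips nulls in the counted column, whereas transform('size') counts every row in the group. The sketch assumes a recent pandas version; row 176825, its address, and its missing tract are made up for illustration, and tract_count is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Sample frame with the extra tract column; row 176825 is hypothetical
# and has a missing tract so that count and size differ.
df = pd.DataFrame(
    {
        "date": ["1/4/2008 0:00"] * 6,
        "amount": [400, 500, 2000, 4000, 7000, 400],
        "address": [
            "13496 ST LOUIS",
            "13475 NEWBERN",
            "8011 DAYTON",
            "13406 LONGVIEW",
            "19174 ARCHDALE",
            "100 EXAMPLE ST",
        ],
        "tract": [510200, 510400, 526200, 504200, 540200, np.nan],
    },
    index=pd.Index(range(176820, 176826), name="ID"),
)

grouped = df.groupby(["date", "amount"])

# Frame-wide transform: one result column per non-key column.
wide = grouped.transform("count")
print(list(wide.columns))

# Selecting one column gives a Series aligned to df's ID index.
tract_count = grouped["tract"].transform("count")  # ignores null tracts
group_size = grouped["tract"].transform("size")    # counts every row

df["tract_count"] = tract_count
df["group_size"] = group_size
print(df[["amount", "tract", "tract_count", "group_size"]])
```

If the column being counted can contain nulls, size is probably the safer choice for a true group size.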
