
Pandas split column and aggregate result with duplicates in index

I have the following dataframe:

ID     Type      Value
1        A         311
1        A         223
1        B        1233
2        A         424
2        A         553
3        A          11
3        B           4
3        B           5
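
For anyone who wants to reproduce this, the frame above can be built roughly as follows. This is a minimal sketch: the variable name df matches the code used later in this thread, and integer dtypes are an assumption.

```python
import pandas as pd

# Rebuild the example frame shown above (dtypes are assumed to be integers).
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Type": ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})

print(df)
```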

I am trying to aggregate the "ID" column by splitting the "Type" column, so that each ID has its own row with separate columns for Type A and Type B. Columns "A" and "B" should hold the first occurrence of the respective value for that ID. If either A or B (or both) is missing, I want to assign NaN. To make the idea clear, the following example shows the result I am looking for:

   ID       A           B
    1      311        1233
    2      424         NaN
    3       11           4

The result keeps the first value that appeared for A and ignores the second one (223). Since there is no second value for B in ID 1, it simply keeps 1233. The same logic applies to the other IDs.

I've been trying to solve this with .pivot:

df.pivot(columns="Type",values="Value")

which helps me separate the Type column, so that I get:

Type      A        B
  0      311      NaN
  1      223      NaN
  2      NaN     1233
  3      11         4

However, I am not able to pass the ID column as the index, because it raises this error:

ValueError: Index contains duplicate entries, cannot reshape
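
A minimal sketch that reproduces the error, assuming the frame is built as below and ID is passed as the index. The names error_message and dupes are only illustrative. The second part lists the rows whose (ID, Type) pair occurs more than once, which is what pivot cannot place into a single cell.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Type": ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})

error_message = None
try:
    # Each (ID, Type) pair must be unique for pivot to work.
    df.pivot(index="ID", columns="Type", values="Value")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Show every row that belongs to a repeated (ID, Type) pair.
dupes = df[df.duplicated(["ID", "Type"], keep=False)]
print(dupes)
```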

Using drop_duplicates on the ID column, however, results in data loss. Is there a handy way of doing such an operation in pandas?

You need to drop the duplicate (ID, Type) pairs before you pivot.

df.drop_duplicates(['ID', 'Type']).pivot(index='ID', columns='Type', values='Value')

Type      A       B
ID                 
1     311.0  1233.0
2     424.0     NaN
3      11.0     4.0

Or, use pivot_table with aggfunc='first':

df.pivot_table(index='ID', columns='Type', values='Value', aggfunc='first')

Type      A       B
ID                 
1     311.0  1233.0
2     424.0     NaN
3      11.0     4.0
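
To see why aggfunc='first' matters here, the sketch below compares it with pivot_table's default aggregation, which is the mean. It assumes the example frame from the question; the names default_mean and first_value are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Type": ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})

# Without aggfunc, pivot_table averages the duplicates in each cell.
default_mean = df.pivot_table(index="ID", columns="Type", values="Value")

# With aggfunc="first", each cell keeps the first value of its group.
first_value = df.pivot_table(
    index="ID", columns="Type", values="Value", aggfunc="first"
)

print(default_mean)
print(first_value)
```

For ID 1 and Type A, the default gives the average of 311 and 223, whereas 'first' keeps 311 as requested in the question.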

Performance
This depends on your data and the number of groups. It is best to test on your own data.

df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)

%timeit df.pivot_table(index='ID', columns='Type', values='Value', aggfunc='first')
%timeit df.drop_duplicates(['ID', 'Type']).pivot(index='ID', columns='Type', values='Value')
%timeit df.groupby(['ID', 'Type']).Value.first().unstack(1)

15.2 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.63 ms ± 98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.34 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
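
Outside IPython, where %timeit is not available, a comparable measurement could be sketched with the standard timeit module. The names small, big, candidates and timings are illustrative, and the repeat counts are arbitrary; the numbers you get will differ from those above depending on machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Type": ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})
# Same enlargement as in the benchmark above: 10000 copies of the frame.
big = pd.concat([small] * 10000, ignore_index=True)

candidates = {
    "pivot_table": lambda: big.pivot_table(
        index="ID", columns="Type", values="Value", aggfunc="first"
    ),
    "drop_duplicates + pivot": lambda: big.drop_duplicates(["ID", "Type"]).pivot(
        index="ID", columns="Type", values="Value"
    ),
    "groupby + unstack": lambda: big.groupby(["ID", "Type"])["Value"]
    .first()
    .unstack(),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 10 calls per repeat, reported per call.
    timings[name] = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{name}: {timings[name] * 1e3:.2f} ms per call")
```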

Using groupby with first:

df.groupby(['ID','Type']).Value.first().unstack()
Type      A       B
ID                 
1     311.0  1233.0
2     424.0     NaN
3      11.0     4.0

Or using groupby head with pivot:

df.groupby(['ID','Type'],as_index=False).head(1).pivot(index='ID', columns='Type', values='Value')
Type      A       B
ID                 
1     311.0  1233.0
2     424.0     NaN
3      11.0     4.0
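
As a sanity check, the sketch below confirms that the approaches from this thread give the same table on the example data. It also shows one difference worth knowing: by default groupby(...).first() returns the first non-null value in each group, while drop_duplicates and head(1) keep the first row even if its value is NaN. The frame with_gap is a made-up example for that second point.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Type": ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})

keys = ["ID", "Type"]
wide = dict(index="ID", columns="Type", values="Value")

results = [
    df.drop_duplicates(keys).pivot(**wide),
    df.pivot_table(aggfunc="first", **wide),
    df.groupby(keys)["Value"].first().unstack(),
    df.groupby(keys).head(1).pivot(**wide),
]

# All four approaches should agree on this data (dtype differences ignored).
for other in results[1:]:
    pd.testing.assert_frame_equal(results[0], other, check_dtype=False)

# Hypothetical frame where the first value of a group is missing.
with_gap = pd.DataFrame({"ID": [1, 1], "Type": ["A", "A"], "Value": [np.nan, 223]})

via_first = with_gap.groupby(keys)["Value"].first().unstack()  # skips the NaN
via_drop = with_gap.drop_duplicates(keys).pivot(**wide)        # keeps the NaN row

print(via_first)
print(via_drop)
```

If your Value column can contain NaN, pick the variant whose treatment of missing first values matches what you want.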
