

Pandas split column and aggregate result with duplicates in index

I have the following dataframe:

ID     Type      Value
1        A         311
1        A         223
1        B        1233
2        A         424
2        A         553
3        A          11
3        B           4
3        B           5

I am trying to aggregate the "ID" column by splitting the column "Type", such that each ID has its own row and respective columns for Type A and Type B. In the columns "A" and "B" I want to assign the first occurrence of each respective value across the rows. If either A or B (or both) are missing, I want to assign NaN. To make this idea clear, the following example depicts the result I am looking for:

   ID       A           B
    1      311        1233
    2      424         NaN
    3       11           4

The result keeps the first value that appeared for A (while ignoring the second value for A, 223). Since there is no second value for B in ID 1, it just keeps the value 1233. This logic continues for the other IDs.

I've been trying to solve this using .pivot, which helps me separate the Type column, such that I get:

Type      A        B
  0      311      NaN
  1      223      NaN
  2      NaN     1233
  3      11         4

However, I am not able to pass the ID column as the index, as it gives me the error:

ValueError: Index contains duplicate entries, cannot reshape
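
For readers who want to reproduce the error, here is a minimal sketch. The DataFrame is rebuilt from the table in the question, and the keyword form of pivot is an assumption about what the missing call looked like. It also lists the (ID, Type) pairs that occur more than once, which are the rows pivot cannot place into a single cell.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Type": ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})

# Rows whose (ID, Type) pair appears more than once.
dup_mask = df.duplicated(["ID", "Type"], keep=False)
print(df[dup_mask])

# pivot needs each (index, columns) pair to be unique, so this raises.
error_message = None
try:
    df.pivot(index="ID", columns="Type", values="Value")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```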

Using drop_duplicates on the ID column, however, results in data loss. Is there any handy way of doing such an operation in pandas?

You need to drop duplicates before you pivot.

df.drop_duplicates(['ID', 'Type']).pivot(index='ID', columns='Type', values='Value')

Type      A       B
1     311.0  1233.0
2     424.0     NaN
3      11.0     4.0

Or, use pivot_table with aggfunc='first':

df.pivot_table(index='ID', columns='Type', values='Value', aggfunc='first')

Type      A       B
1     311.0  1233.0
2     424.0     NaN
3      11.0     4.0
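
As a self-contained sketch of the pivot_table approach, the snippet below rebuilds the sample data and then flattens the result into the layout shown in the question (ID as an ordinary column, no "Type" label on the column axis). The names wide and flat are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Type": ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})

# One row per ID, one column per Type, keeping the first value of each pair.
wide = df.pivot_table(index="ID", columns="Type", values="Value", aggfunc="first")

# Turn the ID index back into a column and drop the "Type" axis label.
flat = wide.reset_index()
flat.columns.name = None
print(flat)
```

One caveat, as far as I know: 'first' here means the first non-null value in each group, whereas drop_duplicates keeps the first row even if its Value is NaN. On the sample data the two approaches agree.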

Performance
This actually depends on your data and the number of groups. It is best to test it on your own data.

df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)

%timeit df.pivot_table(index='ID', columns='Type', values='Value', aggfunc='first')
%timeit df.drop_duplicates(['ID', 'Type']).pivot(index='ID', columns='Type', values='Value')
%timeit df.groupby(['ID', 'Type']).Value.first().unstack(1)

15.2 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.63 ms ± 98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.34 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
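
Outside IPython, where %timeit is not available, a comparison along the same lines could be sketched with the standard timeit module. The names small, big, candidates and timings are illustrative, and the repeat counts are arbitrary choices; the numbers will differ from those above depending on machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data rebuilt from the question, repeated 10000 times as in the answer.
small = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Type": ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})
big = pd.concat([small] * 10000, ignore_index=True)

candidates = {
    "pivot_table": lambda: big.pivot_table(
        index="ID", columns="Type", values="Value", aggfunc="first"),
    "drop_duplicates + pivot": lambda: big.drop_duplicates(["ID", "Type"]).pivot(
        index="ID", columns="Type", values="Value"),
    "groupby + first + unstack": lambda: big.groupby(
        ["ID", "Type"])["Value"].first().unstack(),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 5 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=5, repeat=3))
    timings[name] = best / 5 * 1000
    print(f"{name}: {timings[name]:.2f} ms per call")
```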

Using groupby first

Type      A       B
1     311.0  1233.0
2     424.0     NaN
3      11.0     4.0
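
The code for this approach does not appear on the page. A sketch that is consistent with the groupby line timed in the Performance section, though not necessarily the answerer's exact snippet, would be:

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Type": ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})

# First Value for every (ID, Type) pair, as a Series with a two-level index.
first_values = df.groupby(["ID", "Type"])["Value"].first()

# Move the Type level into columns; missing pairs become NaN.
wide = first_values.unstack("Type")
print(wide)
```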

Or using groupby head with pivot

df.groupby(['ID', 'Type'], as_index=False).head(1).pivot(index='ID', columns='Type', values='Value')
Type      A       B
1     311.0  1233.0
2     424.0     NaN
3      11.0     4.0
