Pandas split column and aggregate result with duplicates in index
I have the following dataframe:
ID Type Value
1 A 311
1 A 223
1 B 1233
2 A 424
2 A 553
3 A 11
3 B 4
3 B 5
I am trying to aggregate by the "ID" column while splitting the "Type" column, so that each ID gets its own row with separate columns for Type A and Type B. Columns "A" and "B" should hold the first occurrence of the respective value for that ID. If either A or B (or both) is missing, I want to assign NaN. To make this clear, the following example shows the result I am looking for:
ID A B
1 311 1233
2 424 NaN
3 11 4
The result keeps the first value that appeared for A and ignores the second one (223). Since ID 1 has only one value for B, that value (1233) is kept. The same logic applies to the other IDs.
I've been trying to solve this using .pivot:
df.pivot(columns="Type",values="Value")
which helps me to separate the Type column, so that I get:
Type A B
0 311 NaN
1 223 NaN
2 NaN 1233
3 11 4
However, I am not able to pass the ID column as the index, because that raises the error:
ValueError: Index contains duplicate entries, cannot reshape
Using drop_duplicates on the ID column alone, however, results in data loss. Is there any handy way of doing such an operation in pandas?
You need to drop duplicates on both ID and Type before you pivot.
df.drop_duplicates(['ID', 'Type']).pivot(index='ID', columns='Type', values='Value')
Type A B
ID
1 311.0 1233.0
2 424.0 NaN
3 11.0 4.0
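For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the drop-then-pivot approach. The names df and wide are only illustrative, and pivot is called with keyword arguments because recent pandas releases no longer accept them positionally.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 3, 3, 3],
    "Type":  ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})

# Keep only the first row for every (ID, Type) pair, then reshape so that
# each ID becomes a row and each Type becomes a column.
wide = (
    df.drop_duplicates(["ID", "Type"])
      .pivot(index="ID", columns="Type", values="Value")
)
print(wide)
```

Deduplicating on both columns is what makes the (ID, Type) pairs unique, which is the condition pivot needs in order to reshape without the ValueError.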
Or, use pivot_table with aggfunc='first':
df.pivot_table(index='ID', columns='Type', values='Value', aggfunc='first')
Type A B
ID
1 311.0 1233.0
2 424.0 NaN
3 11.0 4.0
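If you want the flat "ID A B" layout shown in the question rather than an ID index with a "Type" label on the columns, something like the following should work. This is just a sketch on the sample data; out and flat are made-up names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 3, 3, 3],
    "Type":  ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})

out = df.pivot_table(index="ID", columns="Type", values="Value", aggfunc="first")

# Turn the ID index back into a regular column and clear the "Type" label
# that pivot_table leaves on the column axis.
flat = out.reset_index()
flat.columns.name = None
print(flat)
```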
Performance
Which approach is fastest depends on your data and on the number of groups, so it is best to test on your own data. The three timings that follow are listed in the same order as the three %timeit calls.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.pivot_table(index='ID', columns='Type', values='Value', aggfunc='first')
%timeit df.drop_duplicates(['ID', 'Type']).pivot(index='ID', columns='Type', values='Value')
%timeit df.groupby(['ID', 'Type']).Value.first().unstack(1)
15.2 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.63 ms ± 98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.34 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
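The %timeit magic only works inside IPython or Jupyter. Outside of those, a rough equivalent can be put together with the standard timeit module, as in the sketch below. The names small, big and candidates are placeholders, and the numbers you get will differ from the ones above depending on machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data from the question, repeated to get a larger frame.
small = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 3, 3, 3],
    "Type":  ["A", "A", "B", "A", "A", "A", "B", "B"],
    "Value": [311, 223, 1233, 424, 553, 11, 4, 5],
})
big = pd.concat([small] * 10000, ignore_index=True)

# The three approaches being compared, keyed by a descriptive label.
candidates = {
    "pivot_table": lambda d: d.pivot_table(
        index="ID", columns="Type", values="Value", aggfunc="first"),
    "drop_duplicates + pivot": lambda d: d.drop_duplicates(["ID", "Type"]).pivot(
        index="ID", columns="Type", values="Value"),
    "groupby + first": lambda d: d.groupby(["ID", "Type"])["Value"].first().unstack(),
}

# Run each approach once so the results can be compared with each other.
results = {name: fn(big) for name, fn in candidates.items()}

# Average wall-clock time over 10 calls per approach.
for name, fn in candidates.items():
    seconds = timeit.timeit(lambda: fn(big), number=10) / 10
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```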
Using groupby with first:
df.groupby(['ID','Type']).Value.first().unstack()
Type A B
ID
1 311.0 1233.0
2 424.0 NaN
3 11.0 4.0
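One subtlety worth checking on your own data: as far as I know, groupby(...).first() skips missing values by default and returns the first non-null Value in each group, whereas drop_duplicates keeps the first row even when its Value is NaN. The sample data in the question has no missing values, so the tiny frame below is made up purely to illustrate the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first "A" value for ID 1 is missing.
df = pd.DataFrame({
    "ID":    [1, 1, 1],
    "Type":  ["A", "A", "B"],
    "Value": [np.nan, 223, 1233],
})

# first() looks for the first non-null value within each (ID, Type) group.
by_first = df.groupby(["ID", "Type"])["Value"].first().unstack()

# drop_duplicates keeps the first row per (ID, Type), NaN or not.
by_dedup = (
    df.drop_duplicates(["ID", "Type"])
      .pivot(index="ID", columns="Type", values="Value")
)

print(by_first)
print(by_dedup)
```

If "first occurrence" should mean the first row regardless of missing values, the drop_duplicates route is probably the safer choice.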
Or using groupby with head, followed by pivot:
df.groupby(['ID','Type'],as_index=False).head(1).pivot(index='ID', columns='Type', values='Value')
Type A B
ID
1 311.0 1233.0
2 424.0 NaN
3 11.0 4.0