Use groupby in Pandas to count things in one column in comparison to another
Maybe groupby is the wrong approach. It seems like it should work, but I'm not seeing it...
I want to group events by their outcome. Here is my DataFrame (df):
Status Event
SUCCESS Run
SUCCESS Walk
SUCCESS Run
FAILED Walk
Here is the result I want:
Event SUCCESS FAILED
Run 2 1
Walk 0 1
I'm trying to build a grouped object, but I don't know how to call it to display what I want.
grouped = df['Status'].groupby(df['Event'])
Try this:
pd.crosstab(df.Event, df.Status)
Status FAILED SUCCESS
Event
Run 0 2
Walk 1 1
len("df.groupby('Event').Status.value_counts().unstack().fillna(0)")
61
len("df.pivot_table(index='Event', columns='Status', aggfunc=len, fill_value=0)")
74
len("pd.crosstab(df.Event, df.Status)")
32
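To tie this back to the groupby attempt in the question, here is a minimal sketch that rebuilds the sample frame and counts the pairs two ways. The frame construction is an assumption based on the data shown in the question, and the groupby variant uses size() on both columns rather than the value_counts() chain measured above; both should give the same counts on this data.

```python
import pandas as pd

# Sample data from the question (construction assumed for illustration)
df = pd.DataFrame({
    "Status": ["SUCCESS", "SUCCESS", "SUCCESS", "FAILED"],
    "Event": ["Run", "Walk", "Run", "Walk"],
})

# groupby route: count each (Event, Status) pair, then pivot Status into
# columns; fill_value=0 covers pairs that never occur
by_groupby = df.groupby(["Event", "Status"]).size().unstack(fill_value=0)

# crosstab route: the same table in a single call
by_crosstab = pd.crosstab(df.Event, df.Status)

print(by_groupby)
print(by_crosstab)
```

Passing fill_value=0 to unstack() keeps the counts as integers, whereas unstack().fillna(0) goes through NaN first and usually leaves floats.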
Another solution, using the pivot_table() method:
In [5]: df.pivot_table(index='Event', columns='Status', aggfunc=len, fill_value=0)
Out[5]:
Status FAILED SUCCESS
Event
Run 0 2
Walk 1 1
Timings against a 700K-row DataFrame:
In [74]: df.shape
Out[74]: (700000, 2)
In [75]: # (c) Merlin
In [76]: %%timeit
....: pd.crosstab(df.Event, df.Status)
....:
1 loop, best of 3: 333 ms per loop
In [77]: # (c) piRSquared
In [78]: %%timeit
....: df.groupby('Event').Status.value_counts().unstack().fillna(0)
....:
1 loop, best of 3: 325 ms per loop
In [79]: # (c) MaxU
In [80]: %%timeit
....: df.pivot_table(index='Event', columns='Status',
....: aggfunc=len, fill_value=0)
....:
1 loop, best of 3: 367 ms per loop
In [81]: # (c) ayhan
In [82]: %%timeit
....: (df.assign(ones = np.ones(len(df)))
....: .pivot_table(index='Event', columns='Status',
....: aggfunc=np.sum, values = 'ones')
....: )
....:
1 loop, best of 3: 264 ms per loop
In [83]: # (c) Divakar
In [84]: %%timeit
....: unq1,ID1 = np.unique(df['Event'],return_inverse=True)
....: unq2,ID2 = np.unique(df['Status'],return_inverse=True)
....: # Get linear indices/tags corresponding to grouped headers
....: tag = ID1*(ID2.max()+1) + ID2
....: # Setup 2D Numpy array equivalent of expected Dataframe
....: out = np.zeros((len(unq1),len(unq2)),dtype=int)
....: unqID, count = np.unique(tag,return_counts=True)
....: np.put(out,unqID,count)
....: # Finally convert to Dataframe
....: df_out = pd.DataFrame(out,columns=unq2)
....: df_out.index = unq1
....:
1 loop, best of 3: 2.25 s per loop
Conclusion: @ayhan's solution currently wins:
(df.assign(ones = np.ones(len(df)))
.pivot_table(index='Event', columns='Status', values = 'ones',
aggfunc=np.sum, fill_value=0)
)
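For reference, a minimal sketch of that approach on the question's sample data. The frame construction is an assumption for illustration, aggfunc="sum" is swapped in for np.sum (intended to be equivalent here), and the final astype(int) is an addition: summing a column of np.ones yields floats, so the cast restores integer counts.

```python
import numpy as np
import pandas as pd

# Sample data from the question (construction assumed for illustration)
df = pd.DataFrame({
    "Status": ["SUCCESS", "SUCCESS", "SUCCESS", "FAILED"],
    "Event": ["Run", "Walk", "Run", "Walk"],
})

counts = (
    df.assign(ones=np.ones(len(df)))          # helper column of 1.0 values
      .pivot_table(index="Event", columns="Status", values="ones",
                   aggfunc="sum", fill_value=0)  # sum of ones == row count
)

# The sums are floats; cast back to integers for a count table
counts = counts.astype(int)
print(counts)
```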
Here is a NumPy-based approach:
# Get unique header strings for input dataframes
unq1,ID1 = np.unique(df['Event'],return_inverse=True)
unq2,ID2 = np.unique(df['Status'],return_inverse=True)
# Get linear indices/tags corresponding to grouped headers
tag = ID1*(ID2.max()+1) + ID2
# Setup 2D Numpy array equivalent of expected Dataframe
out = np.zeros((len(unq1),len(unq2)),dtype=int)
unqID, count = np.unique(tag,return_counts=True)
np.put(out,unqID,count)
# Finally convert to Dataframe
df_out = pd.DataFrame(out,columns=unq2)
df_out.index = unq1
Sample input and output for a more generic case:
In [179]: df
Out[179]:
Event Status
0 Sit PASS
1 Run SUCCESS
2 Walk SUCCESS
3 Run PASS
4 Run SUCCESS
5 Walk FAILED
6 Walk PASS
In [180]: df_out
Out[180]:
FAILED PASS SUCCESS
Run 0 1 2
Sit 0 1 0
Walk 1 1 1
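The same idea can be packaged as a reusable function. This is only a sketch: count_table is a hypothetical name, and it uses np.bincount with minlength in place of the np.unique(..., return_counts=True) plus np.put step, which should produce the same table.

```python
import numpy as np
import pandas as pd

def count_table(df, row_col, col_col):
    """Count co-occurrences of two columns (hypothetical helper)."""
    # Sorted unique labels plus an integer code for every row
    rows, row_ids = np.unique(df[row_col].to_numpy(), return_inverse=True)
    cols, col_ids = np.unique(df[col_col].to_numpy(), return_inverse=True)
    # Linear index of each (row, col) pair in a row-major 2D grid
    flat = row_ids * len(cols) + col_ids
    # Count each linear index; minlength keeps the empty cells as zeros
    counts = np.bincount(flat, minlength=len(rows) * len(cols))
    return pd.DataFrame(counts.reshape(len(rows), len(cols)),
                        index=rows, columns=cols)

# The generic sample input shown above
df = pd.DataFrame({
    "Event": ["Sit", "Run", "Walk", "Run", "Run", "Walk", "Walk"],
    "Status": ["PASS", "SUCCESS", "SUCCESS", "PASS", "SUCCESS", "FAILED", "PASS"],
})
df_out = count_table(df, "Event", "Status")
print(df_out)
```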