Pandas - count if multiple conditions
Given a DataFrame in Python:
CASE TYPE
1 A
1 A
1 A
2 A
2 B
3 B
3 B
3 B
How can I create a result DataFrame that lists every case once, with "A" if the case was assigned only A's, "B" if it was assigned only B's, or "MIXED" if the case had both A and B?
The result would then be:
Case Type
1 A
2 MIXED
3 B
Here is one option: first collect the unique TYPE values for each CASE group, then check how many there are. If there is more than one, return MIXED; otherwise return the TYPE itself:
import pandas as pd
# Collect the unique TYPE values per CASE, then label cases with more than one as MIXED.
groups = (df.groupby('CASE')['TYPE']
            .unique()
            .apply(lambda types: 'MIXED' if len(types) > 1 else types[0]))
groups
# CASE
# 1 A
# 2 MIXED
# 3 B
# Name: TYPE, dtype: object
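For readers who want something that runs as is, the same "count the distinct types per case" idea can be sketched as a small function. This is only an illustrative variant, not the answerer's exact code: the helper name `label_cases` and its `mixed_label` parameter are hypothetical, and the sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({'CASE': [1, 1, 1, 2, 2, 3, 3, 3],
                   'TYPE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})


def label_cases(frame, mixed_label='MIXED'):
    """Return one row per CASE with its single TYPE, or mixed_label.

    Hypothetical helper, written for illustration only.
    """
    # Per CASE: number of distinct TYPE values and the first TYPE seen.
    stats = frame.groupby('CASE')['TYPE'].agg(['nunique', 'first'])
    # Keep the first TYPE where the case has exactly one distinct TYPE,
    # otherwise substitute the mixed label.
    labels = stats['first'].where(stats['nunique'] == 1, mixed_label)
    return labels.rename('TYPE').reset_index()


result = label_cases(df)
print(result)
```

Because it relies only on built-in group aggregations and `Series.where`, this variant avoids calling a Python lambda once per group, and it returns a two-column DataFrame in the shape the question asks for.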
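# Count the distinct TYPE values per CASE and broadcast the count to every row.
df['NTYPES'] = df.groupby('CASE')['TYPE'].transform('nunique')
df.loc[df.NTYPES > 1, 'TYPE'] = 'MIXED'
# Grouping by TYPE keeps one row per label, so this assumes each label belongs to a single CASE.
df.groupby('TYPE', as_index=False).first().drop(columns='NTYPES')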
TYPE CASE
0 A 1
1 B 3
2 MIXED 2
Here is a slightly ugly, but not that slow, solution:
In [154]: df
Out[154]:
CASE TYPE
0 1 A
1 1 A
2 1 A
3 2 A
4 2 B
5 3 B
6 3 B
7 3 B
8 4 C
9 4 C
10 4 B
In [155]: %paste
(df.groupby('CASE')['TYPE']
.apply(lambda x: x.head(1) if x.nunique() == 1 else pd.Series(['MIX']))
.reset_index()
.drop(columns='level_1')
)
## -- End pasted text --
Out[155]:
CASE TYPE
0 1 A
1 2 MIX
2 3 B
3 4 MIX
Timing against an 800K-row DataFrame:
In [191]: df = pd.concat([df] * 10**5, ignore_index=True)
In [192]: df.shape
Out[192]: (800000, 3)
In [193]: %timeit Psidom(df)
1 loop, best of 3: 235 ms per loop
In [194]: %timeit capitalistpug(df)
1 loop, best of 3: 419 ms per loop
In [195]: %timeit Alberto_Garcia_Raboso(df)
10 loops, best of 3: 112 ms per loop
In [196]: %timeit MaxU(df)
10 loops, best of 3: 80.4 ms per loop
Here is an (admittedly over-engineered) solution that avoids looping over groups and DataFrame.apply (these are slow, so avoiding them may become important if your dataset gets sufficiently large).
import pandas as pd
df = pd.DataFrame({'CASE': [1]*3 + [2]*2 + [3]*3,
'TYPE': ['A']*4 + ['B']*4})
We group by CASE and compute the relative frequencies of TYPE being A or B:
grouped = df.groupby('CASE')
vc = (grouped['TYPE'].value_counts(normalize=True)
.unstack(level=0)
.fillna(0))
Here's what vc looks like:
CASE 1 2 3
TYPE
A 1.0 0.5 0.0
B 0.0 0.5 1.0
Notice that all the information is contained in the first row. Cutting that row into bins with pd.cut gives the desired result:
tolerance = 1e-10
bins = [-tolerance, tolerance, 1-tolerance, 1+tolerance]
types = pd.cut(vc.loc['A'], bins=bins, labels=['B', 'MIXED', 'A'])
We get:
CASE
1 A
2 MIXED
3 B
Name: A, dtype: category
Categories (3, object): [B < MIXED < A]
For good measure, we can rename the types series:
types.name = 'TYPE'
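To get from the categorical series to the two-column result the question asks for, the steps can be condensed as in the sketch below. Note two assumptions: the share of A per case is computed with a boolean mean rather than value_counts (a swapped-in shortcut, not the code above), and the data is assumed to contain only the types A and B. With a third type, a case holding no A at all would be binned as B.

```python
import pandas as pd

df = pd.DataFrame({'CASE': [1]*3 + [2]*2 + [3]*3,
                   'TYPE': ['A']*4 + ['B']*4})

# Share of rows with TYPE == 'A' in each CASE.
# Assumption: only 'A' and 'B' occur, so this share carries all the information.
share_a = df['TYPE'].eq('A').groupby(df['CASE']).mean()

# Same bins as above: ~0 -> only B, strictly between 0 and 1 -> MIXED, ~1 -> only A.
tolerance = 1e-10
bins = [-tolerance, tolerance, 1 - tolerance, 1 + tolerance]
types = pd.cut(share_a, bins=bins, labels=['B', 'MIXED', 'A'])

# Turn the categorical Series into a plain CASE/TYPE DataFrame.
result = types.astype(str).rename('TYPE').reset_index()
print(result)
```

Converting the categories to strings drops the B < MIXED < A ordering that pd.cut attaches, which is usually what you want for a plain label column.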