
Pandas - count if multiple conditions

Given a DataFrame in Python:

CASE    TYPE
1          A
1          A
1          A
2          A
2          B
3          B
3          B
3          B

How can I create a result DataFrame that lists every case once, with "A" if the case had only A's assigned, "B" if it had only B's, or "MIXED" if it had both A and B?

The result would then be:

Case     Type
1        A
2        MIXED
3        B

Here is one option: first collect the unique TYPE values for each CASE group, then check how many there are. If there is more than one, return MIXED; otherwise return the TYPE itself:

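import pandas as pd

# Sample data from the question
df = pd.DataFrame({'CASE': [1, 1, 1, 2, 2, 3, 3, 3],
                   'TYPE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})

# Unique TYPE values per CASE, e.g. ['A', 'B'] for CASE 2
uniques = df.groupby('CASE')['TYPE'].unique()
# More than one unique value means the case is MIXED
groups = uniques.apply(lambda types: 'MIXED' if len(types) > 1 else types[0])
groups

# CASE
# 1        A
# 2    MIXED
# 3        B
# Name: TYPE, dtype: object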
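
Another option: count the distinct TYPE values per CASE, overwrite TYPE with MIXED where there is more than one, then keep one row per CASE:

# Number of distinct TYPE values within each CASE, broadcast back to every row
df['NTYPES'] = df.groupby('CASE')['TYPE'].transform('nunique')
df.loc[df.NTYPES > 1, 'TYPE'] = 'MIXED'
# Group by CASE (not TYPE) so that two cases sharing a label are both kept
df.groupby('CASE', as_index=False).first().drop(columns='NTYPES')

   CASE   TYPE
0     1      A
1     2  MIXED
2     3      B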
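
The counting idea can also be sketched without adding a helper column to df. This is only an illustration: the function name label_cases is hypothetical, and it assumes the column names CASE and TYPE from the question.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'CASE': [1, 1, 1, 2, 2, 3, 3, 3],
                   'TYPE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})

def label_cases(frame):
    """Return one row per CASE: its TYPE if unique, else 'MIXED' (hypothetical helper)."""
    per_case = frame.groupby('CASE')['TYPE'].agg(['nunique', 'first'])
    # Keep the first TYPE where the group has one distinct value; otherwise use 'MIXED'.
    labels = per_case['first'].where(per_case['nunique'] == 1, 'MIXED')
    return labels.rename('TYPE').reset_index()

result = label_cases(df)
print(result)
```

Because it uses only built-in aggregations, this version leaves df unchanged and does not call a Python lambda per group.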

Here is a slightly ugly, but reasonably fast solution:

In [154]: df
Out[154]:
    CASE TYPE
0      1    A
1      1    A
2      1    A
3      2    A
4      2    B
5      3    B
6      3    B
7      3    B
8      4    C
9      4    C
10     4    B

In [155]: %paste
(df.groupby('CASE')['TYPE']
   .apply(lambda x: x.head(1) if x.nunique() == 1 else pd.Series(['MIX']))
   .reset_index()
   .drop(columns='level_1')
)
## -- End pasted text --
Out[155]:
   CASE TYPE
0     1    A
1     2  MIX
2     3    B
3     4  MIX

Timing against an 800K-row DataFrame:

In [191]: df = pd.concat([df] * 10**5, ignore_index=True)

In [192]: df.shape
Out[192]: (800000, 3)

In [193]: %timeit Psidom(df)
1 loop, best of 3: 235 ms per loop

In [194]: %timeit capitalistpug(df)
1 loop, best of 3: 419 ms per loop

In [195]: %timeit Alberto_Garcia_Raboso(df)
10 loops, best of 3: 112 ms per loop

In [196]: %timeit MaxU(df)
10 loops, best of 3: 80.4 ms per loop
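
The functions timed above are not shown on this page; presumably each one wraps one of the answers. As a rough sketch of how such a comparison might be set up outside IPython, the following uses the standard-library timeit module. The wrapper names apply_based and agg_based are hypothetical, and the numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

small = pd.DataFrame({'CASE': [1, 1, 1, 2, 2, 3, 3, 3],
                      'TYPE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})
# Repeat the 8 sample rows; 10**5 repetitions would give the 800K rows used above.
big = pd.concat([small] * 10**3, ignore_index=True)

def apply_based(frame):
    # Hypothetical wrapper: one Python-level lambda call per group.
    return frame.groupby('CASE')['TYPE'].apply(
        lambda s: s.iloc[0] if s.nunique() == 1 else 'MIXED')

def agg_based(frame):
    # Hypothetical wrapper: built-in aggregations only, no per-group lambda.
    stats = frame.groupby('CASE')['TYPE'].agg(['nunique', 'first'])
    return stats['first'].where(stats['nunique'] == 1, 'MIXED')

for func in (apply_based, agg_based):
    seconds = timeit.timeit(lambda: func(big), number=5) / 5
    print(f'{func.__name__}: {seconds * 1000:.2f} ms per call')
```

With only three groups the per-group lambda is called just three times, so a fair comparison would also need data with many distinct CASE values.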

Here is an (admittedly over-engineered) solution that avoids looping over groups and DataFrame.apply (both are slow, so avoiding them may matter once your dataset gets sufficiently large).

import pandas as pd
df = pd.DataFrame({'CASE': [1]*3 + [2]*2 + [3]*3,
                   'TYPE': ['A']*4 + ['B']*4})

We group by CASE and compute the relative frequencies of TYPE being A or B:

grouped = df.groupby('CASE')
vc = (grouped['TYPE'].value_counts(normalize=True)
                     .unstack(level=0)
                     .fillna(0))

Here's what vc looks like:

CASE   1    2    3
TYPE
A      1.0  0.5  0.0
B      0.0  0.5  1.0

Notice that all the information is contained in the first row: because each column sums to 1, the share of A alone determines the label. Cutting that row into bins with pd.cut gives the desired result:

tolerance = 1e-10
bins = [-tolerance, tolerance, 1-tolerance, 1+tolerance]
types = pd.cut(vc.loc['A'], bins=bins, labels=['B', 'MIXED', 'A'])

We get:

CASE
1        A
2    MIXED
3        B
Name: A, dtype: category
Categories (3, object): [B < MIXED < A]

For good measure, we can rename the types series:

types.name = 'TYPE'
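
To see why the A row alone is enough, and how the three bins map to labels, here is a small check on the same sample data. It is a sketch under the assumption that TYPE only ever takes the two values A and B; the names column_sums and shares are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'CASE': [1]*3 + [2]*2 + [3]*3,
                   'TYPE': ['A']*4 + ['B']*4})

vc = (df.groupby('CASE')['TYPE'].value_counts(normalize=True)
        .unstack(level=0)
        .fillna(0))

# Each column holds relative frequencies, so the B row is 1 minus the A row.
column_sums = vc.sum(axis=0)

tolerance = 1e-10
bins = [-tolerance, tolerance, 1 - tolerance, 1 + tolerance]
# Bare shares of A: none, half, all.
shares = pd.Series([0.0, 0.5, 1.0])
labels = pd.cut(shares, bins=bins, labels=['B', 'MIXED', 'A'])

print(column_sums.tolist())
print(labels.tolist())
```

A share of exactly 0 falls in the first bin (only B), a share of exactly 1 in the last bin (only A), and anything in between is MIXED. With a third TYPE value such as C, the A row would no longer carry all the information, so this trick would need to be adapted.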
