
Apply function to pandas groupby

I have a pandas dataframe with a column called my_labels which contains the strings 'A', 'B', 'C', 'D', 'E'. I would like to count the number of occurrences of each of these strings, then divide each count by the sum of all the counts. I'm trying to do this in pandas like this:

func = lambda x: x.size() / x.sum()
data = frame.groupby('my_labels').apply(func)

This code throws an error, 'DataFrame' object has no attribute 'size'. How can I apply a function to calculate this in pandas?

apply takes a function to apply to each value, not the Series, and accepts kwargs. So the values do not have a .size() method.
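As a side note, GroupBy objects do expose size() as a method, so a minimal sketch of the intended calculation (assuming a frame with the same my_labels column as in the question) could look like this:

import pandas as pd

df = pd.DataFrame({"my_labels": ['A', 'B', 'A', 'C', 'D', 'D', 'E']})

# size() is a method on the GroupBy object (unlike DataFrame.size, which is
# a property), so the counts can be normalized without a custom function.
counts = df.groupby('my_labels').size()
print(counts / counts.sum())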

Perhaps this would work:

import pandas as pd

d = {"my_label": pd.Series(['A','B','A','C','D','D','E'])}
df = pd.DataFrame(d)


def as_perc(value, total):
    # Express a single group's count as a fraction of the overall total.
    return value / float(total)


def get_count(values):
    # Number of values in the group.
    return len(values)


grouped_count = df.groupby("my_label").my_label.agg(get_count)
data = grouped_count.apply(as_perc, total=df.my_label.count())

The .agg() method here applies the function to each group of the groupby object.
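A more compact variant of the same idea (a minimal sketch, assuming the df defined above) folds the counting and the normalization into two lines, using the built-in "count" aggregation instead of the custom get_count helper:

counts = df.groupby("my_label").my_label.agg("count")
data = counts / counts.sum()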

As of pandas version 0.22, there is also an alternative to apply: pipe, which can be considerably faster than apply (you can also check this question for more differences between the two).

For your example:

df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})

  my_label
0        A
1        B
2        A
3        C
4        D
5        D
6        E

The apply version

df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])

gives

          my_label
my_label          
A         0.285714
B         0.142857
C         0.142857
D         0.285714
E         0.142857

and the pipe version

df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())

yields

my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857

So the values are identical; however, the timings differ quite a lot (at least for this small dataframe):

%timeit df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
100 loops, best of 3: 5.52 ms per loop

and

%timeit df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
1000 loops, best of 3: 843 µs per loop

Wrapping it into a function is then also straightforward:

def get_perc(grp_obj):
    gr_size = grp_obj.size()
    return gr_size / gr_size.sum()

Now you can call

df.groupby('my_label').pipe(get_perc)

yielding

my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857

However, for this particular case you do not even need a groupby; you can just use value_counts like this:

df['my_label'].value_counts(sort=False) / df.shape[0]

yielding

A    0.285714
C    0.142857
B    0.142857
E    0.142857
D    0.285714
Name: my_label, dtype: float64

For this small dataframe it is quite fast:

%timeit df['my_label'].value_counts(sort=False) / df.shape[0]
1000 loops, best of 3: 770 µs per loop

As pointed out by @anmol, the last statement can also be simplified to

df['my_label'].value_counts(sort=False, normalize=True)

Try:

import pandas as pd

g = pd.DataFrame(['A','B','A','C','D','D','E'])

# Group by the contents of column 0
gg = g.groupby(0)

# Create a DataFrame with the counts of each letter
histo = gg.apply(lambda x: x.count())

# Add a new column that is the count / total number of elements
histo[1] = histo.astype(float) / len(g)

print(histo)

Output:

   0         1
0             
A  2  0.285714
B  1  0.142857
C  1  0.142857
D  2  0.285714
E  1  0.142857

Regarding the issue with 'size': size is not a function on a dataframe, it is a property. So instead of using size(), plain size should work.

Apart from that, a method like this should work:

def doCalculation(df):
    # Total number of elements in the group (rows x columns).
    groupCount = df.size
    # Number of non-null labels in the group.
    groupSum = df['my_labels'].notnull().sum()

    return groupCount / groupSum

dataFrame.groupby('my_labels').apply(doCalculation)

I once saw a nested function technique for computing a weighted average on SO; adapting that technique can solve your issue.

import pandas as pd

def group_weight(overall_size):
    # Returns a function that computes each group's share of the overall size.
    def inner(group):
        return len(group) / float(overall_size)
    inner.__name__ = 'weight'
    return inner

d = {"my_label": pd.Series(['A','B','A','C','D','D','E'])}
df = pd.DataFrame(d)
print(df.groupby('my_label').apply(group_weight(len(df))))



my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857
dtype: float64

Here is how to do a weighted average within groups:

def wavg(val_col_name, wt_col_name):
    # Returns a function that computes the average of val_col_name,
    # weighted by wt_col_name, within each group.
    def inner(group):
        return (group[val_col_name] * group[wt_col_name]).sum() / group[wt_col_name].sum()
    inner.__name__ = 'wgt_avg'
    return inner



d = {"P": pd.Series(['A','B','A','C','D','D','E'])
     ,"Q": pd.Series([1,2,3,4,5,6,7])
    ,"R": pd.Series([0.1,0.2,0.3,0.4,0.5,0.6,0.7])
     }

df = pd.DataFrame(d)
print df.groupby('P').apply(wavg('Q','R'))

P
A    2.500000
B    2.000000
C    4.000000
D    5.545455
E    7.000000
dtype: float64
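As a quick sanity check of the first row: group 'A' contains the values 1 and 3 with weights 0.1 and 0.3, so the weighted average is (1 * 0.1 + 3 * 0.3) / (0.1 + 0.3) = 1.0 / 0.4 = 2.5, matching the output above.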

For your particular case, groupby is not required. Instead you can do:

df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})
df['my_label'].value_counts(sort=False, normalize=True)

This will return what you require.
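As shown in the earlier answer, this yields the same fractions (0.285714 for 'A' and 'D', 0.142857 for 'B', 'C', and 'E'), computed directly by value_counts with normalize=True instead of a manual division.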
