
Calculations within pandas aggregate

I am trying to perform a calculation inside a pandas aggregation, so that the result is included alongside the other aggregates. The code I am attempting is below; data is a pandas DataFrame.

data = data.groupby(['type', 'name']).agg({'values': [np.min, np.max, 100 * sum([('values' > 3200)] / [np.size])]})

The formula I am trying to calculate is:

100 * sum((values > 3200) / (np.size))

Here np.size is the size of the group being aggregated, and only the numbers greater than 3200 are counted in the sum. Any help on how to perform calculations like this within an aggregation would be appreciated.

Example input data (the actual dataset is much larger). The repeat values are due to the aggregation.

type, name, values
apple, blue, 2500
orange, green, 2800
peach, black, 3300
lemon, white, 3500

Desired example output (the numbers are not correct, since I have not yet been able to perform the calculation):

type, name, values, np.min, np.max, calculation
apple, blue, 2500, 1200, 40000, 2300
orange, green, 2800, 1200, 5000, 2500

In agg, you give each output column a name and the function that produces it. Here you are essentially trying to pack three formulas into a single named column, and the third one is an inline expression rather than a function, so it is going to fail.

What you should be doing looks more like:

# Select the 'values' column first, then name each output (keyword form, pandas >= 0.25)
data = data.groupby(['type', 'name'])['values'].agg(min=np.min, max=np.max, calculation=calculation)

Here calculation is your formula rewritten as either a lambda or a named function, whichever you prefer.
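As a rough sketch of that suggestion, the formula can be written as a plain function that receives one group's values as a Series. The sample frame and the name pct_above are made up for illustration, and the keyword style of agg assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical sample data; the real dataset is much larger.
data = pd.DataFrame({
    'type':   ['apple', 'apple', 'apple', 'orange', 'orange'],
    'name':   ['blue', 'blue', 'blue', 'green', 'green'],
    'values': [2500, 3300, 4000, 2800, 3500],
})

def pct_above(series, threshold=3200):
    """Return 100 * (count of values above threshold) / (group size)."""
    return 100 * (series > threshold).sum() / series.size

# One row per (type, name) group, one column per named aggregate.
result = (
    data.groupby(['type', 'name'])['values']
        .agg(min='min', max='max', calculation=pct_above)
)
print(result)
```

Selecting the column with ['values'] rather than attribute access avoids any clash with the DataFrame's own .values attribute.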

You need to define a function that acts on each group and returns the percentage of values greater than 3200, then pass it to .agg along with the other functions:

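# Percentage of values in each group that exceed 3200
func = lambda series: 100 * (series > 3200).mean()
# Keyword form; the dict-renaming form {'min': min, ...} raises SpecificationError in pandas >= 1.0
data.groupby(['type', 'name'])['values'].agg(min=min, max=max, calculation=func)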

The mean of a boolean vector is the fraction of True values, so multiplying it by 100 gives the percentage, which is a nicer way of calculating it than summing and dividing by the size. Also, you can pass common function names such as min and max in as strings.
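To make the boolean-mean point concrete, here is a small sketch on invented numbers. It checks that the mean of the mask matches the explicit count divided by size, and shows min and max passed as strings. The variable names and the tiny frame are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical values for a single group.
s = pd.Series([2500, 2800, 3300, 3500])

mask = s > 3200                          # False, False, True, True
by_mean = 100 * mask.mean()              # mean of booleans = fraction of True
by_count = 100 * mask.sum() / mask.size  # the explicit sum / size form

print(by_mean, by_count)

# Common aggregations can be named by string instead of passing functions.
df = pd.DataFrame({'type': ['a', 'a', 'b', 'b'], 'values': s})
summary = df.groupby('type')['values'].agg(['min', 'max'])
print(summary)
```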
