
Python pandas apply function to grouped dataframe

I am working on the Titanic dataset.
One ticket can be issued for several passengers, i.e. several passengers may have the same ticket number.
The 'Fare' feature for all those passengers will be the same and equal to the whole ticket fare.
For example, if 4 passengers travel on one ticket, the ticket fare can be $40, but each passenger's fare should be $10.
So one should divide the ticket fare by the ticket frequency to calculate the fare per passenger.
But there is one more thing: babies are charged $2, and children younger than 12 are charged half the adult fare.
So I am trying to calculate the price paid by each adult on the ticket, taking the children's fares into account.
Here is a sample data frame:

df = pd.DataFrame({'Age': [0.5,5,20,21,22,23,24], 'Fare': [17,17,17,40,40,40,40], 'TicketNum': [1,1,1,2,2,2,2]})
    Age  Fare  TicketNum
0   0.5    17          1
1   5.0    17          1
2  20.0    17          1
3  21.0    40          2
4  22.0    40          2
5  23.0    40          2
6  24.0    40          2

First I wrote this function:

def fare_calc(x):
    ticket_fare = x['Fare'].mean()

    group_size = x.shape[0]
    babies_count = x[x['Age']<1].count()
    child_count = x[x['Age']<12].count()
    adult_count = group_size - babies_count - child_count
    adult_fare = (ticket_fare - babies_count * 2) / (adult_count + child_count*0.5)
    return adult_fare

Then I try:

df['TicketFreq'] = df.groupby('TicketNum')['TicketNum'].transform('count')
df['Fare2'] = df[df.TicketFreq>1].groupby(['TicketNum'])['Age','Fare'].agg(fare_calc)

and get an error:
ValueError: Wrong number of items passed 2, placement implies 1

The desired output is the following:

    Age  Fare  TicketNum  Fare2
0   0.5    17          1     10
1   5.0    17          1     10
2  20.0    17          1     10
3  21.0    40          2     10
4  22.0    40          2     10
5  23.0    40          2     10
6  24.0    40          2     10
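
As a sanity check on the desired output, the pricing rule described above can be written as a small helper. This is only a sketch: the name adult_fare and its parameters are hypothetical and do not appear in the original post.

```python
def adult_fare(ticket_fare, babies, children, adults):
    """Fare paid by one adult on a ticket.

    Assumes the rule from the question: each baby costs a flat $2 and
    each child pays half of the adult fare.
    """
    return (ticket_fare - 2 * babies) / (adults + 0.5 * children)


# Ticket 1: fare 17, one baby (0.5), one child (5), one adult (20)
print(adult_fare(17, babies=1, children=1, adults=1))  # 10.0
# Ticket 2: fare 40, four adults
print(adult_fare(40, babies=0, children=0, adults=4))  # 10.0
```

Both tickets give an adult fare of 10, which matches the Fare2 column above.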

Your formula seems to be wrong. However, the fare_calc function does get executed when you replace the .agg call with .apply and remove the two columns you specified. See the example below:

df[df.TicketFreq>1].groupby(['TicketNum']).apply(fare_calc)

Beyond that, only a few changes to your function were necessary.
To get a single number for babies_count and child_count, you need to select one column before calling .count(); otherwise you get one count per column instead of one integer.

def fare_calc(x):
    ticket_fare = x['Fare'].mean()
    group_size = x.shape[0]
    babies_count = x[x['Age']<1]['Age'].count()
    child_count = x[x['Age']<12]['Age'].count()
    adult_count = group_size - babies_count - child_count
    adult_fare = (ticket_fare - babies_count * 2) / (adult_count + child_count * 0.5)
    return adult_fare
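
To illustrate why the column selection matters, here is a minimal sketch on a three-row frame (the variable names per_column, single and alt are my own). Calling .count() on a filtered DataFrame returns one count per column, while selecting a column first returns a single number. Summing the boolean mask is a common alternative.

```python
import pandas as pd

x = pd.DataFrame({'Age': [0.5, 5, 20], 'Fare': [17, 17, 17]})

# DataFrame.count() returns a Series with one entry per column
per_column = x[x['Age'] < 1].count()

# Selecting a column first yields a single scalar
single = x[x['Age'] < 1]['Age'].count()

# Alternative: sum the boolean mask directly
alt = int((x['Age'] < 1).sum())

print(type(per_column).__name__, single, alt)
```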

Here is my solution.

I create a column-wise Series of values using the pd.Series() and .repeat() functions.

By the way, you forgot to exclude babies_count from child_count; this can be done with (x['Age']<12) & (x['Age']>=1).

def fare_calc(x):
    group_size   = x.shape[0]
    # repeat the ticket fare once per passenger in the group
    ticket_fare  = pd.Series(x['Fare'].mean().repeat(group_size))
    babies_count = x[x['Age']<1 ]['Age'].count()
    # children are aged 1 to 11; filter the group x, not the global df
    child_count  = x[(x['Age']<12) & (x['Age']>=1)]['Age'].count()
    adult_count  = group_size - babies_count - child_count
    adult_fare   = (ticket_fare - babies_count * 2) / (adult_count + child_count * 0.5)
    return adult_fare

Finally, extract only the values of the stacked Series created by apply, using .values, to avoid an "incompatible index" TypeError.

df['Fare2'] = df[df.TicketFreq>1].groupby(['TicketNum']).apply(fare_calc).values

print(df)
    Age  Fare  TicketNum  TicketFreq  Fare2
0   0.5    17          1           3   10.0
1   5.0    17          1           3   10.0
2  20.0    17          1           3   10.0
3  21.0    40          2           4   10.0
4  22.0    40          2           4   10.0
5  23.0    40          2           4   10.0
6  24.0    40          2           4   10.0

EDIT 1: a more intuitive version of the previous function:

import pandas as pd

df = pd.DataFrame({'Age': [0.5,5,20,21,22,23,24], 'Fare': [17,17,17,40,40,40,40], 'TicketNum': [1,1,1,2,2,2,2]})
df['TicketFreq'] = df.groupby('TicketNum')['TicketNum'].transform('count')

def fare_calc(x):
    group_size       = x.shape[0]
    x['ticket_fare'] = x['Fare'].mean()
    babies_count     = x[x['Age']<1 ]['Age'].count()
    child_count      = x[(x['Age']<12) & (x['Age']>=1)]['Age'].count()
    adult_count      = group_size - babies_count - child_count
    x['adult_fare']  = (x['ticket_fare'] - babies_count * 2) / (adult_count + child_count * 0.5)
    return x['adult_fare']

df['Fare2'] = df[df.TicketFreq>1].groupby(['TicketNum']).apply(fare_calc).values

print(df)
    Age  Fare  TicketNum  TicketFreq  Fare2
0   0.5    17          1           3   10.0
1   5.0    17          1           3   10.0
2  20.0    17          1           3   10.0
3  21.0    40          2           4   10.0
4  22.0    40          2           4   10.0
5  23.0    40          2           4   10.0
6  24.0    40          2           4   10.0

EDIT 2: an even simpler version, where 'Fare2' is created directly inside the function:

import pandas as pd

df = pd.DataFrame({'Age': [0.5,5,20,21,22,23,24], 'Fare': [17,17,17,40,40,40,40], 'TicketNum': [1,1,1,2,2,2,2]})
df['TicketFreq'] = df.groupby('TicketNum')['TicketNum'].transform('count')

def fare_calc(x):
    group_size       = x.shape[0]
    ticket_fare      = x['Fare'].mean()
    babies_count     = x[x['Age']<1 ]['Age'].count()
    child_count      = x[(x['Age']<12) & (x['Age']>=1)]['Age'].count()
    adult_count      = group_size - babies_count - child_count
    x['Fare2']       = (ticket_fare - babies_count * 2) / (adult_count + child_count * 0.5)
    return x

df = df[df.TicketFreq>1].groupby(['TicketNum'], group_keys=False).apply(fare_calc)

print(df)
    Age  Fare  TicketNum  TicketFreq  Fare2
0   0.5    17          1           3   10.0
1   5.0    17          1           3   10.0
2  20.0    17          1           3   10.0
3  21.0    40          2           4   10.0
4  22.0    40          2           4   10.0
5  23.0    40          2           4   10.0
6  24.0    40          2           4   10.0
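
As an alternative to apply, the same numbers can be obtained from per-ticket counts computed with transform. This is a sketch of a different technique, not the method used in the answers above. The names is_baby, is_child and is_adult are my own, and it assumes every ticket has at least one adult or child (otherwise the denominator is zero) and that 'Fare' already holds the whole-ticket fare on every row.

```python
import pandas as pd

df = pd.DataFrame({'Age': [0.5, 5, 20, 21, 22, 23, 24],
                   'Fare': [17, 17, 17, 40, 40, 40, 40],
                   'TicketNum': [1, 1, 1, 2, 2, 2, 2]})

# Row-level category flags
is_baby = df['Age'] < 1
is_child = (df['Age'] >= 1) & (df['Age'] < 12)
is_adult = ~(is_baby | is_child)

# Per-ticket counts, broadcast back to every row of the ticket
key = df['TicketNum']
babies = is_baby.groupby(key).transform('sum')
children = is_child.groupby(key).transform('sum')
adults = is_adult.groupby(key).transform('sum')

# Adult fare: remove the flat baby charge, then split by adult-equivalents
df['Fare2'] = (df['Fare'] - 2 * babies) / (adults + 0.5 * children)
print(df)
```

Because transform returns a result aligned with the original index, no .values workaround is needed when assigning the new column.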

A minor but immediate problem is that in your last line of code, ['Age', 'Fare'] should be [['Age', 'Fare']], as you want to index with a list of column names.

The main issue is that you have written fare_calc() to work on the whole DataFrame, but a function passed to .agg() will be applied to each column individually.
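
The difference can be observed with a small probe, sketched here under the assumption of a reasonably recent pandas version. The helper names probe_agg and probe_apply are hypothetical; each one only records the type of object it receives.

```python
import pandas as pd

df = pd.DataFrame({'Age': [0.5, 5, 20, 21, 22, 23, 24],
                   'Fare': [17, 17, 17, 40, 40, 40, 40],
                   'TicketNum': [1, 1, 1, 2, 2, 2, 2]})

agg_seen, apply_seen = [], []

def probe_agg(x):
    # record what .agg() hands to the function
    agg_seen.append(type(x).__name__)
    return 0

def probe_apply(x):
    # record what .apply() hands to the function
    apply_seen.append(type(x).__name__)
    return 0

grouped = df.groupby('TicketNum')[['Age', 'Fare']]
grouped.agg(probe_agg)      # called once per column per group
grouped.apply(probe_apply)  # called with each group's sub-frame

print(set(agg_seen), set(apply_seen))
```

With .agg() the function only ever sees a single column, so an expression such as x['Fare'] inside fare_calc cannot work there, whereas .apply() passes each group as a DataFrame.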
