In pandas, how to groupby and apply/transform on each whole group (NOT aggregation)?

I've looked into agg/apply/transform after groupby, but none of them seem to meet my need. Here is an example DataFrame:

df_seq = pd.DataFrame({
    'person':['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day':[1,2,3,1,4,6],
    'food':['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison']
})

person,day,food
Tom,1,beef
Tom,2,lamb
Tom,3,chicken
Lucy,1,fish
Lucy,4,pork
Lucy,6,venison

The day column shows that each person consumes food in sequential order.

Now I would like to group by the person column and create a DataFrame that contains food pairs for two neighboring days/times (as shown below).

Note that the day column is only for illustration here, so its values should not be used. It only indicates that the food column is in sequential order. In my real data, it is a datetime column.

person,day,food,food_next
Tom,1,beef,lamb
Tom,2,lamb,chicken
Lucy,1,fish,pork
Lucy,4,pork,venison

At the moment, I can only do this with a for-loop that iterates through all users. It's very slow.

Is it possible to use groupby with apply/transform, or any vectorized operation, to achieve this?

Create the new column with DataFrameGroupBy.shift, then remove the rows with missing values in food_next with DataFrame.dropna:

df = (df_seq.assign(food_next = df_seq.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print(df)
  person  day  food food_next
0    Tom    1  beef      lamb
1    Tom    2  lamb   chicken
3   Lucy    1  fish      pork
4   Lucy    4  pork   venison
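
The question mentions that the real data has a datetime column rather than day. A minimal sketch of the same shift-based approach for that case follows. The column name ts and the sample timestamps are assumptions made for illustration, as is the possibility that rows arrive unsorted; the idea is simply to sort by person and timestamp first so that shift(-1) picks up the next meal in time.

```python
import pandas as pd

# Hypothetical data: `ts` stands in for the real datetime column (name assumed),
# and the rows are deliberately out of order.
df_ts = pd.DataFrame({
    'person': ['Tom', 'Lucy', 'Tom', 'Lucy', 'Tom', 'Lucy'],
    'ts': pd.to_datetime(['2021-01-02', '2021-01-01', '2021-01-01',
                          '2021-01-06', '2021-01-03', '2021-01-04']),
    'food': ['lamb', 'fish', 'beef', 'venison', 'chicken', 'pork'],
})

# Sort so that shift(-1) means "next meal in time" within each person.
df_sorted = df_ts.sort_values(['person', 'ts'])

# Shift within each group, then drop each person's last row (no next meal).
pairs = (df_sorted.assign(food_next=df_sorted.groupby('person')['food'].shift(-1))
                  .dropna(subset=['food_next'])
                  .reset_index(drop=True))
print(pairs)
```

If the data is already ordered within each person, the sort_values step can be skipped, as in the answer above.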

This might be a slightly patchy answer, and it doesn't perform an aggregation in the standard sense.

First, a small query function that, given a name and a day, returns the first result matching the parameters (assuming the data is pre-sorted) and, failing that, returns a default value:

def get_next_food(df, person, day):
    results = df.query(f"`person`=='{person}' and `day`>{day}")
    if len(results)>0:
        return results.iloc[0]['food']
    else:
        return "Mystery"

You can use this as follows:

get_next_food(df_seq, "Tom", 1)

> 'lamb'

Now we can use this in an apply call to populate a new column with the results of this function applied row-wise:

df_seq['next_food'] = df_seq.apply(lambda x: get_next_food(df_seq, x['person'], x['day']), axis=1)

>
  person  day     food next_food
0    Tom    1     beef      lamb
1    Tom    2     lamb   chicken
2    Tom    3  chicken   Mystery
3   Lucy    1     fish      pork
4   Lucy    4     pork   venison
5   Lucy    6  venison   Mystery

Give it a try. I'm not convinced you'll see a vast performance improvement, since the function runs one query per row, but it would be interesting to find out.
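
For comparison, the same next_food column, including the "Mystery" default on each person's last row, can be sketched with a grouped shift instead of a row-wise apply. This assumes, as in the example data, that rows are already in chronological order within each person; unlike the query-based function, it does not look at the day values at all.

```python
import pandas as pd

df_seq = pd.DataFrame({
    'person': ['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day': [1, 2, 3, 1, 4, 6],
    'food': ['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison'],
})

# Take the next food within each person; the last row of each group has no
# successor, so fill it with the same default the function above returns.
df_seq['next_food'] = (df_seq.groupby('person')['food']
                             .shift(-1)
                             .fillna('Mystery'))
print(df_seq)
```

This makes a single pass over the column rather than one query per row, so it is likely to scale better on large frames, though that is worth measuring on the real data.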
