Python: Using apply efficiently on a pandas GroupBy object

I am trying to perform a task which is conceptually simple, but my code seems to be way too expensive. I am looking for a faster way, potentially utilizing pandas' built-in functions for GroupBy objects.

The starting point is a DataFrame called prices, with columns=['item_id', 'store_id', 'day', 'price'], in which each observation is the most recent price update specific to an item-store combination. The problem is that some price updates are the same as the previous price update for the same item-store combination. For example, let us look at a particular piece:

       day  item_id  store_id  price
35083   34    85376       211   5.95
56157   41    85376       211   6.00
63628   50    85376       211   5.95
64955   51    85376       211   6.00
66386   56    85376       211   6.00
69477   69    85376       211   5.95

In this example I would like the observation where day equals 56 to be dropped (because its price is the same as the previous observation in this group). My code is:

import numpy as np

def removeSameLast(df):
    # Compare each price with the previous one in the group, by position
    # (plain arrays avoid pandas' label alignment between the two slices).
    price = df['price'].to_numpy()
    diff = price[1:] != price[:-1]

    # Always keep the first row of the group.
    boo = np.append(True, diff)

    return df.loc[boo]

gCell = prices.groupby(['item_id', 'store_id'])
prices = gCell.apply(removeSameLast)

This does the job, but is ugly and slow. Sorry for being a noob, but I assume that this can be done much faster. Could someone please propose a solution? Many thanks in advance.

I would suggest a simple solution using the shift method from pandas. This removes the need for the groupby and your function call.

The idea is to see where the Series [5.95, 6, 5.95, 6, 6, 5.95] is equal to the shifted one, [nan, 5.95, 6, 5.95, 6, 6], and delete (or just don't select) the rows where this condition holds.

>>> mask = ~np.isclose(prices['price'], prices['price'].shift())
>>> prices[mask]
       day  item_id store_id    price
35083   34    85376      211    5.95
56157   41    85376      211    6.00
63628   50    85376      211    5.95
64955   51    85376      211    6.00
69477   69    85376      211    5.95
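
One caveat with a plain shift on the whole frame: it also compares the first row of each item-store group with the last row of the group before it, and it relies on the rows already being ordered by day within each group. The following is a minimal sketch of a group-aware variant, not a definitive implementation. The helper name drop_repeated_prices and the second group (item_id 90000) are made up for illustration; only the first six rows come from the question.

```python
import numpy as np
import pandas as pd

# Sample data: the group from the question plus a second, made-up group
# whose first price equals the last price of the group before it.
prices = pd.DataFrame({
    "day":      [34, 41, 50, 51, 56, 69, 10, 20, 30],
    "item_id":  [85376] * 6 + [90000] * 3,
    "store_id": [211] * 9,
    "price":    [5.95, 6.00, 5.95, 6.00, 6.00, 5.95, 5.95, 5.95, 7.00],
})

def drop_repeated_prices(df):
    """Keep rows whose price differs from the previous row of the same group."""
    # Sort so that "previous" means the previous day within each item-store pair.
    df = df.sort_values(["item_id", "store_id", "day"])
    # Shift within each group: the first row of every group gets NaN.
    previous = df.groupby(["item_id", "store_id"])["price"].shift()
    # np.isclose(x, NaN) is False, so the first row of each group is kept.
    keep = ~np.isclose(df["price"], previous)
    return df[keep]

result = drop_repeated_prices(prices)
print(result)
```

Here the row with day 10 of the made-up group survives, even though its price matches the last price of the preceding group. The shift is still vectorized, so this should stay much closer to the plain-shift version than to the apply-based one, though I have not timed it.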

Simple benchmark:

%timeit prices = gCell.apply(removeSameLast)
100 loops, best of 3: 4.46 ms per loop

%timeit mask = df.price != df.price.shift()
1000 loops, best of 3: 183 µs per loop
