Python: Using apply efficiently on a pandas GroupBy object
I am trying to perform a task which is conceptually simple, but my code seems to be far too expensive. I am looking for a faster way, potentially utilizing pandas' built-in functions for GroupBy objects.
The starting point is a DataFrame called prices, with columns=['item_id', 'store_id', 'day', 'price'], in which each observation is the most recent price update specific to an item-store combination. The problem is that some price updates are the same as the last price update for the same item-store combination. For example, let us look at a particular piece:
day item_id store_id price
35083 34 85376 211 5.95
56157 41 85376 211 6.00
63628 50 85376 211 5.95
64955 51 85376 211 6.00
66386 56 85376 211 6.00
69477 69 85376 211 5.95
In this example I would like the observation where day equals 56 to be dropped (because its price is the same as the previous observation in this group). My code is:
import numpy as np

def removeSameLast(df):
    shp = df.shape[0]
    # Use the raw values so the comparison is positional, not index-aligned
    lead = df['price'].values[1:shp]
    lag = df['price'].values[:shp-1]
    diff = np.array(lead != lag)
    # Always keep the first row of the group
    boo = np.array(1)
    boo = np.append(boo, diff)
    boo = boo.astype(bool)
    df = df.loc[boo]
    return df

gCell = prices.groupby(['item_id', 'store_id'])
prices = gCell.apply(removeSameLast)
This does the job, but is ugly and slow. Sorry for being a noob, but I assume that this can be done much faster. Could someone please propose a solution? Many thanks in advance.
I would suggest a simple solution using the shift function from pandas. This removes the need for the groupby and your function call.
The idea is to find where the Series [5.95, 6, 5.95, 6, 6, 5.95] is equal to its shifted version, [nan, 5.95, 6, 5.95, 6, 6], and delete (or just not select) the rows where that happens.
>>> mask = ~np.isclose(prices['price'], prices['price'].shift())
>>> prices[mask]
day item_id store_id price
35083 34 85376 211 5.95
56157 41 85376 211 6.00
63628 50 85376 211 5.95
64955 51 85376 211 6.00
69477 69 85376 211 5.95
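One caveat with a mask built over the whole column: it compares every row with the row directly above it, so it relies on the rows being ordered by item, store, and day, and the first row of a group can be dropped if its price happens to match the last price of the preceding group. Below is a minimal sketch of a group-aware variant, assuming the column names from the example. The second item-store pair (item 85377) is made up for illustration and is not from the original data.

```python
import numpy as np
import pandas as pd

# Example rows from the question plus one made-up second group (item 85377)
prices = pd.DataFrame({
    'day':      [34, 41, 50, 51, 56, 69, 10, 12],
    'item_id':  [85376] * 6 + [85377] * 2,
    'store_id': [211] * 8,
    'price':    [5.95, 6.00, 5.95, 6.00, 6.00, 5.95, 5.95, 7.10],
})

# Make sure rows are in chronological order inside each group
prices = prices.sort_values(['item_id', 'store_id', 'day'])

# Shift within each group, so the first row of every group gets NaN
prev = prices.groupby(['item_id', 'store_id'])['price'].shift()

# np.isclose treats NaN as "not close", so first rows are always kept
mask = ~np.isclose(prices['price'], prev)
deduped = prices[mask]
print(deduped)
```

Here the first row of the made-up group has the same price (5.95) as the last row of the group before it. A shift over the whole column would drop that row, while the per-group shift keeps it. The groupby is still there, but shift is a built-in GroupBy operation, so no Python function is called per group.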
Simple benchmark:
%timeit prices = gCell.apply(removeSameLast)
100 loops, best of 3: 4.46 ms per loop
%timeit mask = df.price != df.price.shift()
1000 loops, best of 3: 183 µs per loop
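These timings depend on the machine and on the size of prices. Below is a rough sketch for repeating the comparison outside IPython with the standard timeit module, on synthetic data. The group sizes, the random seed, and the helper name drop_repeats are assumptions for illustration and are not part of the original post. It times a per-group shift rather than the whole-column shift, so that both approaches keep exactly the same rows.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 200 item-store groups with 50 price updates each (made-up sizes)
rng = np.random.default_rng(0)
n_groups, n_obs = 200, 50
prices = pd.DataFrame({
    'item_id': np.repeat(np.arange(n_groups), n_obs),
    'store_id': 211,
    'day': np.tile(np.arange(n_obs), n_groups),
    'price': rng.choice([5.95, 6.00], size=n_groups * n_obs),
})

def drop_repeats(df):
    # Hypothetical per-group helper: keep a row only if its price changed
    price = df['price'].to_numpy()
    keep = np.append(True, price[1:] != price[:-1])
    return df[keep]

keys = ['item_id', 'store_id']

def with_apply():
    # One Python-level function call per group
    return prices.groupby(keys, group_keys=False).apply(drop_repeats)

def with_shift():
    # Vectorised: NaN != price is True, so each group's first row is kept
    prev = prices.groupby(keys)['price'].shift()
    return prices[prices['price'] != prev]

# Both approaches should keep the same rows
same_rows = with_apply().index.equals(with_shift().index)

t_apply = timeit.timeit(with_apply, number=5) / 5
t_shift = timeit.timeit(with_shift, number=5) / 5
print(f"same rows: {same_rows}")
print(f"apply: {t_apply * 1e3:.2f} ms, shift: {t_shift * 1e3:.2f} ms")
```

Depending on the pandas version, the apply call may print a deprecation warning about grouping columns; it does not affect which rows are kept.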