[英]calculate difference between rows for the same product
I am using Python and Pandas to make some calculations over a dataframe that contains millions of rows like the following ones:我正在使用 Python 和 Pandas 对包含数百万行的 dataframe 进行一些计算,如下所示:
city product month sold
Milan Spaghetti 2021-03-31 30300
Milan Spaghetti 2021-06-30 26958
Milan Spaghetti 2021-09-30 26775
Milan Spaghetti 2021-12-31 44185
Milan Spaghetti 2022-03-31 32716
Milan Spaghetti 2022-06-30 25881
Milan Maccheroni 2021-03-31 89584
Milan Maccheroni 2021-06-30 81434
Milan Maccheroni 2021-09-30 83360
Milan Maccheroni 2021-12-31 123945
Milan Maccheroni 2022-03-31 97278
Milan Maccheroni 2022-06-30 82959
Rome Spaghetti 2021-01-31 1524
Rome Spaghetti 2021-04-30 1548
Rome Spaghetti 2021-07-31 1577
Rome Spaghetti 2021-10-31 1438
Rome Spaghetti 2022-01-31 1556
Rome Spaghetti 2022-04-30 1471
Rome Spaghetti 2022-07-31 1453
Rome Maccheroni 2021-01-31 15646
Rome Maccheroni 2021-04-30 15877
Rome Maccheroni 2021-07-31 15289
Rome Maccheroni 2021-10-31 16675
Rome Maccheroni 2022-01-31 17028
Rome Maccheroni 2022-04-30 16490
Rome Maccheroni 2022-07-31 14664
I am looking for a way to calculate the changes between subsequent months for the same city and product, so that the resulting dataframe is like that:我正在寻找一种方法来计算同一城市和产品的后续月份之间的变化,以便生成的 dataframe 如下所示:
city product month sold change
Milan Spaghetti 31/03/2021 30300
Milan Spaghetti 30/06/2021 26958 -3342
Milan Spaghetti 30/09/2021 26775 -183
Milan Spaghetti 31/12/2021 44185 17410
Milan Spaghetti 31/03/2022 32716 -11469
Milan Spaghetti 30/06/2022 25881 -6835
Milan Maccheroni 31/03/2021 89584
Milan Maccheroni 30/06/2021 81434 -8150
Milan Maccheroni 30/09/2021 83360 1926
Milan Maccheroni 31/12/2021 123945 40585
Milan Maccheroni 31/03/2022 97278 -26667
Milan Maccheroni 30/06/2022 82959 -14319
Rome Spaghetti 31/01/2021 1524
Rome Spaghetti 30/04/2021 1548 24
Rome Spaghetti 31/07/2021 1577 29
Rome Spaghetti 31/10/2021 1438 -139
Rome Spaghetti 31/01/2022 1556 118
Rome Spaghetti 30/04/2022 1471 -85
Rome Spaghetti 31/07/2022 1453 -18
Rome Maccheroni 31/01/2021 15646
Rome Maccheroni 30/04/2021 15877 231
Rome Maccheroni 31/07/2021 15289 -588
Rome Maccheroni 31/10/2021 16675 1386
Rome Maccheroni 31/01/2022 17028 353
Rome Maccheroni 30/04/2022 16490 -538
Rome Maccheroni 31/07/2022 14664 -1826
The code shod calculate the change between two rows only if the city and product field are the same.仅当城市和产品字段相同时,代码才计算两行之间的变化。 Is it possible to do it without iterating on the rows?
是否可以在不迭代行的情况下做到这一点?
Please ignore the change in the month format, it is irrelevant for the solution.请忽略月份格式的变化,它与解决方案无关。
This is easy to vectorize this with the shift
function.这很容易通过
shift
function 向量化。 You just have to make sure you:你只需要确保你:
NaN
the first month for every category after the fact.NaN
。 The code would look something like this:代码看起来像这样:
category_cols = ["city", "product"]
df = df.sort_values(category_cols + ["month"])
# Get difference
df["change"] = df.sold - df.sold.shift(1)
# Get new categories, mark missing
new_cat_mask = (df[category_cols] != df[category_cols].shift(1)).any(axis=1)
df.loc[new_cat_mask, "change"] = np.nan
This is what you are looking for.这就是你要找的。
difference_df = (df
.assign(difference=lambda x: x.groupby(['city', 'product'])['sold'].transform(lambda x: x.diff()))
)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.