
Filter a pandas DataFrame based on the values of 2 consecutive rows

I have a pandas DataFrame and I want to extract consecutive rows where:

  • two values of a given column match 2 given values (in any order)
  • the value in another column is the same
  • the two dates are 1 day apart

To give a concrete example, let's say I have:

from datetime import datetime
import pandas as pd

df = pd.DataFrame([
    [datetime(2021, 1, 1), "Pizza", 50, "Some Place"],
    [datetime(2021, 1, 2), "Noddles", 36, "Some Place"],
    [datetime(2021, 1, 3), "Rice", 10, "Asian Delice"],
    [datetime(2021, 1, 4), "Pizza", 36, "Some Place"],
    [datetime(2021, 1, 5), "Steak", 75, "Steak House"],
    [datetime(2021, 1, 6), "Pizza", 52, "Another Place"],
    [datetime(2021, 1, 6), "Rice", 10, "Asian Delice"],
    [datetime(2021, 1, 7), "Noddles", 42, "Another Place"],
    [datetime(2021, 1, 8), "Dumplings", 12, "Asian Delice"],
    [datetime(2021, 1, 9), "Noddles", 39, "Some Place"],
    [datetime(2021, 1, 10), "Pizza", 53, "Some Place"],
    [datetime(2021, 1, 13), "Noddles", 0, "Some Place"],
    [datetime(2021, 1, 14), "Pizza", 0, "Another Place"],
], columns=["Date", "Food", "Cost", "Restaurant"])

I want to extract the rows where, on 2 consecutive days, we have Pizza and Noddles in the same restaurant, so the result would be:

    Date        Food    Cost Restaurant
0   2021-01-01  Pizza   50  Some Place
1   2021-01-02  Noddles 36  Some Place
5   2021-01-06  Pizza   52  Another Place
7   2021-01-07  Noddles 42  Another Place
9   2021-01-09  Noddles 39  Some Place
10  2021-01-10  Pizza   53  Some Place

How could I achieve that with pandas?

Let us do:

# keep only the foods of interest
df = df.loc[df.Food.isin(['Pizza','Noddles'])]
# flag rows whose date is at most 2 days after the previous remaining row
s = df.Date.diff().dt.days.le(2)
# within each (restaurant, flag) group, keep the group only if it contains both foods
out = df.groupby([df['Restaurant'],s]).filter(lambda x : pd.Series(['Pizza','Noddles']).isin(x['Food']).all())
Out[113]: 
         Date     Food  Cost     Restaurant
1  2021-01-02  Noddles    36     Some Place
3  2021-01-04    Pizza    36     Some Place
5  2021-01-06    Pizza    52  Another Place
7  2021-01-07  Noddles    42  Another Place
9  2021-01-09  Noddles    39     Some Place
10 2021-01-10    Pizza    53     Some Place
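
To make the `groupby(...).filter(...)` step above easier to follow, here is a minimal sketch on a tiny made-up frame (the restaurant names "A" and "B" are illustrative and not part of the question's data). It assumes only that `filter` keeps or drops whole groups depending on whether the callable returns True for that group.

```python
import pandas as pd

# Hypothetical toy data: restaurant "A" serves both foods, "B" only one.
toy = pd.DataFrame({
    "Restaurant": ["A", "A", "B"],
    "Food": ["Pizza", "Noddles", "Pizza"],
})

wanted = pd.Series(["Pizza", "Noddles"])

# The callable is evaluated once per group; a group survives only if
# every wanted food appears somewhere in that group's "Food" column.
kept = toy.groupby("Restaurant").filter(lambda g: wanted.isin(g["Food"]).all())

print(kept.index.tolist())  # [0, 1]
```

Because whole groups are kept or dropped, the grouping keys decide which rows end up being compared with each other.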

Inspired by @BENY (thanks, Beny), I came up with this solution, which does not seem ideal but at least works.

  1. Filter the DataFrame to keep only Noodles and Pizza
  2. Create a new column holding a restaurant ID (so we can use a diff to check that the restaurant is the same)
  3. Diff rows on date and restaurant ID to obtain a mask (note: we need to diff in both directions because both rows of each pair must match)
  4. Apply the mask to retrieve the expected DataFrame

Any suggestion for improvement, or an alternative solution that is more "pandas-ic", is welcome ;)

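# keep only the two foods; copy so that a column can be added without a warning
df = df[df.Food.isin(['Pizza','Noddles'])].copy()
# give each restaurant an integer ID so that diff() can compare neighbouring rows
restaurants = list(set(df.Restaurant))
df["RestoID"] = df.apply(lambda row:restaurants.index(row.Restaurant), axis=1)
# previous row: at most 1 day earlier and same restaurant
mask = df.Date.diff().dt.days.le(1) & df.RestoID.diff().eq(0) 
# next row: at most 1 day later and same restaurant
mask |=  df.Date.diff(-1).dt.days.ge(-1) & df.RestoID.diff(-1).eq(0)
df[mask].drop("RestoID", axis=1)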

And the result is:

    Date        Food   Cost  Restaurant
0   2021-01-01  Pizza   50  Some Place
1   2021-01-02  Noddles 36  Some Place
5   2021-01-06  Pizza   52  Another Place
7   2021-01-07  Noddles 42  Another Place
9   2021-01-09  Noddles 39  Some Place
10  2021-01-10  Pizza   53  Some Place
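
As a possible simplification of step 2, the integer restaurant ID could be built with `pd.factorize` instead of a Python list lookup per row. This is only a sketch on a small made-up Series, not a change to the solution above; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample of restaurant names, in row order.
names = pd.Series(["Some Place", "Some Place", "Another Place", "Some Place"])

# factorize assigns integer codes in order of first appearance and
# also returns the distinct values those codes refer to.
codes, uniques = pd.factorize(names)

print(codes.tolist())    # [0, 0, 1, 0]
print(list(uniques))     # ['Some Place', 'Another Place']
```

Equal codes on neighbouring rows mean the same restaurant, so a `diff()` of 0 on the codes can play the same role as on the `RestoID` column.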

A more elegant solution is to shift rows to perform the comparisons, something like this:

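df = df.loc[df.Food.isin(['Pizza','Noddles'])]
mask = False
for i in [-1, 1]:
    # i = 1 compares with the previous row, i = -1 with the next one;
    # eq(i) requires the two dates to be exactly 1 day apart
    mask |= df.Date.diff(i).dt.days.eq(i) & df.Food.ne(df.Food.shift(i)) & df.Restaurant.eq(df.Restaurant.shift(i))
df[mask]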
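
For reference, the shift-based idea can be wrapped in a function and run against the question's sample data. This is a sketch under stated assumptions, not a definitive implementation: the function name `consecutive_day_pairs` is made up here, the frame is assumed to be sorted by date, and the two rows of a pair are assumed to be adjacent once the other foods are filtered out.

```python
from datetime import datetime
import pandas as pd

df = pd.DataFrame([
    [datetime(2021, 1, 1), "Pizza", 50, "Some Place"],
    [datetime(2021, 1, 2), "Noddles", 36, "Some Place"],
    [datetime(2021, 1, 3), "Rice", 10, "Asian Delice"],
    [datetime(2021, 1, 4), "Pizza", 36, "Some Place"],
    [datetime(2021, 1, 5), "Steak", 75, "Steak House"],
    [datetime(2021, 1, 6), "Pizza", 52, "Another Place"],
    [datetime(2021, 1, 6), "Rice", 10, "Asian Delice"],
    [datetime(2021, 1, 7), "Noddles", 42, "Another Place"],
    [datetime(2021, 1, 8), "Dumplings", 12, "Asian Delice"],
    [datetime(2021, 1, 9), "Noddles", 39, "Some Place"],
    [datetime(2021, 1, 10), "Pizza", 53, "Some Place"],
    [datetime(2021, 1, 13), "Noddles", 0, "Some Place"],
    [datetime(2021, 1, 14), "Pizza", 0, "Another Place"],
], columns=["Date", "Food", "Cost", "Restaurant"])


def consecutive_day_pairs(frame, foods=("Pizza", "Noddles")):
    """Hypothetical helper: keep rows forming a two-food pair on
    consecutive days in the same restaurant (frame sorted by Date)."""
    sub = frame[frame["Food"].isin(foods)]
    mask = pd.Series(False, index=sub.index)
    for i in (-1, 1):  # -1: compare with next row, 1: with previous row
        one_day_apart = sub["Date"].diff(i).dt.days.eq(i)
        other_food = sub["Food"].ne(sub["Food"].shift(i))
        same_place = sub["Restaurant"].eq(sub["Restaurant"].shift(i))
        mask |= one_day_apart & other_food & same_place
    return sub[mask]


result = consecutive_day_pairs(df)
print(result.index.tolist())  # [0, 1, 5, 7, 9, 10]
```

If rows from different restaurants can interleave between the two halves of a pair, the adjacency assumption no longer holds, and applying the same comparison per restaurant (for example inside a `groupby("Restaurant")`) would be a safer variant.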

