How to filter a DataFrame to keep rows after a specific list of words in a column in Pandas?

Question

How can I filter a dataframe that is sorted by date so that it keeps only the rows from the first occurrence of any word in a specific list onward? I have a df that looks like this:

    Name  Date      Event  Col1
0   Sam   1/1/2020  Apple  Test1
1   Sam   1/2/2020  Apple  Test2
2   Sam   1/3/2020  BALL   Test1
3   Sam   1/3/2020  CAT    Test2
4   Sam   1/5/2020  BALL   Test2
5   Sam   1/6/2020  Apple  Test3
6   Nick  1/5/2020  CAT    Test3
7   Nick  1/6/2020  BALL   Test3
8   Nick  1/7/2020  Apple  Test3
9   Nick  1/8/2020  Apple  Test4
10  Cat   1/1/2020  Apple  Test1
11  Cat   1/2/2020  Bat    Test2




import pandas as pd

df = pd.DataFrame({'Name': {0: 'Sam',
  1: 'Sam',
  2: 'Sam',
  3: 'Sam',
  4: 'Sam',
  5: 'Sam',
  6: 'Nick',
  7: 'Nick',
  8: 'Nick',
  9: 'Nick',
  10: 'Cat',
  11: 'Cat'},
 'Date': {0: '1/1/2020',
  1: '1/2/2020',
  2: '1/3/2020',
  3: '1/3/2020',
  4: '1/5/2020',
  5: '1/6/2020',
  6: '1/5/2020',
  7: '1/6/2020',
  8: '1/7/2020',
  9: '1/8/2020',
  10: '1/1/2020',
  11: '1/2/2020'},
 'Event': {0: 'Apple',
  1: 'Apple',
  2: 'BALL',
  3: 'CAT',
  4: 'BALL',
  5: 'Apple',
  6: 'CAT',
  7: 'BALL',
  8: 'Apple',
  9: 'Apple',
  10: 'Apple',
  11: 'Bat'},
 'Col1': {0: 'Test1',
  1: 'Test2',
  2: 'Test1',
  3: 'Test2',
  4: 'Test2',
  5: 'Test3',
  6: 'Test3',
  7: 'Test3',
  8: 'Test3',
  9: 'Test4',
  10: 'Test1',
  11: 'Test2'}})

For each Name, I would like to keep the rows from the earliest date where BALL or CAT occurs in Event onward. So in my example, I need to eliminate the first two rows and the 11th row, since Apple is the first event there.

I tried using:

event_filter = ['BALL','CAT']
df = df.loc[df['Event'].isin(event_filter)]

I also tried removing the subset based on events, but that removed the 8th row as well.

Any help would be appreciated. The result I am expecting is:

   Name  Date      Event  Col1
0  Sam   1/3/2020  BALL   Test1
1  Sam   1/3/2020  CAT    Test2
2  Sam   1/5/2020  BALL   Test2
3  Sam   1/6/2020  Apple  Test3
4  Nick  1/5/2020  CAT    Test3
5  Nick  1/6/2020  BALL   Test3
6  Nick  1/7/2020  Apple  Test3
7  Nick  1/8/2020  Apple  Test4
8  Cat   1/2/2020  Bat    Test2

Answer 1

It was a little hard to follow (did you switch the event filter from Bat to BALL? :D), and it seems like you are trying to get the first event per person?

If so, I think you need to split the dataframe by name, filter as needed, and then recombine.

Here's a small function to get the first occurrence:

def get_min_index(ser, event_filter):
    # Boolean mask: True where the event is one of the filtered words
    in_event = ser.isin(event_filter)
    # Index label of the first match, or None if this name never matches
    return in_event.loc[in_event].index[0] if in_event.any() else None

Then, assuming your df is already sorted the way you want it:

tdf_lst = []
names = df['Name'].unique()

for name in names:
    tdf = df.loc[df['Name'] == name, :]  # filter for the individual name
    min_idx = get_min_index(tdf['Event'], event_filter)  # label of the first match
    if min_idx is None:
        continue  # skip names that never have a matching event
    tdf = tdf.loc[min_idx:, :]  # select from the first match to the last row
    tdf_lst.append(tdf)

df_fltrd = pd.concat(tdf_lst)

Maybe there's a more elegant solution, but hopefully that's what you are looking for.
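
The loop above relies on the rows already being in date order. Since Date is stored as text in the question, here is a minimal sketch of one way to get there first. It assumes the strings are month/day/year, and the small frame is a made-up, deliberately shuffled stand-in for the real data:

```python
import pandas as pd

# Hypothetical miniature of the question's data, deliberately out of date order.
df = pd.DataFrame({
    'Name':  ['Sam', 'Sam', 'Sam', 'Nick', 'Nick'],
    'Date':  ['1/3/2020', '1/1/2020', '1/6/2020', '1/6/2020', '1/5/2020'],
    'Event': ['BALL', 'Apple', 'Apple', 'Apple', 'CAT'],
})

# Parse the text as real dates (assumption: month/day/year).
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# A stable sort keeps the original relative order of rows sharing a date.
df_sorted = df.sort_values('Date', kind='stable').reset_index(drop=True)

print(df_sorted)
```

Sorting the raw strings instead would compare them character by character, so '1/10/2020' would land before '1/2/2020'. Converting to datetimes first avoids that.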

Answer 2

How about something like this? Also, it seems there is a typo: the last row has Bat. Was this supposed to be BALL (according to your expected output)?

lst = ['CAT', 'BALL']

Check whether each row's event is in the list. If it is, give it 1; if it isn't, give it 0.

import numpy as np

df['C'] = np.where(df['Event'].isin(lst), 1, 0)

After this, we can do a cumsum on column C and filter the rows. This is done by grouping on Name, taking the cumsum of column C within each group, and keeping the rows where the cumsum is greater than 0. The cumsum only becomes greater than 0 once one of the list's elements has occurred in Event for that group (Name).

df = df.loc[df.groupby('Name')['C'].cumsum() > 0].reset_index(drop=True)
df = df.drop(columns='C')  # the helper column is no longer needed
print(df)

   Name      Date  Event   Col1
0   Sam  1/3/2020   BALL  Test1
1   Sam  1/3/2020    CAT  Test2
2   Sam  1/5/2020   BALL  Test2
3   Sam  1/6/2020  Apple  Test3
4  Nick  1/5/2020    CAT  Test3
5  Nick  1/6/2020   BALL  Test3
6  Nick  1/7/2020  Apple  Test3
7  Nick  1/8/2020  Apple  Test4
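
To make the cumsum step easier to follow, here is a small sketch that prints the intermediate values next to each row. The frame is a made-up, cut-down version of the question's data, and `flag` and `running` are illustrative names only. The flags are kept in a separate Series rather than a column C, which is just one possible variation on the code above:

```python
import pandas as pd

# Hypothetical cut-down version of the question's data.
df = pd.DataFrame({
    'Name':  ['Sam', 'Sam', 'Sam', 'Sam', 'Nick', 'Nick', 'Cat'],
    'Event': ['Apple', 'BALL', 'CAT', 'Apple', 'CAT', 'Apple', 'Apple'],
})
lst = ['CAT', 'BALL']

# 1 where the event is in the list, 0 otherwise (same role as column C).
flag = df['Event'].isin(lst).astype(int)

# Running total of the flags, restarted for every name.
running = flag.groupby(df['Name']).cumsum()

# Show the intermediate values side by side.
print(pd.DataFrame({'Name': df['Name'], 'Event': df['Event'],
                    'flag': flag, 'running': running}))

# Keep rows once the running total is above zero. No helper column was
# written to df, so there is nothing to drop afterwards.
result = df.loc[running > 0].reset_index(drop=True)
print(result)
```

Sam's leading Apple row has a running total of 0 and is dropped, while the Apple row after CAT stays because the total is already 2 by then. Cat never reaches 1, so that name disappears entirely, which matches the output above.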

Answer 1
0 2020-07-24 17:12:06

Answer 2
0 Accepted 2020-07-24 17:17:53
