How can I filter a DataFrame to keep the rows after a specific list of words in a column in Pandas?

How can I filter a dataframe, sorted by date, so that it keeps only the rows after the first occurrence of any word from a specific list? I have a df that looks like this:

    Name  Date      Event  Col1
0   Sam   1/1/2020  Apple  Test1
1   Sam   1/2/2020  Apple  Test2
2   Sam   1/3/2020  BALL   Test1
3   Sam   1/3/2020  CAT    Test2
4   Sam   1/5/2020  BALL   Test2
5   Sam   1/6/2020  Apple  Test3
6   Nick  1/5/2020  CAT    Test3
7   Nick  1/6/2020  BALL   Test3
8   Nick  1/7/2020  Apple  Test3
9   Nick  1/8/2020  Apple  Test4
10  Cat   1/1/2020  Apple  Test1
11  Cat   1/2/2020  Bat    Test2




import pandas as pd

df = pd.DataFrame({'Name': {0: 'Sam',
  1: 'Sam',
  2: 'Sam',
  3: 'Sam',
  4: 'Sam',
  5: 'Sam',
  6: 'Nick',
  7: 'Nick',
  8: 'Nick',
  9: 'Nick',
  10: 'Cat',
  11: 'Cat'},
 'Date': {0: '1/1/2020',
  1: '1/2/2020',
  2: '1/3/2020',
  3: '1/3/2020',
  4: '1/5/2020',
  5: '1/6/2020',
  6: '1/5/2020',
  7: '1/6/2020',
  8: '1/7/2020',
  9: '1/8/2020',
  10: '1/1/2020',
  11: '1/2/2020'},
 'Event': {0: 'Apple',
  1: 'Apple',
  2: 'BALL',
  3: 'CAT',
  4: 'BALL',
  5: 'Apple',
  6: 'CAT',
  7: 'BALL',
  8: 'Apple',
  9: 'Apple',
  10: 'Apple',
  11: 'Bat'},
 'Col1': {0: 'Test1',
  1: 'Test2',
  2: 'Test1',
  3: 'Test2',
  4: 'Test2',
  5: 'Test3',
  6: 'Test3',
  7: 'Test3',
  8: 'Test3',
  9: 'Test4',
  10: 'Test1',
  11: 'Test2'}})

I would like to keep the rows from the earliest date where BALL or CAT occurs in Event onward. So in my example, I would need to eliminate the first two rows and the 11th row, since they have Apple as the first event.

I tried using:

event_filter = ['BALL','CAT']
df = df.loc[df['Event'].isin(event_filter)]

I also tried to remove the subset based on events, but it removed the 8th row as well.
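For reference, a minimal sketch of why the `isin` attempt is not enough on its own: it keeps only the rows whose Event is in the list, so later Apple rows for the same Name are dropped too. The small frame `sample` is a hypothetical subset of the example data, not the full df.

```python
import pandas as pd

# Hypothetical subset of the example data (Sam's rows only).
sample = pd.DataFrame({
    'Name': ['Sam', 'Sam', 'Sam', 'Sam'],
    'Event': ['Apple', 'BALL', 'CAT', 'Apple'],
})
event_filter = ['BALL', 'CAT']

# A plain isin mask selects only the matching rows ...
only_matches = sample.loc[sample['Event'].isin(event_filter)]
print(only_matches['Event'].tolist())
# ... so the trailing 'Apple' row (index 3), which should be kept, is lost.
```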

Any help would be appreciated. The result I am expecting is:

    Name  Date      Event  Col1
0   Sam   1/3/2020  BALL   Test1
1   Sam   1/3/2020  CAT    Test2
2   Sam   1/5/2020  BALL   Test2
3   Sam   1/6/2020  Apple  Test3
4   Nick  1/5/2020  CAT    Test3
5   Nick  1/6/2020  BALL   Test3
6   Nick  1/7/2020  Apple  Test3
7   Nick  1/8/2020  Apple  Test4
8   Cat   1/2/2020  Bat    Test2

It was a little hard to follow (did you switch the event filter from Bat to BALL? :D), and it seems like you are trying to get the first event per person?

If so, I think you need to split the dataframe by name, filter as needed, and then recombine.

Here's a small function to get the first occurrence:

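def get_min_index(ser, event_filter):
    # Boolean mask: True where the event is in the filter list
    in_event = ser.isin(event_filter)
    # Index label of the first match (raises IndexError if there is none)
    return in_event.loc[in_event].index[0]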

Then, assuming your df is already sorted the way you want it:

tdf_lst = []
names = df['Name'].unique()

for name in names:
    tdf = df.loc[df['Name'] == name, :]  # filter for the individual name
    if not tdf['Event'].isin(event_filter).any():
        continue  # no matching event for this name, so none of its rows are kept
    min_idx = get_min_index(tdf['Event'], event_filter)  # index label of the first match
    tdf = tdf.loc[min_idx:, :]  # select from the first match to the last row
    tdf_lst.append(tdf)

df_fltrd = pd.concat(tdf_lst)

Maybe there's a more elegant solution, but hopefully that's what you are looking for.
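The loop relies on the rows being in date order within each name. If that is not guaranteed, one possible way to sort first is sketched below. It assumes the Date strings are in month/day/year form, as in the example data, and the helper name `sort_by_date` is made up for illustration.

```python
import pandas as pd

def sort_by_date(frame):
    """Return a copy of frame sorted by its parsed Date column (assumed m/d/Y)."""
    # Parse into a temporary key so the original Date strings stay untouched.
    parsed = pd.to_datetime(frame['Date'], format='%m/%d/%Y')
    return (frame.assign(_date=parsed)
                 .sort_values('_date', kind='mergesort')  # stable: ties keep their order
                 .drop(columns='_date'))

# Hypothetical out-of-order rows; sorting the raw strings would put 1/10 first.
unsorted = pd.DataFrame({
    'Name': ['Sam', 'Sam', 'Sam'],
    'Date': ['1/10/2020', '1/2/2020', '1/3/2020'],
    'Event': ['Apple', 'Apple', 'BALL'],
})
sorted_df = sort_by_date(unsorted)
print(sorted_df['Date'].tolist())  # ['1/2/2020', '1/3/2020', '1/10/2020']
```

Sorting on parsed dates rather than on the raw strings matters here because, as text, '1/10/2020' sorts before '1/2/2020'.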

How about something like this? Also, it seems there is a typo: the last row has Bat. Was this supposed to be BALL (according to your expected output)?

lst = ['CAT', 'BALL']

Check whether each value in Event is in the list: if it is, give it 1; if it is not, give it 0.

import numpy as np

df['C'] = np.where(df['Event'].isin(lst), 1, 0)

After this, we can take the cumulative sum of column C within each Name and use it to filter the rows: group by Name, apply cumsum to column C, and keep the rows where the cumsum is greater than 0. Within a group, the cumsum only becomes greater than 0 once one of the events in the list has occurred, and it stays above 0 for every later row of that Name.

df = df.loc[df.groupby('Name')['C'].cumsum() > 0].reset_index(drop=True)
df.drop(columns='C', inplace=True)  # positional axis argument is not accepted in pandas 2.x
print(df)

   Name      Date  Event   Col1
0   Sam  1/3/2020   BALL  Test1
1   Sam  1/3/2020    CAT  Test2
2   Sam  1/5/2020   BALL  Test2
3   Sam  1/6/2020  Apple  Test3
4  Nick  1/5/2020    CAT  Test3
5  Nick  1/6/2020   BALL  Test3
6  Nick  1/7/2020  Apple  Test3
7  Nick  1/8/2020  Apple  Test4
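To make the cumulative-sum step concrete, here is a small sketch that prints the per-name running total and then applies the same idea without a helper column. The frame `sample` is a hypothetical subset of the example data, and `keep_from_first_match` is an illustrative name, not part of pandas.

```python
import pandas as pd

sample = pd.DataFrame({
    'Name':  ['Sam', 'Sam', 'Sam', 'Nick', 'Nick', 'Cat'],
    'Event': ['Apple', 'BALL', 'Apple', 'CAT', 'Apple', 'Apple'],
})
lst = ['CAT', 'BALL']

# 1 where the event is in the list, 0 otherwise.
flag = sample['Event'].isin(lst).astype(int)
# Running total per Name: stays 0 until the first match, then is >= 1.
running = flag.groupby(sample['Name']).cumsum()
print(running.tolist())  # [0, 1, 1, 1, 1, 0]

def keep_from_first_match(frame, events):
    """Keep each Name's rows from its first matching Event onward."""
    hit = frame['Event'].isin(events).astype(int)
    # Running maximum per Name: becomes 1 at the first match and stays 1.
    seen = hit.groupby(frame['Name']).cummax().astype(bool)
    return frame.loc[seen].reset_index(drop=True)

result = keep_from_first_match(sample, lst)
print(result['Event'].tolist())  # ['BALL', 'Apple', 'CAT', 'Apple']
```

A name with no matching event at all (Cat in this subset) never gets a non-zero running value, so all of its rows are dropped, which matches the printed output of the answer.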

 