
Filter multi-indexed grouped pandas dataframe

The data looks like the following:

id  timestamp   date        value
1   2001-01-01  2001-05-01  0
1   2001-10-01  2001-05-01  1
2   2001-01-01  2001-05-01  0
2   2001-10-01  2001-05-01  0

As you can see, the table contains the columns id, timestamp, date and value. Every row with the same id also has the same date. Furthermore, date always lies somewhere between the first and the last timestamp of each id.

The task is to filter the table so that every id is removed which does not have at least one entry with value > 0 at a timestamp after its individual date.

I implemented it by multi-indexing the table with level 0 = id and level 1 = timestamp and sorting it. Then I group it by level 0. Next I loop through every group (id) and assign a new value telling me whether the id is "good" (boolean). Finally I filter the table to the rows where good is True.

Unfortunately, this implementation is painfully slow for a big (>10M rows) dataset. I am looking for a way to speed it up. My idea was to use groupby.apply(lambda g: something), but I did not get it to work, and I do not know whether this is the fastest option possible.

Working code example:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'], 
                   'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
                   'value': [0, 1, 0, 0]})                               

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['id','timestamp']).sort_index()
grouped = df.groupby(level=0)
df['good'] = False
for group_id, df_id in grouped:
    index = df_id.index
    df_id = df_id.droplevel(0)
    # rows at or after this id's date (the timestamp index is sorted)
    df.loc[index, 'good'] = (df_id.loc[df_id['date'].iloc[0]:, 'value'] > 0).any()
df = df[df['good']]

To keep every id that has a value greater than 0 at a timestamp later than its date, create two masks with Series.gt, chain them with & for bitwise AND, and then test whether each group contains at least one True using GroupBy.transform with 'any':

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','timestamp'])

m = df['value'].gt(0) & df['timestamp'].gt(df['date'])
df = df[m.groupby(df['id']).transform('any')]
print (df)
   id  timestamp       date  value
0   1 2001-01-01 2001-01-05      0
1   1 2001-01-10 2001-01-05      1
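
For reference, the same idea can be packaged as a self-contained sketch. The function name keep_ids_with_late_positive is hypothetical and not part of the original answer, and the variant below collects the matching ids and filters with Series.isin instead of using GroupBy.transform; both should select the same rows. Note that Series.gt is a strict comparison, while the label slice in the loop from the question includes a timestamp equal to date; swap in Series.ge if that boundary case matters.

```python
import pandas as pd


def keep_ids_with_late_positive(df: pd.DataFrame) -> pd.DataFrame:
    """Keep every id with at least one row where value > 0 and timestamp > date.

    Hypothetical helper name, used for illustration only.
    """
    # Row-level condition: a positive value strictly after the id's date.
    hit = df["value"].gt(0) & df["timestamp"].gt(df["date"])
    # ids with at least one hit; all rows of those ids are kept.
    good_ids = df.loc[hit, "id"].unique()
    return df[df["id"].isin(good_ids)]


# Same sample data as in the answer, written in ISO format.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2001-01-01", "2001-01-10", "2001-01-01", "2001-01-10"]
    ),
    "date": pd.to_datetime(["2001-01-05"] * 4),
    "value": [0, 1, 0, 0],
})

result = keep_ids_with_late_positive(df)
print(result)
```

No groups are iterated in Python here, so the cost is a few vectorized passes over the columns, which is the property that makes the mask-based approach suitable for tables with many rows.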
