Filter multi-indexed grouped pandas dataframe
The data looks like the following:
id timestamp date value
1 2001-01-01 2001-05-01 0
1 2001-10-01 2001-05-01 1
2 2001-01-01 2001-05-01 0
2 2001-10-01 2001-05-01 0
As you can see, the table contains the columns id, timestamp, date and value. Every row with the same id also has the same date. Furthermore, date always lies somewhere between the first and the last timestamp of each id.
The task is to filter the table so that every id is removed which does not contain at least one entry with value > 0 at a timestamp after its individual date.
I implemented it by multi-indexing the table with level 0 = id and level 1 = timestamp and sorting it. Then I group it by level 0. Next I loop through every group (id) and assign a new value telling me whether the id is "good" (boolean). Finally I filter the table to the rows where good is True.
Unfortunately, this implementation is extremely slow for a big (>10M rows) dataset. I am looking for a way to speed this up. My idea was to use groupby.apply(lambda g: something), but I did not get it to work, and I do not know whether it is the fastest option possible.
Working code example:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'],
                   'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
                   'value': [0, 1, 0, 0]})
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['id', 'timestamp']).sort_index()
grouped = df.groupby(level=0)
df['good'] = False
for id_, df_id in grouped:
    index = df_id.index
    df_id = df_id.droplevel(0)
    # True if any value > 0 occurs from this id's date onward
    df.loc[index, 'good'] = any(df_id.value.loc[df_id.date.iloc[0]:] > 0)
df = df[df.good]
To get all ids that have a value greater than 0 at a timestamp later than their date, create two masks with Series.gt, chain them with & for bitwise AND, and then test whether each group contains at least one True with GroupBy.any and GroupBy.transform:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','timestamp'])
m = df['value'].gt(0) & df['timestamp'].gt(df['date'])
df = df[m.groupby(df['id']).transform('any')]
print(df)
id timestamp date value
0 1 2001-01-01 2001-01-05 0
1 1 2001-01-10 2001-01-05 1
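For a self-contained check, the approach above can be run end to end on the sample data from the question. The sketch below also shows an equivalent variant that collects the qualifying ids once and filters with Series.isin, plus a groupby.filter form of the groupby.apply idea from the question. The function names make_sample, keep_transform, keep_isin and keep_filter are made up for this illustration, and nothing here has been timed on a large dataset.

```python
import pandas as pd

def make_sample():
    # Sample data from the question; the strings are parsed month-first by default.
    df = pd.DataFrame({'id': [1, 1, 2, 2],
                       'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'],
                       'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
                       'value': [0, 1, 0, 0]})
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['date'] = pd.to_datetime(df['date'])
    return df.sort_values(['id', 'timestamp'])

def keep_transform(df):
    # Row-level mask, then broadcast "any True within this id" back to every row.
    m = df['value'].gt(0) & df['timestamp'].gt(df['date'])
    return df[m.groupby(df['id']).transform('any')]

def keep_isin(df):
    # Same row-level mask, but collect the qualifying ids once and filter by membership.
    m = df['value'].gt(0) & df['timestamp'].gt(df['date'])
    good_ids = df.loc[m, 'id'].unique()
    return df[df['id'].isin(good_ids)]

def keep_filter(df):
    # Per-group Python callback; included only for comparison with the loop in the question.
    return df.groupby('id').filter(
        lambda g: ((g['value'] > 0) & (g['timestamp'] > g['date'])).any())

sample = make_sample()
result = keep_transform(sample)
print(result)
```

On this sample all three functions keep only the rows for id 1. The first two work on whole columns, while keep_filter calls a Python function once per group, so it will probably behave more like the original loop when there are many ids.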