Which is the most efficient way to remove DataFrame rows based on a condition in pandas?

I have a suboptimal solution to a problem and I'm searching for a better one.

My data looks like this:

import pandas as pd

df = pd.DataFrame(columns=['id', 'score', 'duration', 'user'],
                  data=[[1, 800, 60, 'abc'], [1, 900, 60, 'zxc'], [2, 800, 250, 'abc'], [2, 5000, 250, 'bvc'],
                        [3, 6000, 250, 'zxc'], [3, 8000, 250, 'klp'], [4, 1400, 500, 'kod'],
                        [4, 8000, 500, 'bvc']])

As you can see, instances come in pairs: two rows sharing an id and a duration, but with different scores. My goal is to remove all id pairs that have a duration of less than 120 or where at least one user has a score below 1500.
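For the sample above, that should leave only the pair with id 3 (duration 250 and both scores at least 1500), i.e. the expected result is:

   id  score  duration user
4   3   6000       250  zxc
5   3   8000       250  klp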

So far, my solution looks like this:

# keep only instances with duration > 120 (duration is the same for every instance of the same id)
df = df[df['duration'] > 120]

# group by id and get the min score per id
test = df.groupby('id')['score'].min().reset_index()

# then I can get the list of ids where at least one user has a score
# below 1500 and drop both instances with each of those ids

for x in list(test[test['score'] < 1500]['id']):
    df.drop(df.loc[df['id'] == x].index, inplace=True)

However, this last step is not very efficient and quite slow. I have around 700k instances in df and was wondering what the most efficient way is to remove all instances whose id appears in list(test[test['score'] < 1500]['id']). Also note that, for simplicity, I used an integer for id in this example, but my real ids are strings with a format like 4240c195g794530fj4e10z53.

That said, you're also welcome to show me a better initial approach to this problem. Thanks!

You can first create the condition, then group that boolean column by the id column, and transform with all to retain only the groups in which every row satisfies the condition.

# keep only ids where every row has duration greater than or equal to (ge) 120
# and score ge 1500
cond = df['duration'].ge(120) & df['score'].ge(1500)
out = df[cond.groupby(df['id']).transform('all')]

Or, chaining them up in one line:

out = df[(df['duration'].ge(120) & df['score'].ge(1500))
                    .groupby(df['id']).transform('all')]

   id  score  duration user
4   3   6000       250  zxc
5   3   8000       250  klp
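For comparison, an equivalent (though typically slower, since it calls a Python function once per group) way to express the same logic is groupby().filter(); a minimal sketch:

# keep only the groups in which every row passes both checks; filter()
# invokes the lambda once per group, so it is usually slower than the
# vectorized transform('all') mask above
out = df.groupby('id').filter(
    lambda g: (g['duration'].ge(120) & g['score'].ge(1500)).all()
)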

Looping to process a pandas DataFrame or a numpy array is almost always a bad idea for performance. You should use vectorized pandas or numpy methods instead; the apply method is no exception, since it is essentially a Python-level loop and not very performant either.
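As a concrete illustration, the question's drop loop can be replaced with a single boolean mask built by isin (a sketch reusing the df and test variables from the question; bad_ids is just an illustrative name, and isin works the same when id is a string column):

# collect the ids to drop once, then filter in one vectorized pass
# instead of calling df.drop once per id
bad_ids = test.loc[test['score'] < 1500, 'id']
df = df[~df['id'].isin(bad_ids)]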

Below I reproduce anky's answer and add two other, slightly less performant solutions.


def with_isin(df):
    df = df[df['duration'] > 120]
    # boolean Series indexed by id: True where the group's min score is below 1500
    test = df.groupby('id')['score'].min() < 1500
    # keep only rows whose id is NOT among the flagged ids
    return df[~df['id'].isin(test[test].index)]

def with_join(df):
    df = df[df['duration'] > 120]
    test = df.groupby('id')['score'].min() < 1500
    # join the per-id flag back onto the rows, then invert it to keep the good ids
    return df[~df.join(test, rsuffix='_test', on='id')['score_test']]

def anky(df):
    return df[(df['duration'].ge(120) & df['score'].ge(1500))
                    .groupby(df['id']).transform('all')]

%timeit with_isin(df)
#>>> 1.22 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit with_join(df)
#>>> 2.23 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit anky(df)
#>>> 1.15 ms ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
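Note that these timings are for the tiny 8-row sample. To benchmark closer to the question's 700k-row scale, one could generate synthetic data along these lines (a sketch: the 24-character ids only mimic the question's id format, and the score/duration ranges are made-up assumptions for benchmarking purposes):

import numpy as np
import pandas as pd

# roughly the question's scale: 350k ids, two rows each
rng = np.random.default_rng(0)
n_ids = 350_000
big = pd.DataFrame({
    'id': np.repeat([f'{i:024x}' for i in range(n_ids)], 2),
    'score': rng.integers(0, 10_000, size=2 * n_ids),
    'duration': np.repeat(rng.integers(0, 1_000, size=n_ids), 2),
    'user': 'abc',
})

%timeit anky(big)
%timeit with_isin(big)
%timeit with_join(big)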
