How to speed up Pandas contains when moving rows to another dataframe

Question

I have a small script which checks a condition and, if it is true, moves the pandas dataframe row to a new dataframe and then removes the row from the original dataframe.

Originally I was doing this with a regex, but it was slow. After some reading on SO I tried it this way, which is slightly quicker.

In production this runs across millions of rows, so any time saved will be a big help.

Is there anything I can do to optimise it further?

import pandas as pd


data = [['thomas cook', 222], ['holidays', 333], ['cheap flights', 444], ['thomascook holidays', 555]]
df1 = pd.DataFrame(data, columns=['query', 'clicks'])
df2 = pd.DataFrame(columns=df1.columns)

print(df1)
                 query  clicks
0          thomas cook     222
1             holidays     333
2        cheap flights     444
3  thomascook holidays     555

brand_terms = ['thomas cook', 'thomascook', 'thomas-cook']
for brand_term in brand_terms:
    condition = df1[df1["query"].str.contains(brand_term, case=False, regex=False)]
    df2 = pd.concat([df2, condition], ignore_index=True)
    df1.drop(condition.index, inplace=True)

print(df1)
           query  clicks
1       holidays     333
2  cheap flights     444


print(df2)
                 query clicks
0          thomas cook    222
1  thomascook holidays    555

Answer 1

You can call str.contains() once with all the terms joined into a single pattern, leaving the regex parameter at its default (True):

df2=(df1.loc[df1["query"].str.contains(pat='|'.join(brand_terms), case=False)]
        .reset_index(drop=True))

Output of df2:

    query                   clicks
0   thomas cook             222
1   thomascook holidays     555

Update:

To remove the matched rows from df1, you can use ~ (the bitwise negation operator), for example:

df1=(df1.loc[~df1["query"].str.contains(pat='|'.join(brand_terms), case=False)]
        .reset_index(drop=True))

Note:

Store your condition in a variable for simplicity and performance, and select the matching rows before overwriting df1 so that the mask still lines up with df1's index:

m=df1["query"].str.contains(pat='|'.join(brand_terms), case=False)
df2=df1.loc[m].reset_index(drop=True)
df1=df1.loc[~m].reset_index(drop=True)
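
Putting the pieces together, a minimal end-to-end sketch of the move might look like the following. It assumes the brand terms should be matched literally, so each one is passed through re.escape before joining (an addition that is not in the answer above; it only matters if a term contains regex metacharacters such as '.' or '+'). It also passes na=False so that missing values count as non-matches. The helper name split_by_terms is hypothetical.

```python
import re

import pandas as pd


def split_by_terms(df, column, terms):
    """Return (non_matching, matching) frames using one vectorised mask.

    split_by_terms is a hypothetical helper name used for illustration.
    """
    # Escape each term so it is matched literally, then OR them together.
    pattern = "|".join(re.escape(term) for term in terms)
    # Compute the boolean mask once and reuse it for both selections.
    # na=False treats missing values as "no match" so ~mask stays boolean.
    mask = df[column].str.contains(pattern, case=False, regex=True, na=False)
    matching = df.loc[mask].reset_index(drop=True)
    non_matching = df.loc[~mask].reset_index(drop=True)
    return non_matching, matching


data = [['thomas cook', 222], ['holidays', 333],
        ['cheap flights', 444], ['thomascook holidays', 555]]
df1 = pd.DataFrame(data, columns=['query', 'clicks'])
brand_terms = ['thomas cook', 'thomascook', 'thomas-cook']

df1, df2 = split_by_terms(df1, 'query', brand_terms)
print(df1)
print(df2)
```

Both frames are built from the same mask, so the column is scanned once rather than once per brand term, and no rows are appended or dropped inside a loop.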

Answer 2

Try using Modin. It can improve performance with just a one-line change to the import; the article linked below reports Modin speeding up Pandas queries by roughly 4x.

import modin.pandas as pd

Here is an article to get started: https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html

Answer 3

Try Vaex as an alternative to pandas; according to the internet it is faster than pandas for string operations.
Another option you can try is to do a set_index on query. I don't know the nature of your query column, so I can't really tell whether this will help. If your queries fall in a defined set of values, setting the column as the index reduces the rows to check, and looking up a value in that index gets you all matching rows.
You can also change your regex to 'thomas\s?-?cook' and check whether it is faster.
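
As a quick sanity check of the single-pattern idea, the sketch below uses only the standard-library re module to confirm that one pattern covers the three spellings from the question. The pattern thomas\s?-?cook and the sample strings (including "Thomas-Cook deals") are illustrative assumptions; whether this is actually faster than the joined alternation would need timing on real data.

```python
import re

# One pattern for "thomas cook", "thomascook" and "thomas-cook":
# \s?-? allows an optional whitespace character and an optional hyphen
# between the two words.
brand_pattern = re.compile(r"thomas\s?-?cook", flags=re.IGNORECASE)

queries = ["thomas cook", "holidays", "cheap flights",
           "thomascook holidays", "Thomas-Cook deals"]

# search() looks for the pattern anywhere in the string, like str.contains().
matches = [q for q in queries if brand_pattern.search(q)]
print(matches)
```

The same pattern string can be passed to df1["query"].str.contains(..., case=False) in place of the joined list of terms.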

3 answers

Answer 1
2 accepted 2021-08-14 07:04:30

Answer 2
0 2021-08-14 07:07:36

Answer 3
0 2021-08-14 07:37:35
