How to speed up Pandas contains when moving rows to another dataframe
I have a small script which checks a condition and, if it is true, moves the matching pandas dataframe rows to a new dataframe and then removes them from the original dataframe.
Originally I was doing this with a regex, but that was slow; after some reading on SO I tried it this way, and it's slightly quicker.
The production data I'm using runs this across millions of rows, so any time saved will be a big help.
Anything I can do to optimise it further?
import pandas as pd
data = [['thomas cook', 222], ['holidays', 333], ['cheap flights', 444], ['thomascook holidays', 555]]
df1 = pd.DataFrame(data, columns=['query', 'clicks'])
df2 = pd.DataFrame(columns=df1.columns)
print(df1)
query clicks
0 thomas cook 222
1 holidays 333
2 cheap flights 444
3 thomascook holidays 555
brand_terms = ['thomas cook', 'thomascook', 'thomas-cook']
for brand_term in brand_terms:
    condition = df1[df1["query"].str.contains(brand_term, case=False, regex=False)]
    df2 = df2.append(condition, ignore_index=True)
    df1.drop(condition.index, inplace=True)
print(df1)
query clicks
1 holidays 333
2 cheap flights 444
print(df2)
query clicks
0 thomas cook 222
1 thomascook holidays 555
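(A note on reproducing the snippet above on current pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the same loop can be written with pd.concat, with the behaviour otherwise unchanged.)

```python
import pandas as pd

data = [['thomas cook', 222], ['holidays', 333], ['cheap flights', 444], ['thomascook holidays', 555]]
df1 = pd.DataFrame(data, columns=['query', 'clicks'])
df2 = pd.DataFrame(columns=df1.columns)

brand_terms = ['thomas cook', 'thomascook', 'thomas-cook']
for brand_term in brand_terms:
    matched = df1[df1["query"].str.contains(brand_term, case=False, regex=False)]
    # pd.concat replaces the removed df2.append(...)
    df2 = pd.concat([df2, matched], ignore_index=True)
    df1.drop(matched.index, inplace=True)
```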
You can use str.contains() and leave the regex parameter at its default, joining all the brand terms into a single pattern:
df2 = (df1.loc[df1["query"].str.contains(pat='|'.join(brand_terms), case=False)]
          .reset_index(drop=True))
output of df2:
query clicks
0 thomas cook 222
1 thomascook holidays 555
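One caveat with the joined pattern (my addition, not part of the answer above): if a brand term ever contains regex metacharacters such as '.', '+' or '(', escape each term with re.escape before joining, so every term is matched literally; a sketch on the sample data:

```python
import re
import pandas as pd

brand_terms = ['thomas cook', 'thomascook', 'thomas-cook']
df1 = pd.DataFrame({'query': ['thomas cook', 'holidays', 'cheap flights', 'thomascook holidays'],
                    'clicks': [222, 333, 444, 555]})

# Escape each term so regex metacharacters are treated as literal text.
pattern = '|'.join(re.escape(term) for term in brand_terms)
df2 = (df1.loc[df1["query"].str.contains(pat=pattern, case=False)]
          .reset_index(drop=True))
```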
Update:
You can use ~ (the bitwise negation operator) to keep the non-matching rows (for example):
df1 = (df1.loc[~df1["query"].str.contains(pat='|'.join(brand_terms), case=False)]
          .reset_index(drop=True))
Note:
store your condition in a variable, both for simplicity and so the mask is computed only once:
m = df1["query"].str.contains(pat='|'.join(brand_terms), case=False)
df2 = df1.loc[m].reset_index(drop=True)
df1 = df1.loc[~m].reset_index(drop=True)
(Take df2 before reassigning df1, otherwise the mask no longer lines up with the already-filtered frame.)
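Putting that note together as a quick sanity check on the sample data from the question:

```python
import pandas as pd

brand_terms = ['thomas cook', 'thomascook', 'thomas-cook']
df1 = pd.DataFrame({'query': ['thomas cook', 'holidays', 'cheap flights', 'thomascook holidays'],
                    'clicks': [222, 333, 444, 555]})

# Build the mask once, then split: branded rows go to df2, the rest stay in df1.
m = df1["query"].str.contains(pat='|'.join(brand_terms), case=False)
df2 = df1.loc[m].reset_index(drop=True)
df1 = df1.loc[~m].reset_index(drop=True)
```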
Try using Modin. It can improve performance with a one-line change; the article below reports Modin accelerating Pandas queries by about 4x.
import modin.pandas as pd
Here is an article to get started: https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html
Try Vaex, an alternative to pandas which, according to posts on the internet, is faster than pandas for string operations.
Another option you can try is to do a set_index on query. I don't know the nature of your query column, so I can't really tell whether this will help. But if your queries fall within a defined set, setting the column as the index will reduce the rows to check, and accessing that index can get you all matching rows at once.
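A sketch of that idea, under the assumption that exact matching is enough (the question itself uses substring matching, so this only applies if the brand queries form a known, fixed set):

```python
import pandas as pd

df1 = pd.DataFrame({'query': ['thomas cook', 'holidays', 'cheap flights', 'thomascook holidays'],
                    'clicks': [222, 333, 444, 555]}).set_index('query')

# Exact-match brand queries: a membership test on the index,
# no per-term string scan over every row.
brand_queries = {'thomas cook', 'thomascook holidays'}
mask = df1.index.isin(brand_queries)
df2 = df1[mask]
df1 = df1[~mask]
```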
Also, you can change your regex to 'thomas\s?-?cook' and check.
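To sanity-check that single pattern against the sample queries (my own check, not part of the answer): 'thomas\s?-?cook' covers 'thomas cook', 'thomascook' and 'thomas-cook' in one expression, since both the space and the hyphen are optional.

```python
import pandas as pd

df = pd.DataFrame({'query': ['thomas cook', 'thomascook holidays', 'cheap flights']})
# One pattern covering all three brand spellings.
mask = df['query'].str.contains(r'thomas\s?-?cook', case=False, regex=True)
```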