
How to speed up Pandas contains when moving rows to another dataframe

I have a small script that checks a condition and, if it is true, moves the matching pandas dataframe row to a new dataframe and then removes the row from the original dataframe.

Originally I was doing this with a regex, but it was slow. After some reading on SO I tried it the way shown here, which is slightly quicker.

In production this runs across millions of rows, so any time saved will be a big help.

Is there anything I can do to optimise it further?

import pandas as pd


data = [['thomas cook', 222], ['holidays', 333], ['cheap flights', 444], ['thomascook holidays', 555]]
df1 = pd.DataFrame(data, columns=['query', 'clicks'])
df2 = pd.DataFrame(columns=df1.columns)

print(df1)
                 query  clicks
0          thomas cook     222
1             holidays     333
2        cheap flights     444
3  thomascook holidays     555

brand_terms = ['thomas cook', 'thomascook', 'thomas-cook']
for brand_term in brand_terms:
    condition = df1[df1["query"].str.contains(brand_term, case=False, regex=False)]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df2 = pd.concat([df2, condition], ignore_index=True)
    df1.drop(condition.index, inplace=True)

print(df1)
           query  clicks
1       holidays     333
2  cheap flights     444


print(df2)
                 query clicks
0          thomas cook    222
1  thomascook holidays    555

You can call str.contains() once with all the terms joined into a single pattern, leaving the regex parameter at its default (True):

df2=(df1.loc[df1["query"].str.contains(pat='|'.join(brand_terms), case=False)]
        .reset_index(drop=True))

Output of df2:

    query                   clicks
0   thomas cook             222
1   thomascook holidays     555
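
One caveat when joining terms with '|': the loop in the question passed regex=False, so each term was matched literally, whereas the joined pattern is interpreted as a regular expression. A minimal sketch, assuming some brand terms might contain regex metacharacters (the 'c++' term and the sample rows are made up for illustration), escapes each term with re.escape before joining:

```python
import re

import pandas as pd

# Hypothetical data; "c++" contains regex metacharacters.
df1 = pd.DataFrame(
    {"query": ["thomas cook", "holidays", "c++ holidays"], "clicks": [222, 333, 444]}
)
brand_terms = ["thomas cook", "c++"]

# Escape each term so it is matched literally, then join with alternation.
pattern = "|".join(re.escape(term) for term in brand_terms)

mask = df1["query"].str.contains(pattern, case=False)
df2 = df1.loc[mask].reset_index(drop=True)
```

Without the escaping step, a term such as 'c++' would be rejected as an invalid regular expression.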

Update:

To remove the matched rows from df1, negate the condition with ~ (the bitwise negation operator), for example:

df1=(df1.loc[~df1["query"].str.contains(pat='|'.join(brand_terms), case=False)]
        .reset_index(drop=True))

Note:

Store your condition in a variable for simplicity and performance, so the column is scanned only once. Build df2 before overwriting df1, because the mask is aligned with the original index:

m=df1["query"].str.contains(pat='|'.join(brand_terms), case=False)
df2=df1.loc[m].reset_index(drop=True)
df1=df1.loc[~m].reset_index(drop=True)
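
Put together, the single-mask approach could look like the sketch below. The helper name split_by_terms is hypothetical (it is not part of pandas), and na=False is an added assumption so that missing queries count as non-matches:

```python
import re

import pandas as pd


def split_by_terms(df, column, terms):
    """Return (matched, remaining) frames using one boolean mask.

    Hypothetical helper for illustration, not a pandas API.
    """
    # Terms are escaped so they are matched literally.
    pattern = "|".join(re.escape(t) for t in terms)
    # The column is scanned once; NaN values are treated as non-matches.
    mask = df[column].str.contains(pattern, case=False, na=False)
    matched = df.loc[mask].reset_index(drop=True)
    remaining = df.loc[~mask].reset_index(drop=True)
    return matched, remaining


data = [["thomas cook", 222], ["holidays", 333],
        ["cheap flights", 444], ["thomascook holidays", 555]]
df1 = pd.DataFrame(data, columns=["query", "clicks"])
brand_terms = ["thomas cook", "thomascook", "thomas-cook"]

df2, df1 = split_by_terms(df1, "query", brand_terms)
```

Both frames are derived from the same mask, so there is no per-term loop, no repeated appending, and no in-place drop.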

Try using Modin. It aims to be a drop-in replacement for pandas, so it may improve performance with a one-line change; the article linked below reports speedups of around 4x for Pandas queries.

import modin.pandas as pd

Here is an article to get started: https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html

  1. Try Vaex as an alternative to pandas; it is reported to be faster than pandas for string operations.

  2. Another option is to call set_index on query. I don't know the nature of your query column, so I can't really tell whether this will help. If your queries fall in a defined set of values, setting the column as the index reduces the rows to check, and looking up an index value returns all matching rows.

  3. You can also change your regex to 'thomas\s?-?cook' and check whether that is faster.
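
A minimal sketch of the single-pattern idea in item 3, assuming the intended pattern is thomas\s?-?cook with no anchors (anchoring would stop str.contains from matching longer queries such as 'thomascook holidays'). The sample queries are made up for illustration:

```python
import re

import pandas as pd

# One pattern covering "thomas cook", "thomascook" and "thomas-cook".
pattern = r"thomas\s?-?cook"

queries = pd.Series(
    ["thomas cook", "thomascook holidays", "Thomas-Cook", "holidays"]
)

# flags=re.IGNORECASE plays the same role as case=False.
mask = queries.str.contains(pattern, flags=re.IGNORECASE)
```

Whether this is quicker than an alternation of literal terms depends on the data, so it is worth timing both on a realistic sample.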
