Fastest way to filter a pandas dataframe many times in a loop

I have a dataframe with 3 million rows (df1) and another with 10k rows (df2). What is the fastest way to filter df1 for each row in df2?

Here is exactly what I need to do in the loop:

for i in range(len(df2)):  # for each row of df2
    x = df1[
        df1['column1'].isin([df2['info1'][i]])
        & df1['column2'].isin([df2['info2'][i]])
        & df1['column3'].isin([df2['info3'][i]])
    ]
    # ..... More code using x variable every time ......

This code is not fast enough to be viable.

Note that I used the .isin function, but there is always only one item inside it. I found that using .isin(), as in df1['column1'].isin([df2['info1'][i]]), was faster than using df1['column1'] == df2['info1'][i].
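The .isin versus == comparison can be checked on synthetic data. The following is a minimal sketch, assuming an integer column; the names col and value are made up for illustration, and the column is much smaller than the 3 million rows in the question. Timings depend on dtype, data and pandas version, so no result is shown here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for one column of df1 (assumption: integer dtype).
rng = np.random.default_rng(0)
col = pd.Series(rng.integers(0, 100, size=200_000))
value = 42  # stand-in for df2['info1'][i]

# Both expressions produce the same boolean mask.
mask_isin = col.isin([value])
mask_eq = col == value
assert mask_isin.equals(mask_eq)

# Time each variant over a few repetitions.
t_isin = timeit.timeit(lambda: col.isin([value]), number=20)
t_eq = timeit.timeit(lambda: col == value, number=20)
print(f"isin: {t_isin:.4f}s  ==: {t_eq:.4f}s")
```

Whichever variant wins, both still scan the whole column once per df2 row, so the loop stays proportional to len(df1) * len(df2).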

import pandas as pd
import numpy as np


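def make_filter(x, y, match_dict, unique=False):
    """Build a boolean mask over x from the columns listed in match_dict.

    match_dict maps a column of x to a column of y. Each column is tested
    independently, so a row of x is kept when every mapped value appears
    somewhere in the matching y column (not necessarily in the same y row).
    """
    mask = None
    for x_key in x.columns:
        if x_key in match_dict:
            y_col = y[match_dict[x_key]]
            if unique:
                # Deduplicate the lookup values before calling isin
                y_col = y_col.unique()
            col_filter = x[x_key].isin(y_col)
            if mask is None:
                mask = col_filter
            else:
                mask = mask & col_filter
    return mask


def main():
    n_rows = 100
    # x holds values in 0..3, y holds values in 2..3
    x = np.random.randint(4, size=(n_rows, 2))
    x = pd.DataFrame(x, columns=["col1", "col2"])
    y = np.random.randint(2, 4, size=(n_rows, 2))
    y = pd.DataFrame(y, columns=["info1", "info2"])

    match_dict = {"col1": "info1", "col2": "info2"}
    z = make_filter(x, y, match_dict, unique=True)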

    print(x[z])


main()
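
For the per-row filtering described in the question, one option is to group df1 once and then look up the key of each df2 row, instead of rescanning df1 on every iteration. The following is a minimal sketch, assuming the column names from the question's loop and small synthetic data; build_lookup and matching_rows are hypothetical helper names, not part of pandas.

```python
import numpy as np
import pandas as pd

# Column names taken from the question's loop.
DF1_KEYS = ["column1", "column2", "column3"]
DF2_KEYS = ["info1", "info2", "info3"]


def build_lookup(df1, keys):
    """Group df1 once and map each key tuple to its row positions."""
    return df1.groupby(keys, sort=False).indices


def matching_rows(df1, lookup, key):
    """Return the rows of df1 whose key columns equal `key` (a tuple)."""
    positions = lookup.get(key)
    if positions is None:
        return df1.iloc[0:0]  # empty frame with the same columns
    return df1.iloc[positions]


# Small synthetic data standing in for the real frames (assumption).
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 4, size=(1000, 3)), columns=DF1_KEYS)
df2 = pd.DataFrame(rng.integers(0, 5, size=(20, 3)), columns=DF2_KEYS)

lookup = build_lookup(df1, DF1_KEYS)
for key in df2[DF2_KEYS].itertuples(index=False, name=None):
    x = matching_rows(df1, lookup, key)
    # ... more code using x ...
```

The grouping pass touches df1 once; each later lookup is a dictionary access plus an iloc selection, so the cost per df2 row no longer depends on a full scan of df1.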
