Fastest way to filter a pandas dataframe multiple times in a loop

Question

I have a dataframe with 3 million rows (df1) and another with 10k rows (df2). What is the fastest method of filtering df1 for each row in df2?

Here is exactly what I need to do in the loop:

for i in range(len(df2)):  # for each row of df2
    x = df1[(df1['column1'].isin([df2['info1'][i]]))
            & (df1['column2'].isin([df2['info2'][i]]))
            & (df1['column3'].isin([df2['info3'][i]]))]
    # ..... More code using x variable every time ......

This code is not fast enough to be viable.

Note that I used the .isin function, but there is always only one item inside it. I found that using .isin(), as in df1['column1'].isin([df2['info1'][i]]), was faster than using df1['column1'] == df2['info1'][i].
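The speed difference between the two forms can be checked with a small timing harness. This is only a sketch: the frame below is a hypothetical, much smaller stand-in for df1, `value` is a made-up lookup value, and the timings will vary with pandas version, dtype, and data size, so no particular result is claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for df1 (far smaller than 3 million rows).
rng = np.random.default_rng(0)
df1 = pd.DataFrame({"column1": rng.integers(0, 100, size=200_000)})
value = 42  # hypothetical single lookup value

# Both expressions build the same boolean mask.
mask_isin = df1["column1"].isin([value])
mask_eq = df1["column1"] == value

# Time each form; compare the numbers on your own machine and data.
t_isin = timeit.timeit(lambda: df1["column1"].isin([value]), number=20)
t_eq = timeit.timeit(lambda: df1["column1"] == value, number=20)
print(f"isin: {t_isin:.4f}s  ==: {t_eq:.4f}s")
```

Either way, each iteration still scans all of df1, so the choice between the two forms does not change the overall cost of the loop by much.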

Answer 1

import pandas as pd
import numpy as np


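def make_filter(x, y, match_dict, unique=False):
    """Build a boolean mask over the rows of x.

    match_dict maps a column name in x to a column name in y. A row of x
    passes when, for every mapped column, its value occurs somewhere in the
    corresponding column of y. Each column is tested independently, so the
    matching values do not have to come from the same row of y.
    """
    mask = None
    for x_key in x.columns:
        if x_key in match_dict:
            y_key = match_dict[x_key]
            y_col = y[y_key]
            if unique:
                # Deduplicate the lookup values before the membership test.
                y_col = y_col.unique()
            # Membership test against the (optionally deduplicated) values.
            col_filter = x[x_key].isin(y_col)
            if mask is None:
                mask = col_filter
            else:
                mask = mask & col_filter
    return mask


def main():
    n_rows = 100
    # x plays the role of df1, y the role of df2.
    x = np.random.randint(4, size=(n_rows, 2))
    x = pd.DataFrame(x, columns=["col1", "col2"])
    y = np.random.randint(2, 4, size=(n_rows, 2))
    y = pd.DataFrame(y, columns=["info1", "info2"])

    match_dict = {"col1": "info1", "col2": "info2"}
    z = make_filter(x, y, match_dict, unique=True)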

    print(x[z])


main()
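If what is needed is the subset of df1 for each individual row of df2, as in the question's loop, one possible alternative is to group df1 once by the three key columns and then look up each df2 row's key in a dictionary. The following is a minimal sketch, not a benchmarked solution: the small random frames and the `value` column are hypothetical stand-ins for the real data, and it assumes the key columns hold hashable values with no missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df1 and df2 from the question (much smaller).
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 5, size=(1000, 3)),
                   columns=["column1", "column2", "column3"])
df1["value"] = np.arange(len(df1))
df2 = pd.DataFrame(rng.integers(0, 5, size=(20, 3)),
                   columns=["info1", "info2", "info3"])

key_cols = ["column1", "column2", "column3"]

# One pass over df1: map each (column1, column2, column3) tuple to its rows.
groups = {key: sub for key, sub in df1.groupby(key_cols)}

# Empty frame with df1's columns, used when a df2 row has no match.
empty = df1.iloc[0:0]

results = []
for key in df2[["info1", "info2", "info3"]].itertuples(index=False, name=None):
    # Dictionary lookup instead of scanning all of df1 again.
    x = groups.get(key, empty)
    # ..... more code using x would go here .....
    results.append(x)
```

The grouping cost is paid once, after which each of the df2 rows costs a dictionary lookup rather than three full-column comparisons. Whether this is faster in practice depends on the number of distinct key combinations and on available memory, so it is worth timing on the real data.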

Solution 1
0 2020-11-13 16:33:45
