Fastest way to filter a pandas dataframe many times in a loop
I have a dataframe with 3 million rows (df1) and another with 10k rows (df2). What is the fastest way to filter df1 for each row in df2?

Here is exactly what I need to do in the loop:
for i in range(len(df2)):  # for each row of df2
    x = df1[(df1['column1'].isin([df2['info1'][i]]))
            & (df1['column2'].isin([df2['info2'][i]]))
            & (df1['column3'].isin([df2['info3'][i]]))]
    # ..... more code using the x variable every time ......
This code is not fast enough to be viable.
Note that I used the .isin function even though there is always only one item inside it. I found that using .isin(), as in

    df1['column1'].isin([df2['info1'][i]])

was faster than using df1['column1'] == df2['info1'][i].
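The claim that .isin([v]) beats == can be checked with a small timing sketch (the column name mirrors the question; the data here is synthetic and the relative timings will vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

# Toy stand-in for df1 from the question, scaled down for a quick run.
df1 = pd.DataFrame({"column1": np.random.randint(0, 100, size=100_000)})
value = 42

isin_time = timeit.timeit(lambda: df1["column1"].isin([value]), number=20)
eq_time = timeit.timeit(lambda: df1["column1"] == value, number=20)
print(f".isin: {isin_time:.4f}s   ==: {eq_time:.4f}s")

# Whichever is faster, both expressions produce the same boolean mask.
assert (df1["column1"].isin([value]) == (df1["column1"] == value)).all()
```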
import pandas as pd
import numpy as np

def make_filter(x, y, match_dict, unique=False):
    # Build one combined boolean mask: for every column of x named in
    # match_dict, keep rows whose value appears in the matching y column.
    mask = None
    for x_key in x.columns:
        if x_key in match_dict:
            y_key = match_dict[x_key]
            y_col = y[y_key]
            if unique:
                y_col = y_col.unique()  # dedupe before the isin lookup
            col_mask = x[x_key].isin(y_col)
            mask = col_mask if mask is None else (mask & col_mask)
    return mask

def main():
    n_rows = 100
    x = pd.DataFrame(np.random.randint(4, size=(n_rows, 2)),
                     columns=["col1", "col2"])
    y = pd.DataFrame(np.random.randint(2, 4, size=(n_rows, 2)),
                     columns=["info1", "info2"])
    match_dict = {"col1": "info1", "col2": "info2"}
    z = make_filter(x, y, match_dict, unique=True)
    print(x[z])

main()
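Another option worth noting is to drop the per-row loop entirely and do a single merge, which pairs each row of df2 with its matching df1 rows in one vectorized operation. This is a sketch, not a benchmarked claim: the column names follow the question, the _row helper column is my own addition to keep track of which df2 row produced each match, and whether it is faster depends on the data.

```python
import numpy as np
import pandas as pd

# Small stand-ins for df1 and df2 from the question.
df1 = pd.DataFrame({
    "column1": [1, 1, 2, 3],
    "column2": [4, 4, 5, 6],
    "column3": [7, 8, 9, 9],
})
df2 = pd.DataFrame({
    "info1": [1, 3],
    "info2": [4, 6],
    "info3": [7, 9],
})

# One merge replaces len(df2) separate boolean filters; the _row column
# records the originating df2 row so the per-row groups can be recovered.
merged = df2.assign(_row=np.arange(len(df2))).merge(
    df1,
    left_on=["info1", "info2", "info3"],
    right_on=["column1", "column2", "column3"],
)

# Iterate over the matches for each df2 row, as the original loop did with x.
for row_id, x in merged.groupby("_row"):
    print(row_id, len(x))
```

With these toy frames, each df2 row matches exactly one df1 row, so the loop body runs once per df2 row with the same rows the per-row .isin filter would have produced.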