How do I get the index pairs from an inner join using pandas without creating the resulting dataframe?
I would like to merge two dataframes, but instead of returning the full join, I only want two columns showing which rows were paired.
I've written an example function below, but it builds the full inner join of the two dataframes, including all the columns that are not needed. If there are many columns and rows, this can use a lot of memory.
import pandas as pd

def get_index_pairs(df1, df2, on):
    # reset_index() exposes each frame's row labels as an "index" column,
    # which the merge then suffixes as index_x / index_y
    return pd.merge(df1.reset_index(), df2.reset_index(), on=on)[['index_x', 'index_y']]

df1 = pd.DataFrame(dict(key=["a", "b", "c", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "f", "g"], v2=[10, 20, 30, 40]))
pairs = get_index_pairs(df1, df2, on='key')
print(pairs)
   index_x  index_y
0        1        0
1        3        1
I'm looking for a more memory-efficient version of get_index_pairs.
Data
The data is slightly adjusted for demonstration purposes. Specifically, key="d" now appears twice in each dataframe, so it produces a 2*2 Cartesian join.
import pandas as pd
import numpy as np
df1 = pd.DataFrame( dict( key = ["a","b","d","d"], v1=[1,2,3,4]))
df2 = pd.DataFrame( dict( key = ["b","d","d","g"], v2=[10,20,30,40]))
Code
Use np.argwhere() to return all matched indices.
ls = []
for i1, k1 in enumerate(df1["key"]):  # the only slow step (explicit for loop)
    # positions in df2 whose key equals the current df1 key
    for i2 in np.argwhere((df2["key"] == k1).values):
        ls.append([i1, i2[0]])
df_ans = pd.DataFrame(ls, columns=["index_x", "index_y"])
Result
print(df_ans)
   index_x  index_y
0        1        0
1        2        1  <- Cartesian join on "d", like what
2        2        2  <- would be produced by an
3        3        1  <- inner join
4        3        2  <-
Note
The OP asked for reduced memory usage, not faster execution. Otherwise pd.merge() would be preferred in terms of speed.
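If the speed of pd.merge() matters, one possible middle ground is to merge only the key columns, so the wide value columns never enter the join. This is a minimal sketch, assuming most of the memory goes to the non-key columns and that neither frame already has a column called "index"; the name get_index_pairs_slim is made up for illustration. The intermediate result still has one row per matched pair, so it does not help if the join itself explodes.

```python
import pandas as pd

def get_index_pairs_slim(df1, df2, on):
    # Keep only the join key; reset_index() exposes the row labels
    # as an "index" column (assumes the index is unnamed).
    left = df1[[on]].reset_index()
    right = df2[[on]].reset_index()
    merged = pd.merge(left, right, on=on, how="inner")
    return merged[["index_x", "index_y"]]

df1 = pd.DataFrame(dict(key=["a", "b", "d", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "d", "g"], v2=[10, 20, 30, 40]))
pairs_slim = get_index_pairs_slim(df1, df2, on="key")
print(pairs_slim)
```

On the adjusted data this should give the same five pairs as the loop above, including the 2*2 block for "d".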
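Construct dicts containing {key: index} pairs for your search. Only a constant multiple of the size of dfN["key"] is consumed during the search process. Note that a dict keeps only one index per key (the last one seen), so this approach assumes the keys are unique within each dataframe, which holds for the question's original data.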
Code
# {key: index} mappings
dic1 = dict(zip(df1["key"].values, range(len(df1))))
dic2 = dict(zip(df2["key"].values, range(len(df2))))
# collect matched results
ls = []
for k1, v1 in dic1.items():  # the only slow step (explicit for loop)
    if k1 in dic2:  # fast (hashed search)
        ls.append([v1, dic2[k1]])
df_ans = pd.DataFrame(ls, columns=["index_x", "index_y"])
Result
print(df_ans)
   index_x  index_y
0        1        0
1        3        1
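The plain {key: index} dicts above hold one index per key, so repeated keys collapse to a single pair. A possible variation, sketched here under the assumption that duplicate keys should behave as in an inner join, maps each key of df2 to a list of row labels instead. The function name get_index_pairs_dict is hypothetical.

```python
from collections import defaultdict

import pandas as pd

def get_index_pairs_dict(df1, df2, on):
    # key -> list of df2 row labels carrying that key
    positions = defaultdict(list)
    for i2, k2 in zip(df2.index, df2[on]):
        positions[k2].append(i2)
    # for each df1 row, emit one pair per matching df2 row
    rows = [
        (i1, i2)
        for i1, k1 in zip(df1.index, df1[on])
        for i2 in positions.get(k1, ())
    ]
    return pd.DataFrame(rows, columns=["index_x", "index_y"])

df1 = pd.DataFrame(dict(key=["a", "b", "d", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "d", "g"], v2=[10, 20, 30, 40]))
pairs_dict = get_index_pairs_dict(df1, df2, on="key")
print(pairs_dict)
```

Only one lookup table is built (for df2), and df1 is streamed row by row, so the extra memory is roughly the size of df2's key column plus the output pairs.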
Another option is to use pandas DataFrame.align to align the axes based on the key:
def get_index_pairs(df1, df2, on):
    # keep only the keys present in both frames
    left, right = pd.DataFrame.align(
        df1.set_index(on), df2.set_index(on), axis="index", join="inner"
    )
    # row labels of each frame whose key survived the inner alignment
    left_index = df1.index[df1.loc[:, on].isin(left.index)]
    right_index = df2.index[df2.loc[:, on].isin(right.index)]
    return pd.DataFrame({"index_x": left_index, "index_y": right_index})

df1 = pd.DataFrame(dict(key=["a", "b", "c", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "f", "g"], v2=[10, 20, 30, 40]))
pairs = get_index_pairs(df1, df2, on="key")
print(pairs)
   index_x  index_y
0        1        0
1        3        1
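The align version above filters each frame separately and then pairs the surviving rows by position, so it relies on the shared keys being unique and appearing in the same order in both frames. Under a weaker assumption (keys unique in df2 only), a vectorized lookup with Index.get_indexer pairs the rows by key rather than by position. This is a sketch, not the answer's method, and get_index_pairs_unique is a hypothetical name.

```python
import pandas as pd

def get_index_pairs_unique(df1, df2, on):
    right_keys = pd.Index(df2[on])
    # get_indexer only works against a unique index
    if not right_keys.is_unique:
        raise ValueError("keys in df2 must be unique for this approach")
    # position of each df1 key inside df2, or -1 when the key is absent
    pos = right_keys.get_indexer(df1[on].to_numpy())
    matched = pos >= 0
    return pd.DataFrame(
        {"index_x": df1.index[matched], "index_y": df2.index[pos[matched]]}
    )

df1 = pd.DataFrame(dict(key=["a", "b", "c", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "f", "g"], v2=[10, 20, 30, 40]))
pairs_unique = get_index_pairs_unique(df1, df2, on="key")
print(pairs_unique)

# keys in a different order are still paired correctly
shuffled = get_index_pairs_unique(
    pd.DataFrame(dict(key=["d", "b"])), pd.DataFrame(dict(key=["b", "d"])), on="key"
)
```

The only intermediate arrays are one integer position per df1 row and a boolean mask, so none of the value columns are copied.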