How to get index pairs from an inner join using pandas without creating the resulting dataframe?

Question

I would like to merge two dataframes, but instead of returning the full join, return just two columns showing the paired rows.

I've written an example function below, but it creates the inner join of the two dataframes, including all the columns that are not needed. If there are many columns and rows, this can use a lot of memory.

Example

import pandas as pd

def get_index_pairs(df1, df2, on):
    # reset_index() exposes each frame's index as a column named 'index'
    return pd.merge(df1.reset_index(), df2.reset_index(), on=on)[['index_x', 'index_y']]

df1 = pd.DataFrame(dict(key=["a", "b", "c", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "f", "g"], v2=[10, 20, 30, 40]))

pairs = get_index_pairs(df1, df2, on='key')

print(pairs)

Output

   index_x  index_y
0        1        0
1        3        1

I'm looking for a more memory-efficient version of get_index_pairs.

Answer 1

Solution for Duplicate Keys (edited)

Data

The data is slightly adjusted for demo purposes. Specifically, key="d" will produce a 2*2 Cartesian join.

import pandas as pd
import numpy as np
df1 = pd.DataFrame( dict( key = ["a","b","d","d"], v1=[1,2,3,4]))
df2 = pd.DataFrame( dict( key = ["b","d","d","g"], v2=[10,20,30,40]))

Code

Use np.argwhere() to return all matched indices.

ls = []
for i1, k1 in enumerate(df1["key"]):  # the only slow step (explicit for loop)
    for i2 in np.argwhere((df2["key"] == k1).values):
        ls.append([i1, i2[0]])

df_ans = pd.DataFrame(ls, columns=["index_x", "index_y"])

Result

print(df_ans)

   index_x  index_y
0        1        0
1        2        1   <-  Cartesian join on "d" like what
2        2        2   <-  would be produced by an
3        3        1   <-  inner join
4        3        2   <-
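To make the inner loop above easier to follow, here is a small sketch of what np.argwhere() returns for a boolean mask. The names keys, mask, col and flat are made up for this illustration. np.flatnonzero() is shown next to it only as a flat alternative, not as something the answer uses.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2["key"] from the data above
keys = pd.Series(["b", "d", "d", "g"])

# Boolean mask marking the rows whose key equals "d"
mask = (keys == "d").values

# np.argwhere returns one row per match, so the shape is (n_matches, 1);
# this is why the loop in the answer reads the position as i2[0]
col = np.argwhere(mask)

# np.flatnonzero returns the same positions as a flat 1-D array
flat = np.flatnonzero(mask)

print(col.tolist())   # [[1], [2]]
print(flat.tolist())  # [1, 2]
```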

Note

The OP asked for reduced memory usage, not faster execution. Otherwise pd.merge() would be preferred in terms of speed.
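If speed matters as well, one middle ground is to keep pd.merge() but pass it only the key column, so the unneeded value columns never enter the join. This is a sketch, not a measured result. The function name get_index_pairs_slim is made up here, and it assumes both frames have an unnamed index, so that reset_index() produces a column called 'index'.

```python
import pandas as pd

def get_index_pairs_slim(df1, df2, on):
    # Hypothetical helper: select only the key column before merging,
    # so the intermediate join holds three columns instead of every column.
    # Assumes an unnamed index, which reset_index() turns into 'index'.
    left = df1[[on]].reset_index()
    right = df2[[on]].reset_index()
    merged = pd.merge(left, right, on=on)
    return merged[["index_x", "index_y"]]

df1 = pd.DataFrame(dict(key=["a", "b", "d", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "d", "g"], v2=[10, 20, 30, 40]))

pairs = get_index_pairs_slim(df1, df2, on="key")
print(pairs)
```

The merged frame still has one row per matched pair, so memory still grows with the number of matches. Only the per-row width shrinks.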

Old Solution (no duplicate keys)

Construct dicts containing {key: index} pairs for your search. Only a constant multiple of the size of dfN["key"] is consumed during the search.

Code

# {key: index} mappings
dic1 = dict(zip(df1["key"].values, range(len(df1))))
dic2 = dict(zip(df2["key"].values, range(len(df2))))

# collect matched results
ls = []
for k1, v1 in dic1.items():  # the only slow step (explicit for loop)
    if k1 in dic2:  # fast (hashed search)
        ls.append([v1, dic2[k1]])

df_ans = pd.DataFrame(ls, columns=["index_x", "index_y"])

Result

print(df_ans)

   index_x  index_y
0        1        0
1        3        1
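With duplicate keys, dict(zip(...)) keeps only the last index seen for each key, which is why this version is limited to unique keys. The hashed lookup can be extended to duplicates by storing a list of positions per key. The sketch below is one way to do that. The name get_index_pairs_hashed is made up here and, like the code above, it reports row positions rather than index labels.

```python
from collections import defaultdict

import pandas as pd

def get_index_pairs_hashed(df1, df2, on):
    # Hypothetical helper: map each key in df2 to every position where it occurs
    positions = defaultdict(list)
    for i2, k2 in enumerate(df2[on]):
        positions[k2].append(i2)

    # For each row of df1, emit one pair per matching position in df2
    pairs = [
        (i1, i2)
        for i1, k1 in enumerate(df1[on])
        for i2 in positions.get(k1, ())
    ]
    return pd.DataFrame(pairs, columns=["index_x", "index_y"])

df1 = pd.DataFrame(dict(key=["a", "b", "d", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "d", "g"], v2=[10, 20, 30, 40]))

pairs = get_index_pairs_hashed(df1, df2, on="key")
print(pairs)
```

Only the key column of each frame is read, and each df1 row costs one hash lookup instead of a full scan of df2["key"].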

Answer 2

Another option is to use pandas DataFrame.align to align the axes based on the key:

def get_index_pairs(df1, df2, on):
    left, right = pd.DataFrame.align(
        df1.set_index(on), df2.set_index(on), axis="index", join="inner"
    )
    left_index = df1.index[df1.loc[:, on].isin(left.index)]
    right_index = df2.index[df2.loc[:, on].isin(right.index)]
    return pd.DataFrame({"index_x": left_index, "index_y": right_index})


df1 = pd.DataFrame(dict(key=["a", "b", "c", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "f", "g"], v2=[10, 20, 30, 40]))

pairs = get_index_pairs(df1, df2, on="key")

print(pairs)

   index_x  index_y
0        1        0
1        3        1
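Note that the function above pairs the two filtered indexes positionally, so it relies on the matching keys appearing in the same order in both frames. A sketch that looks each key up explicitly is shown below, using Index.get_indexer. The name get_index_pairs_lookup is made up here, and it assumes the keys in df2 are unique, since get_indexer raises an error on a non-unique index.

```python
import pandas as pd

def get_index_pairs_lookup(df1, df2, on):
    # Hypothetical helper; assumes df2[on] holds unique keys.
    # For each key in df1, get its position in df2, or -1 when absent.
    pos = pd.Index(df2[on]).get_indexer(df1[on])
    matched = pos != -1
    return pd.DataFrame(
        {
            "index_x": df1.index[matched],
            "index_y": df2.index[pos[matched]],
        }
    )

df1 = pd.DataFrame(dict(key=["a", "b", "c", "d"], v1=[1, 2, 3, 4]))
# Same keys as in the question, but with "d" listed before "b"
df2 = pd.DataFrame(dict(key=["d", "b", "f", "g"], v2=[20, 10, 30, 40]))

pairs = get_index_pairs_lookup(df1, df2, on="key")
print(pairs)
```

Here the row for "b" in df1 (index 1) is paired with index 1 in df2, and "d" (index 3) with index 0, regardless of the order of keys in df2.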
