
How do I get the index pairs from an inner join using pandas without creating the resulting dataframe?

I would like to merge two dataframes, but instead of returning the full join, return just two columns showing which rows were paired.

I've written an example function below, but it builds the full inner join of the two dataframes, including all the columns that are not needed. If there are many columns and rows, this can use a lot of memory.

Example

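import pandas as pd

def get_index_pairs(df1, df2, on):
    # reset_index() exposes the row labels as an "index" column in each frame;
    # the merge then suffixes the two clashing columns with _x and _y.
    return pd.merge(df1.reset_index(), df2.reset_index(), on=on)[['index_x', 'index_y']]

df1 = pd.DataFrame(dict(key=["a", "b", "c", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "f", "g"], v2=[10, 20, 30, 40]))

pairs = get_index_pairs(df1, df2, on='key')

print(pairs)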

Output

   index_x  index_y
0        1        0
1        3        1

I'm looking for a more memory-efficient version of get_index_pairs.

Solution for Duplicate Keys (edited)

Data

The data is slightly adjusted for demonstration purposes. Specifically, key="d" now appears twice in each dataframe, so it produces a 2*2 Cartesian join.

import pandas as pd
import numpy as np
df1 = pd.DataFrame( dict( key = ["a","b","d","d"], v1=[1,2,3,4]))
df2 = pd.DataFrame( dict( key = ["b","d","d","g"], v2=[10,20,30,40]))

Code

Use np.argwhere() to return all matched indices.

ls = []
for i1, k1 in enumerate(df1["key"]):  # the only slow step (explicit for loop)
    for i2 in np.argwhere((df2["key"] == k1).values):
        ls.append([i1, i2[0]])

df_ans = pd.DataFrame(ls, columns=["index_x", "index_y"])

Result

print(df_ans)

   index_x  index_y
0        1        0
1        2        1   <-  Cartesian join on "d" like what
2        2        2   <-  would be produced by an
3        3        1   <-  inner join
4        3        2   <-

Note

The OP asked for reduced memory usage, not faster execution. Otherwise, pd.merge() would be preferred for speed.
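If speed matters as well, one possible middle ground is to hand pd.merge() only the key column, so the intermediate join never carries the payload columns. The following is a minimal sketch rather than part of the original answer: the name get_index_pairs_keys_only is made up for illustration, and it assumes a single key column and an unnamed default index (so that reset_index() produces a column called "index").

```python
import pandas as pd

def get_index_pairs_keys_only(df1, df2, on):
    # Keep only the join key; reset_index() exposes the row labels as "index".
    left = df1[[on]].reset_index()
    right = df2[[on]].reset_index()
    # Both frames carry an "index" column, so merge suffixes them _x and _y.
    return pd.merge(left, right, on=on)[["index_x", "index_y"]]

df1 = pd.DataFrame(dict(key=["a", "b", "d", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "d", "g"], v2=[10, 20, 30, 40]))

pairs = get_index_pairs_keys_only(df1, df2, on="key")
print(pairs)
```

The merged frame still materializes one row per matched pair, but each row holds only the key and the two index columns, regardless of how many other columns df1 and df2 have.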

Old Solution (no duplicate keys)

Construct dicts containing {key: index} pairs for the lookup. The search then consumes only a constant multiple of the size of dfN["key"].

Code

# {key: index} mappings
dic1 = dict(zip(df1["key"].values, range(len(df1))))
dic2 = dict(zip(df2["key"].values, range(len(df2))))

# collect matched results
ls = []
for k1, v1 in dic1.items():  # the only slow step (explicit for loop)
    if k1 in dic2:  # fast (hashed search)
        ls.append([v1, dic2[k1]])

df_ans = pd.DataFrame(ls, columns=["index_x", "index_y"])

Result

print(df_ans)

   index_x  index_y
0        1        0
1        3        1
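
The dictionary approach above keeps only one index per key, so it cannot represent duplicates. A possible extension, sketched here under the assumption that row positions (not labels) are wanted as in the code above, maps each key to a list of positions and keeps the hashed lookup. The function name get_index_pairs_hashed is hypothetical.

```python
from collections import defaultdict

import pandas as pd

def get_index_pairs_hashed(df1, df2, on):
    # Map each key in df2 to every row position where it occurs.
    positions = defaultdict(list)
    for i2, k2 in enumerate(df2[on]):
        positions[k2].append(i2)
    # One hashed lookup per df1 row; duplicate keys expand to all pairings.
    rows = [
        (i1, i2)
        for i1, k1 in enumerate(df1[on])
        for i2 in positions.get(k1, ())
    ]
    return pd.DataFrame(rows, columns=["index_x", "index_y"])

df1 = pd.DataFrame(dict(key=["a", "b", "d", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "d", "g"], v2=[10, 20, 30, 40]))

df_ans = get_index_pairs_hashed(df1, df2, on="key")
print(df_ans)
```

Compared with scanning df2["key"] once per row of df1, this makes a single pass over each key column, at the cost of holding the key-to-positions mapping in memory.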

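Another option is to use the pandas DataFrame.align method to align the axes based on the key. Note that the two index lists are paired positionally, so this assumes the keys are unique and the shared keys appear in the same relative order in both dataframes: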

def get_index_pairs(df1, df2, on):
    left, right = pd.DataFrame.align(
        df1.set_index(on), df2.set_index(on), axis="index", join="inner"
    )
    left_index = df1.index[df1.loc[:, on].isin(left.index)]
    right_index = df2.index[df2.loc[:, on].isin(right.index)]
    return pd.DataFrame({"index_x": left_index, "index_y": right_index})


df1 = pd.DataFrame(dict(key=["a", "b", "c", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["b", "d", "f", "g"], v2=[10, 20, 30, 40]))

pairs = get_index_pairs(df1, df2, on="key")

print(pairs)

   index_x  index_y
0        1        0
1        3        1
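
When the shared keys may appear in a different order in the two dataframes, a lookup-based variant is one way to keep each pair matched by key rather than by position. The sketch below uses Index.get_indexer, which returns -1 for keys that are not found and requires the keys of df2 to be unique. The function name get_index_pairs_unique is an assumption for illustration, not part of the answer above.

```python
import pandas as pd

def get_index_pairs_unique(df1, df2, on):
    # Position in df2 of each df1 key; -1 marks keys absent from df2.
    # get_indexer needs the df2 keys to be unique.
    pos_in_df2 = pd.Index(df2[on]).get_indexer(df1[on])
    matched = pos_in_df2 != -1
    return pd.DataFrame({
        "index_x": df1.index[matched].to_numpy(),
        "index_y": df2.index[pos_in_df2[matched]].to_numpy(),
    })

# Same keys as the example above, but "d" and "b" are swapped in df2.
df1 = pd.DataFrame(dict(key=["a", "b", "c", "d"], v1=[1, 2, 3, 4]))
df2 = pd.DataFrame(dict(key=["d", "b", "f", "g"], v2=[10, 20, 30, 40]))

pairs = get_index_pairs_unique(df1, df2, on="key")
print(pairs)
```

Here "b" sits at row 1 in both frames and "d" at row 3 in df1 and row 0 in df2, so the pairs come out as (1, 1) and (3, 0).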


