How to keep index when using pandas merge

I would like to merge two DataFrames and keep the index from the first frame as the index of the merged result. However, when I do the merge, the resulting DataFrame has an integer index. How can I specify that I want to keep the index from the left DataFrame?

In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3}, 
                          'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})

In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3}, 
                          'to_merge_on': {0: 1, 1: 3, 2: 5}})

In [6]: a
Out[6]:
   col1  to_merge_on
a     1            1
b     2            3
c     3            4

In [7]: b
Out[7]:
   col2  to_merge_on
0     1            1
1     2            3
2     3            5

In [8]: a.merge(b, how='left')
Out[8]:
   col1  to_merge_on  col2
0     1            1   1.0
1     2            3   2.0
2     3            4   NaN

In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')

EDIT: Switched to example code that can be easily reproduced.

In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1     1
b         2            3     2
c         3            4   NaN

Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
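To illustrate the caveat above, here is a minimal sketch. The frame b_dup is hypothetical (it is not in the original question) and repeats the key 3, so the left merge returns one extra row. Keeping the first match is only one possible clean-up; which duplicate to keep depends on your data.

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "to_merge_on": [1, 3, 4]},
                 index=["a", "b", "c"])
# Hypothetical right frame in which the key 3 appears twice
b_dup = pd.DataFrame({"col2": [1, 2, 9], "to_merge_on": [1, 3, 3]})

merged = a.reset_index().merge(b_dup, how="left").set_index("index")
# Row "b" of a matches two rows of b_dup, so the label "b" now appears twice
print(merged)

# One possible clean-up: keep only the first match per original index label
deduped = merged[~merged.index.duplicated(keep="first")]
print(deduped)
```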

You can copy the index into a column on the left DataFrame and then merge.

a['copy_index'] = a.index
a.merge(b, how='left')

I found this simple method very useful while working with large DataFrames and using pd.merge_asof() (or dd.merge_asof()).

This approach may be preferable when resetting the index is expensive (large DataFrame).
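As a rough sketch of how the copied index can be used with pd.merge_asof(): the frames left and right and their columns t, x and y are made up for illustration, and the final set_index step is my addition for the case where you want the labels back as the index rather than as a column.

```python
import pandas as pd

# Hypothetical data; merge_asof requires both frames to be sorted by the key
left = pd.DataFrame({"t": [1, 5, 10], "x": [10, 20, 30]},
                    index=["a", "b", "c"])
right = pd.DataFrame({"t": [1, 2, 3, 6, 7],
                      "y": [100, 200, 300, 600, 700]})

# Carry the original labels along as an ordinary column
left_with_copy = left.assign(copy_index=left.index)

# By default merge_asof matches each left key to the last right key <= it
out = pd.merge_asof(left_with_copy, right, on="t")

# Optional: move the copied labels back into the index
out = out.set_index("copy_index")
out.index.name = None
print(out)
```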

There is a non-pd.merge solution using Series.map and DataFrame.set_index.

In: a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2'])
In: a
Out:
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

This doesn't introduce a dummy name for the index.

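Note, however, that this lookup works on one column at a time. DataFrame has no equivalent lookup-style map (in pandas 2.1 and later, DataFrame.map exists but applies a function elementwise), so this approach does not extend directly to multiple columns.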
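If you do need more than one column this way, one workaround is to repeat the Series.map lookup per column. This is only a sketch: b2 and col3 are hypothetical, and it assumes the keys in b2 are unique, since the mapping is built from them.

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "to_merge_on": [1, 3, 4]},
                 index=["a", "b", "c"])
# Hypothetical right frame with two value columns and unique keys
b2 = pd.DataFrame({"to_merge_on": [1, 3, 5],
                   "col2": [1, 2, 3],
                   "col3": ["x", "y", "z"]})

lookup = b2.set_index("to_merge_on")  # assumes the keys are unique
result = a.copy()
for col in ["col2", "col3"]:
    # Each column is looked up separately; unmatched keys become NaN
    result[col] = result["to_merge_on"].map(lookup[col])
print(result)
```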

df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)

This keeps the index of df1.
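Note that this only applies when the index itself is the join key on both sides, which is not quite the situation in the question. A small sketch with made-up frames:

```python
import pandas as pd

# Hypothetical frames that share index labels as the join key
df1 = pd.DataFrame({"col1": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"col2": [10, 30]}, index=["a", "c"])

# Inner join: only labels present in both frames survive
inner = df1.merge(df2, how="inner", left_index=True, right_index=True)

# Left join: every label of df1 is kept, unmatched rows get NaN
left = df1.merge(df2, how="left", left_index=True, right_index=True)
print(inner)
print(left)
```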

Another simple option is to put the original index back on the merged result:

a.merge(b, how="left").set_axis(a.index)

A left merge preserves the row order of a and only resets the index, so set_axis is safe to use as long as no row of a matches more than one row of b (otherwise the row counts differ and set_axis fails).
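One way to guard that assumption, as a sketch rather than part of the original answer, is merge's validate argument: with validate="many_to_one" pandas raises a MergeError if a key is repeated in b, so the set_axis call is reached only when the row count is unchanged.

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "to_merge_on": [1, 3, 4]},
                 index=["a", "b", "c"])
b = pd.DataFrame({"col2": [1, 2, 3], "to_merge_on": [1, 3, 5]})

# Raises pandas.errors.MergeError if any key in b is duplicated
merged = a.merge(b, how="left", on="to_merge_on", validate="many_to_one")

# Same number of rows, same order as a: safe to restore the labels
merged = merged.set_axis(a.index)
print(merged)
```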

Assuming that the resulting df has the same number of rows and the same order as your first df, you can do this:

c = pd.merge(a, b, on='to_merge_on', how='left')  # how='left' keeps every row of a
c.set_index(a.index, inplace=True)

I think I've come up with a different solution. I was joining the left table on its index and the right table on a column whose values correspond to the left table's index. What I did was a normal merge:

First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')

Then I retrieved the new index values from the merged table and put them in a new column named Sentiment Line Number:

First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()

Then I manually set the index back to the original left-table index, using the existing Line Number column (the column I joined on against the left table's index):

First10ReviewsJoined.set_index('Line Number', inplace=True)

Then I removed the index name Line Number so that it stays blank:

First10ReviewsJoined.index.name = None

Maybe a bit of a hack, but it seems to work well and is relatively simple. I also guess it reduces the risk of duplicates or messing up your data. Hopefully that all makes sense.
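Put together, the steps above look roughly like this. The frames reviews and sentiment are hypothetical stand-ins for First10Reviews and df, whose contents the answer does not show.

```python
import pandas as pd

# Hypothetical stand-ins for First10Reviews and df
reviews = pd.DataFrame({"review": ["good", "bad", "fine"]},
                       index=[101, 102, 103])
sentiment = pd.DataFrame({"Line Number": [103, 101, 102],
                          "score": [0.5, 0.9, 0.1]})

# Join the left index against the right column "Line Number"
joined = pd.merge(reviews, sentiment, left_index=True, right_on="Line Number")

# Keep whatever index the merge produced as an ordinary column
joined["Sentiment Line Number"] = joined.index.tolist()

# Restore the left table's labels from the join column and drop the name
joined = joined.set_index("Line Number")
joined.index.name = None
print(joined)
```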

For people who want to keep the left index as it was before the left join:

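from typing import Optional

import pandas


def left_join(
    a: pandas.DataFrame,
    b: pandas.DataFrame,
    on: list[str],
    b_columns: Optional[list[str]] = None,
) -> pandas.DataFrame:
    if b_columns:
        # Keep only the join keys plus the requested columns. Use an ordered,
        # de-duplicated list: recent pandas versions reject a set as an indexer.
        b = b[list(dict.fromkeys(on + b_columns))]
    # reset_index() names an unnamed single-level index "index"
    index_columns = [name or "index" for name in a.index.names]
    df = a.reset_index().merge(b, how="left", on=on).set_index(keys=index_columns)
    # Restore the original index names (None for an unnamed index)
    df.index.names = a.index.names
    return df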

You can also use the DataFrame.join() method to achieve the same thing. join keeps the original index. The column to join on can be specified with the on parameter.

In [17]: a.join(b.set_index("to_merge_on"), on="to_merge_on")
Out[17]: 
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN
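For comparison, a quick sketch showing that join (which defaults to a left join) gives the same values as the reset_index / merge / set_index route on the question's data, without the extra index bookkeeping:

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "to_merge_on": [1, 3, 4]},
                 index=["a", "b", "c"])
b = pd.DataFrame({"col2": [1, 2, 3], "to_merge_on": [1, 3, 5]})

# join matches a's "to_merge_on" column against the index of the right frame
via_join = a.join(b.set_index("to_merge_on"), on="to_merge_on")

# The merge-based route needs the index moved out and back in again
via_merge = a.reset_index().merge(b, how="left").set_index("index")
via_merge.index.name = None

print(via_join)
print(via_merge)
```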
