Pandas: Merge or join dataframes based on column data?

I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on their identical indices. For that, I used code like this:

    data = p_data.join(r_data)

I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are the index levels:

    Name1    Name2    present    r      behavior
    a        1        1          0      0
             2        1          .5     2
             4        3          .125   1
    b        2        1          0      0
             4        5          .25    4
             8        1          0      1

So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that each Name1 & Name2 pair is only represented once). What I now want to add are 4 columns of data that correspond to Name2 (information on the second member of the dyad). Unlike the "present", "r", and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 when merging.

The problem is that while Name2 values are repeated to exhaust the dyad combinations, the "Name2" column in the data I would now like to add has only one row per Name2 individual:

    Name2    Data1    Data2    Data3
    1        80       6        1
    2        61       8        3
    4        45       7        2
    8        30       3        6

What I would like the output to look like:

    Name1    Name2    present    r      behavior    Data1    Data2    Data3
    a        1        1          0      0           80       6        1
             2        1          .5     2           61       8        3
             4        3          .125   1           45       7        2
    b        2        1          0      0           61       8        3
             4        5          .25    4           45       7        2
             8        1          0      1           30       3        6

Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns, but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.

I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to make Name1 part of the MultiIndex again:

    import pandas as pd

    # Dyad-level data, indexed by (Name1, Name2)
    df = pd.DataFrame({'Name1':['a','a','a','b','b','b'],
                       'Name2':[1,2,4,2,4,8],
                       'present':[1,1,3,1,5,1]})
    df.set_index(['Name1','Name2'], inplace=True)

    # Per-individual data, indexed by Name2 only
    df2 = pd.DataFrame({'Data1':[80,61,45,30],
                        'Data2':[6,8,7,3]},
                       index=pd.Series([1,2,4,8], name='Name2'))

    # Move Name1 out of the index, join on Name2, then append Name1 back to the index
    result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
    print(result)
    # (row order may vary with the pandas version)
    #              present  Data1  Data2
    # Name2 Name1
    # 1     a            1     80      6
    # 2     a            1     61      8
    #       b            1     61      8
    # 4     a            3     45      7
    #       b            5     45      7
    # 8     b            1     30      3
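
To see why the join lines up, it can help to look at the intermediate frame. This is only a minimal sketch that rebuilds the same example data; the variable name `step1` is made up for illustration. After `reset_index(level=0)`, Name1 is an ordinary column and Name2 is the sole index, which matches how the per-individual frame is indexed.

```python
import pandas as pd

# Same dyad-level example data as above, indexed by (Name1, Name2)
df = pd.DataFrame({'Name1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Name2': [1, 2, 4, 2, 4, 8],
                   'present': [1, 1, 3, 1, 5, 1]}).set_index(['Name1', 'Name2'])

# Move only the first index level (Name1) back into the columns;
# step1 is a hypothetical name used just for this illustration.
step1 = df.reset_index(level=0)

print(step1.index.name)      # Name2
print(list(step1.columns))   # ['Name1', 'present']
```

Because the remaining index holds repeated Name2 values, each dyad row can pick up the single matching row from the per-individual frame.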

To make the result look even more like your desired DataFrame, you could reorder the index levels and sort the index:

    # Put Name1 first in the MultiIndex, then sort the rows by the index
    print(result.reorder_levels([1,0], axis=0).sort_index())
    #              present  Data1  Data2
    # Name1 Name2
    # a     1            1     80      6
    #       2            1     61      8
    #       4            3     45      7
    # b     2            1     61      8
    #       4            5     45      7
    #       8            1     30      3
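
Since the question asks about join() versus merge(), here is an alternative sketch using `merge` on the Name2 column. It is not the approach shown above, and it assumes the per-individual table keeps Name2 as an ordinary column, as in the question. The names `dyads`, `individuals`, and `merged` are made up for the example. A left merge keeps every dyad row and repeats the matching individual's data.

```python
import pandas as pd

# Dyad-level data, indexed by (Name1, Name2); names here are illustrative only
dyads = pd.DataFrame({'Name1': ['a', 'a', 'a', 'b', 'b', 'b'],
                      'Name2': [1, 2, 4, 2, 4, 8],
                      'present': [1, 1, 3, 1, 5, 1]}).set_index(['Name1', 'Name2'])

# Per-individual data with Name2 as a regular column (one row per individual)
individuals = pd.DataFrame({'Name2': [1, 2, 4, 8],
                            'Data1': [80, 61, 45, 30],
                            'Data2': [6, 8, 7, 3]})

# Turn the index levels into columns, look up each Name2 with a left merge,
# then restore the (Name1, Name2) MultiIndex.
merged = (dyads.reset_index()
               .merge(individuals, on='Name2', how='left')
               .set_index(['Name1', 'Name2']))
print(merged)
```

If the lookup table might accidentally contain the same individual twice, passing `validate='many_to_one'` to `merge` is one way to have pandas raise an error instead of silently duplicating dyad rows.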
