Python Pandas - Appending data from multiple data frames onto same row by matching primary identifier, leave blank if no results from that data frame
I'm very new to Python and pandas; I only use them once in a while when I'm trying to learn and automate an otherwise tedious Excel task. I've come upon a problem where I haven't been able to find what I'm looking for through Google or here on Stack Overflow.
I currently have 6 different Excel (.xlsx) files that I am able to parse and read into data frames. However, whenever I try to append them together, they are simply added as new rows in the final output Excel file. Instead, I want to append matching data values onto the same row, in new columns rather than the same column, so that I can see whether or not each unique value shows up in each data set. A shortened example is as follows:
[df1]
0 Col1 Col2
1 XYZ 41235
2 OAIS 15123
3 ABC 48938
[df2]
0 Col1 Col2
1 KFJ 21493
2 XYZ 43782
3 SHIZ 31299
4 ABC 33347
[Expected Output]
0 Col1 [df1] [df2]
1 XYZ 41235 43782
2 OAIS 15123
3 ABC 48938 33347
4 KFJ 21493
5 SHIZ 31299
I've tried to use a merge; however, the actual data sheets are much more complicated, in that I want to append 23 columns of data associated with each unique identifier in each data set. For example, [XYZ] in [df2] has associated information across the next 23 columns that I would want to append after the 23 columns from the [XYZ] row in [df1].
How should I go about that? There are approximately 200 rows in each Excel sheet, and I would essentially need to loop through until a matching unique identifier is found in [df2] for [df1], then in [df3] for [df1], and so on through [df6], appending those columns onto a new data frame that would eventually be written out as a new Excel file.
df1 = pd.read_excel("set1.xlsx")
df2 = pd.read_excel("set2.xlsx")
df3 = pd.read_excel("set3.xlsx")
df4 = pd.read_excel("set4.xlsx")
df5 = pd.read_excel("set5.xlsx")
df6 = pd.read_excel("set6.xlsx")
This is currently how I am reading the Excel files into data frames. I'm sure I could loop it; however, I am unsure of the best practice for doing so instead of hard-coding each initialization of the data frames.
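One way to avoid the six hard-coded assignments is a small helper that builds the file names and reads them in a loop. This is a minimal sketch; the `load_sets` name and the `set{}.xlsx` pattern are assumptions taken from the file names in the question.

```python
import pandas as pd

def load_sets(n=6, pattern="set{}.xlsx"):
    """Read set1.xlsx ... setN.xlsx into a list of DataFrames."""
    return [pd.read_excel(pattern.format(i)) for i in range(1, n + 1)]

# Usage, assuming the files sit in the working directory as in the question:
# df1, df2, df3, df4, df5, df6 = load_sets()
```

Keeping the frames in a list (rather than six separate variables) also makes the merge loops in the answers below easier to write.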
You can use the merge function:
pd.merge(df1, df2, on=['Col1'])
You can use multiple keys by adding them to the list passed to the on parameter.
You can read more about the merge function in the pandas documentation.
If you only need certain columns, you can select them before merging:
df1.merge(df2[['Col1', 'Col2']], on=['Col1'])
EDIT:
If you need to loop through several data frames, you can loop through all of them except the first and merge them in turn:
df_list = [df2, df3, df4]
for df in df_list:
    df1 = df1.merge(df[['Col1', 'Col2']], on=['Col1'])
You need merge with the parameter how='outer':
new_df = df1.merge(df2, on='Col1', how='outer', suffixes=('_df1', '_df2'))
You get:
Col1 Col2_df1 Col2_df2
0 XYZ 41235.0 43782.0
1 OAIS 15123.0 NaN
2 ABC 48938.0 33347.0
3 KFJ NaN 21493.0
4 SHIZ NaN 31299.0
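This can be reproduced end to end by building the question's two sample frames in memory instead of reading them from Excel:

```python
import pandas as pd

# The sample data from the question.
df1 = pd.DataFrame({"Col1": ["XYZ", "OAIS", "ABC"],
                    "Col2": [41235, 15123, 48938]})
df2 = pd.DataFrame({"Col1": ["KFJ", "XYZ", "SHIZ", "ABC"],
                    "Col2": [21493, 43782, 31299, 33347]})

# Outer join keeps identifiers that appear in either frame; missing
# values become NaN, i.e. the "leave blank" behavior the question asks for.
new_df = df1.merge(df2, on='Col1', how='outer', suffixes=('_df1', '_df2'))
print(new_df)
```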
For iterative merging, consider storing the data frames in a list and then running the chained merge with reduce(). The snippet below creates the list of data frames with a list comprehension over the Excel files, where enumerate() is used to rename Col2 successively to df1, df2, and so on.
from functools import reduce
...
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df' + str(i)})
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df = reduce(lambda x, y: pd.merge(x, y, on=['Col1'], how='outer'), dfList)
# Col1 df1 df2
# 0 XYZ 41235.0 43782.0
# 1 OAIS 15123.0 NaN
# 2 ABC 48938.0 33347.0
# 3 KFJ NaN 21493.0
# 4 SHIZ NaN 31299.0
Alternatively, use pd.concat to outer join the data frames horizontally; for this you need to set Col1 as the index:
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df' + str(i)}).set_index('Col1')
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df2 = pd.concat(dfList, axis=1, join='outer', copy=False).reset_index()
# Col1 df1 df2
# 0 ABC 48938.0 33347.0
# 1 KFJ NaN 21493.0
# 2 OAIS 15123.0 NaN
# 3 SHIZ NaN 31299.0
# 4 XYZ 41235.0 43782.0
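The concat approach can also be checked without the Excel files by building the question's two sample frames in memory (df3 through df6 are omitted here for brevity):

```python
import pandas as pd

# The question's sample data, renamed and indexed as described above.
frames = {
    "df1": pd.DataFrame({"Col1": ["XYZ", "OAIS", "ABC"],
                         "Col2": [41235, 15123, 48938]}),
    "df2": pd.DataFrame({"Col1": ["KFJ", "XYZ", "SHIZ", "ABC"],
                         "Col2": [21493, 43782, 31299, 33347]}),
}
dfList = [f.rename(columns={"Col2": name}).set_index("Col1")
          for name, f in frames.items()]

# Outer join on the Col1 index; identifiers missing from a frame get NaN.
out = pd.concat(dfList, axis=1, join="outer").reset_index()
print(out)
```

Note that concat aligns on the index, so each identifier appears exactly once, with one column per source frame.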