
Python Pandas - Appending data from multiple data frames onto same row by matching primary identifier, leave blank if no results from that data frame

I'm very new to Python and pandas; I only use them every once in a while when I'm trying to learn and automate an otherwise tedious Excel task. I've come upon a problem where I haven't been able to find what I'm looking for through Google or here on Stack Overflow.

I currently have 6 different Excel (.xlsx) files that I am able to parse and read into data frames. However, whenever I try to append them together, they're simply added on as new rows in the final output Excel file. Instead, I want to append matching data values onto the same row (in new columns, not the same column), so that I can see whether or not each unique value shows up in each of these data sets. A shortened example is as follows:

[df1]
0    Col1    Col2    
1    XYZ     41235
2    OAIS    15123
3    ABC     48938

[df2]
 0   Col1    Col2
 1   KFJ     21493
 2   XYZ     43782
 3   SHIZ    31299
 4   ABC     33347

[Expected Output]
 0    Col1    [df1]     [df2]    
 1    XYZ     41235     43782
 2    OAIS    15123     
 3    ABC     48938     33347
 4    KFJ               21493
 5    SHIZ              31299

I've tried to use a merge; however, the actual data sheets are much more complicated, in that I want to append 23 columns of data associated with each unique identifier in each data set. For example, [XYZ] in [df2] has associated information across the next 23 columns that I would want to append after the 23 columns from the [XYZ] row in [df1].

How should I go about that? There are approximately 200 rows in each Excel sheet, and essentially I would only need to loop through until a matching unique identifier was found in [df2] for [df1], then in [df3] for [df1], and so on up to [df6], appending those columns onto a new data frame which would eventually be output as a new Excel file.

df1 = pd.read_excel("set1.xlsx")
df2 = pd.read_excel("set2.xlsx")
df3 = pd.read_excel("set3.xlsx")
df4 = pd.read_excel("set4.xlsx")
df5 = pd.read_excel("set5.xlsx")
df6 = pd.read_excel("set6.xlsx")

This is currently how I am reading the Excel files into data frames. I'm sure I could loop it; however, I am unsure of the best practice for doing so instead of hard-coding each data frame initialization.
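For reference, one way to avoid hard-coding each read is to loop over the file names (a minimal sketch, assuming the files keep the setN.xlsx naming shown above; the variable names are just placeholders):

import pandas as pd

# read set1.xlsx .. set6.xlsx into a list instead of six separate variables
file_names = ["set{}.xlsx".format(i) for i in range(1, 7)]
frames = [pd.read_excel(name) for name in file_names]   # frames[0] corresponds to df1, and so on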

You can use the merge function.

pd.merge(df1, df2, on=['Col1'])

You can merge on multiple keys by adding more column names to the list passed to on.
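For example, a merge on two keys would look like the following (a sketch only; Col3 is a hypothetical second shared column, not one from the frames above):

pd.merge(df1, df2, on=['Col1', 'Col3'])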

You can read more about the merge function here.

If you only need certain columns, you can select them like this:

df1.merge(df2[['Col1', 'Col2']], on=['Col1'])

EDIT:

If you need to loop through several data frames, you can loop over all of them except the first and merge each one into the first:

df_list = [df2, df3, df4]

for df in df_list:
    df1 = df1.merge(df[['Col1', 'Col2']], on=['Col1'])

You need merge with the parameter how='outer':

new_df = df1.merge(df2, on = 'Col1',how = 'outer', suffixes=('_df1', '_df2'))

You get

    Col1    Col2_df1    Col2_df2
0   XYZ     41235.0     43782.0
1   OAIS    15123.0     NaN
2   ABC     48938.0     33347.0
3   KFJ     NaN         21493.0
4   SHIZ    NaN         31299.0
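If you also want an explicit flag showing which data set each identifier appears in, merge accepts an indicator parameter; a small sketch building on the call above:

checked = df1.merge(df2, on='Col1', how='outer', suffixes=('_df1', '_df2'), indicator=True)
# the added _merge column reports 'both', 'left_only' or 'right_only' for every row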

For iterative merging, consider storing the data frames in a list and then running the chained merge with reduce(). Below, a list comprehension over the Excel files creates the list of data frames, and enumerate() is used to rename Col2 successively as df1, df2, etc.

from functools import reduce
...

dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)})
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]

df = reduce(lambda x,y: pd.merge(x, y, on=['Col1'], how='outer'), dfList)

#    Col1      df1      df2
# 0   XYZ  41235.0  43782.0
# 1  OAIS  15123.0      NaN
# 2   ABC  48938.0  33347.0
# 3   KFJ      NaN  21493.0
# 4  SHIZ      NaN  31299.0
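If each file really carries 23 data columns rather than a single Col2, one hedged variation is to tag every non-key column with its source file before running reduce, so that the column names never collide (still assuming Col1 is the shared key and the setN.xlsx file names):

from functools import reduce
import pandas as pd

files = ["set1.xlsx", "set2.xlsx", "set3.xlsx", "set4.xlsx", "set5.xlsx", "set6.xlsx"]

# suffix every non-key column with the file it came from, e.g. Col2 -> Col2_df1
dfList = []
for i, xl in enumerate(files, 1):
    frame = pd.read_excel(xl)
    renamed = {c: "{}_df{}".format(c, i) for c in frame.columns if c != 'Col1'}
    dfList.append(frame.rename(columns=renamed))

df_all = reduce(lambda x, y: pd.merge(x, y, on='Col1', how='outer'), dfList)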

Alternatively, use pd.concat to outer-join the data frames horizontally; for this you need to set Col1 as the index:

dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df'+str(i)}).set_index('Col1')
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]

df2 = pd.concat(dfList, axis=1, join='outer', copy=False)\
                .reset_index().rename(columns={'index':'Col1'})

#    Col1      df1      df2
# 0   ABC  48938.0  33347.0
# 1   KFJ      NaN  21493.0
# 2  OAIS  15123.0      NaN
# 3  SHIZ      NaN  31299.0
# 4   XYZ  41235.0  43782.0
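Since the stated end goal is a new Excel file, either result can be written back out with to_excel (combined.xlsx below is just a placeholder file name):

df2.to_excel("combined.xlsx", index=False)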
