Python Pandas - Appending data from multiple data frames onto same row by matching primary identifier, leave blank if no results from that data frame
I'm very new to Python and pandas; I only use them once in a while when I'm trying to learn and automate an otherwise tedious Excel task. I've come upon a problem where I haven't been able to find what I'm looking for through Google or here on Stack Overflow.
I currently have 6 different Excel (.xlsx) files that I am able to parse and read into data frames. However, whenever I try to append them together, they are simply added as new rows in the final output Excel file. Instead, I want to append matching data values onto the same row, in new columns rather than the same column, so that I can see whether or not each unique value shows up in each data set. A shortened example is as follows:
[df1]
0 Col1 Col2
1 XYZ 41235
2 OAIS 15123
3 ABC 48938
[df2]
0 Col1 Col2
1 KFJ 21493
2 XYZ 43782
3 SHIZ 31299
4 ABC 33347
[Expected Output]
0 Col1 [df1] [df2]
1 XYZ 41235 43782
2 OAIS 15123
3 ABC 48938 33347
4 KFJ 21493
5 SHIZ 31299
I've tried to use a merge; however, the actual data sheets are much more complicated, in that I want to append 23 columns of data associated with each unique identifier in each data set. For example, [XYZ] in [df2] has associated information across the next 23 columns that I would want to append after the 23 columns from the [XYZ] row in [df1].
How should I go about that? There are approximately 200 rows in each Excel sheet, and I would essentially need to loop through until a matching unique identifier is found in [df2] for [df1], then in [df3] for [df1], and so on through [df6], appending those columns onto a new data frame that would eventually be written out as a new Excel file.
df1 = pd.read_excel("set1.xlsx")
df2 = pd.read_excel("set2.xlsx")
df3 = pd.read_excel("set3.xlsx")
df4 = pd.read_excel("set4.xlsx")
df5 = pd.read_excel("set5.xlsx")
df6 = pd.read_excel("set6.xlsx")
This is currently how I am reading the Excel files into data frames. I'm sure I could loop it; however, I am unsure of the best practice for doing so instead of hard-coding each initialization of the data frames.
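One way to avoid the six hard-coded assignments is a small helper that builds the file names and reads them in a loop. This is a minimal sketch; the `load_sets` name and the `set{}.xlsx` pattern are assumptions taken from the file names in the question.

```python
import pandas as pd

def load_sets(n=6, pattern="set{}.xlsx"):
    """Read set1.xlsx ... setN.xlsx into a list of DataFrames."""
    return [pd.read_excel(pattern.format(i)) for i in range(1, n + 1)]

# Usage, assuming the files sit in the working directory as in the question:
# df1, df2, df3, df4, df5, df6 = load_sets()
```

Keeping the frames in a list (rather than six separate variables) also makes the merge loops in the answers below easier to write.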
You can use the merge function:
pd.merge(df1, df2, on=['Col1'])
You can use multiple keys by adding them to the list passed to the on parameter.
You can read more about the merge function in the pandas documentation.
If you only need certain columns, you can select them before merging:
df1.merge(df2[['Col1', 'Col2']], on=['Col1'])
EDIT:
If you need to loop through several data frames, you can loop through all of them except the first and merge them in turn:
df_list = [df2, df3, df4]
for df in df_list:
    df1 = df1.merge(df[['Col1', 'Col2']], on=['Col1'])
You need merge with the parameter how='outer':
new_df = df1.merge(df2, on='Col1', how='outer', suffixes=('_df1', '_df2'))
You get:
Col1 Col2_df1 Col2_df2
0 XYZ 41235.0 43782.0
1 OAIS 15123.0 NaN
2 ABC 48938.0 33347.0
3 KFJ NaN 21493.0
4 SHIZ NaN 31299.0
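This can be reproduced end to end by building the question's two sample frames in memory instead of reading them from Excel:

```python
import pandas as pd

# The sample data from the question.
df1 = pd.DataFrame({"Col1": ["XYZ", "OAIS", "ABC"],
                    "Col2": [41235, 15123, 48938]})
df2 = pd.DataFrame({"Col1": ["KFJ", "XYZ", "SHIZ", "ABC"],
                    "Col2": [21493, 43782, 31299, 33347]})

# Outer join keeps identifiers that appear in either frame; missing
# values become NaN, i.e. the "leave blank" behavior the question asks for.
new_df = df1.merge(df2, on='Col1', how='outer', suffixes=('_df1', '_df2'))
print(new_df)
```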
For iterative merging, consider storing the data frames in a list and then running the chained merge with reduce(). The snippet below creates the list of data frames with a list comprehension over the Excel files, where enumerate() is used to rename Col2 successively to df1, df2, and so on.
from functools import reduce
...
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df' + str(i)})
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df = reduce(lambda x, y: pd.merge(x, y, on=['Col1'], how='outer'), dfList)
# Col1 df1 df2
# 0 XYZ 41235.0 43782.0
# 1 OAIS 15123.0 NaN
# 2 ABC 48938.0 33347.0
# 3 KFJ NaN 21493.0
# 4 SHIZ NaN 31299.0
Alternatively, use pd.concat to outer join the data frames horizontally; for this you need to set Col1 as the index:
dfList = [pd.read_excel(xl).rename(columns={'Col2': 'df' + str(i)}).set_index('Col1')
          for i, xl in enumerate(["set1.xlsx", "set2.xlsx", "set3.xlsx",
                                  "set4.xlsx", "set5.xlsx", "set6.xlsx"], 1)]
df2 = pd.concat(dfList, axis=1, join='outer', copy=False).reset_index()
# Col1 df1 df2
# 0 ABC 48938.0 33347.0
# 1 KFJ NaN 21493.0
# 2 OAIS 15123.0 NaN
# 3 SHIZ NaN 31299.0
# 4 XYZ 41235.0 43782.0
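The concat approach can also be checked without the Excel files by building the question's two sample frames in memory (df3 through df6 are omitted here for brevity):

```python
import pandas as pd

# The question's sample data, renamed and indexed as described above.
frames = {
    "df1": pd.DataFrame({"Col1": ["XYZ", "OAIS", "ABC"],
                         "Col2": [41235, 15123, 48938]}),
    "df2": pd.DataFrame({"Col1": ["KFJ", "XYZ", "SHIZ", "ABC"],
                         "Col2": [21493, 43782, 31299, 33347]}),
}
dfList = [f.rename(columns={"Col2": name}).set_index("Col1")
          for name, f in frames.items()]

# Outer join on the Col1 index; identifiers missing from a frame get NaN.
out = pd.concat(dfList, axis=1, join="outer").reset_index()
print(out)
```

Note that concat aligns on the index, so each identifier appears exactly once, with one column per source frame.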