Create a dataframe from a list of Excel files that includes all columns and data, even if column names are repeated in other Excel files - Python/Pandas

Question

Goal: Create a dataframe from all Excel files in a folder, keeping a separate set of columns for each file even when a column header already appeared in a previously read Excel file. See the tables below for illustration.

Problem: Let's say File1.xlsx has column1 and column2 as column headers, and File2.xlsx has column2 and column3. The code I currently have "combines" column2 from File1.xlsx and File2.xlsx into a single column. I was wondering if it's possible to create a dataframe that keeps each file's columns separate. I have created tables below to better demonstrate what I am looking for.

File1.xlsx: column1, column2

File2.xlsx: column2, column3

File3.xlsx: column1, column3

Note: All the Excel files are in the folder in alphabetical/numerical order.

import os
import pandas as pd

### Reading only the Excel files in the folder
files = sorted(os.listdir("."))  # assumption: the folder is the current working directory
FileList_xlsx = [f for f in files if f.endswith(".xlsx")]
print(FileList_xlsx) # Prints list of Excel files in the folder

### Read Excel files into Python and stack them row-wise
### (df.append was removed in pandas 2.0; pd.concat is the equivalent)
frames = [pd.read_excel(f) for f in FileList_xlsx]
df = pd.concat(frames, ignore_index=True, sort=False)

### Code above only gives 3 columns in the dataframe (as shown in the example below),
### when I want 6 columns, even if they're duplicated in another Excel file in the folder
### I'm reading from.

What I'm getting:

column1	column2	column3
File 1	File 1
File 1	File 1
	File 2	File 2
	File 2	File 2
File 3		File 3
File 3		File 3
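The merging behaviour shown in the table above can be reproduced without any Excel files. The following is a minimal sketch that uses small in-memory DataFrames as hypothetical stand-ins for the three workbooks (the `frames` dictionary and its values are assumptions made for illustration, not part of the original code):

```python
import pandas as pd

# Hypothetical stand-ins for File1.xlsx, File2.xlsx and File3.xlsx.
# Each cell holds the name of the file it came from.
frames = {
    "File1": pd.DataFrame({"column1": ["File 1", "File 1"], "column2": ["File 1", "File 1"]}),
    "File2": pd.DataFrame({"column2": ["File 2", "File 2"], "column3": ["File 2", "File 2"]}),
    "File3": pd.DataFrame({"column1": ["File 3", "File 3"], "column3": ["File 3", "File 3"]}),
}

# Row-wise concatenation aligns columns by name, so headers that are
# shared between files collapse into a single column.
stacked = pd.concat(list(frames.values()), ignore_index=True, sort=False)

print(stacked.columns.tolist())
print(stacked.shape)
```

Because the columns are matched by name, only the three distinct headers survive, which is why the result has 3 columns rather than 6.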

What I want:

column1	column2	column2	column3	column1	column3
File 1	File 1
File 1	File 1
		File 2	File 2
		File 2	File 2
				File 3	File 3
				File 3	File 3

Tip: To keep things simple, each data cell shows the name of the file it should come from.

Bonus: If you can also help with adding the option of populating the remaining empty cells with NAs, that would be useful too!
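On the bonus point, a small sketch may help. When frames with different columns are concatenated, pandas already marks the cells that have no source value as missing (NaN), and `fillna` can swap in a literal label if one is preferred. The two frames below are hypothetical stand-ins for the Excel files:

```python
import pandas as pd

# Hypothetical one-row stand-ins for two of the workbooks.
file1 = pd.DataFrame({"column1": ["File 1"], "column2": ["File 1"]})
file2 = pd.DataFrame({"column2": ["File 2"], "column3": ["File 2"]})

combined = pd.concat([file1, file2], ignore_index=True, sort=False)

# Cells with no source value are missing (NaN) after the concatenation.
missing_mask = combined.isna()

# Optionally replace the missing markers with the literal string "NA".
labelled = combined.fillna("NA")

print(missing_mask)
print(labelled)
```

Keeping the NaN markers is usually more convenient for later analysis; the string replacement is mainly useful for display or export.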

Let me know if you're having trouble understanding my question, and I'll do my best to clarify.

Answer 1

You could use df.add_prefix() to add a prefix to each column of test_df as you load it, and then append the result to df:

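frames = []
for n, f in enumerate(FileList_xlsx, start=1):
    test_df = pd.read_excel(f)
    # Prefix every column with the file number so names never collide
    frames.append(test_df.add_prefix(f'File{n}.'))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat(frames, ignore_index=True, sort=False)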

This will give you a unique column for each column you load. Empty cells will be NaN (np.nan).
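To check the prefix approach without touching the file system, here is a self-contained sketch. The helper name `combine_with_prefixes` and the in-memory frames are hypothetical and exist only for illustration; the pandas calls used are `DataFrame.add_prefix` and `pd.concat`:

```python
import pandas as pd


def combine_with_prefixes(named_frames):
    """Stack frames row-wise after prefixing each frame's columns with its label."""
    prefixed = [frame.add_prefix(f"{label}.") for label, frame in named_frames.items()]
    return pd.concat(prefixed, ignore_index=True, sort=False)


# Hypothetical stand-ins for the three workbooks from the question.
named_frames = {
    "File1": pd.DataFrame({"column1": ["File 1", "File 1"], "column2": ["File 1", "File 1"]}),
    "File2": pd.DataFrame({"column2": ["File 2", "File 2"], "column3": ["File 2", "File 2"]}),
    "File3": pd.DataFrame({"column1": ["File 3", "File 3"], "column3": ["File 3", "File 3"]}),
}

result = combine_with_prefixes(named_frames)

print(result.columns.tolist())
print(result.shape)
```

With real files, the dictionary could presumably be built from the file list, for example `{Path(f).stem: pd.read_excel(f) for f in FileList_xlsx}` with `Path` imported from `pathlib`, so that each prefix is the file name rather than a counter.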

Answer 1: 1 accepted, 2021-05-27 04:06:29
