Creating a dataframe from list of Excel files to include all columns and data even if column name is duplicated in other Excel files - Python/Pandas
Goal: Create a dataframe from all Excel files in a folder, keeping each file's columns separate even when a column header duplicates one from a previously read Excel file. See the tables below for illustration.

Problem: Let's say File1.xlsx has column1 and column2 as column headers, and File2.xlsx has column2 and column3 as column headers. The code I currently have "combines" column2 from File1.xlsx and File2.xlsx into a single column. I was wondering if it's possible to create a dataframe that keeps a separate set of columns for each file. I have created tables below to better demonstrate what I am looking for.

File1.xlsx: column1, column2
File2.xlsx: column2, column3
File3.xlsx: column1, column3

Note: All the Excel files are in the folder in alphabetical/numerical order.

    import os
    import pandas as pd

    ### Reading only the Excel files in the folder
    # Assumption: `files` lists the current working directory, in sorted order
    files = sorted(os.listdir("."))
    FileList_xlsx = [f for f in files if f.endswith(".xlsx")]
    print(FileList_xlsx) # Prints list of Excel files in the folder
    ### Initializing Data Frame
    df = pd.DataFrame()
    ### Read Excel files into Python
    for f in FileList_xlsx:
        test_df = pd.read_excel(f)
        # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
        df = pd.concat([df, test_df], ignore_index=True, sort=False)
    ### Code above only gives 3 columns in the dataframe (as shown in the example below),
    ### when I want 6 columns, even if they're duplicated in another Excel file in the folder
    ### I'm reading from.

What I'm getting:

| column1 | column2 | column3 |
|---|---|---|
| File 1 | File 1 | |
| File 1 | File 1 | |
| | File 2 | File 2 |
| | File 2 | File 2 |
| File 3 | | File 3 |
| File 3 | | File 3 |

What I want:

| column1 | column2 | column2 | column3 | column1 | column3 |
|---|---|---|---|---|---|
| File 1 | File 1 | | | | |
| File 1 | File 1 | | | | |
| | | File 2 | File 2 | | |
| | | File 2 | File 2 | | |
| | | | | File 3 | File 3 |
| | | | | File 3 | File 3 |

Tip: To keep things simple, each data cell in the tables shows the file its value is supposed to come from.

Bonus: If you can also help with adding the option of populating the rest of the empty cells with NAs, that would be useful too!

Let me know if you're having trouble understanding my question, and I'll do my best to clarify.

You could use df.add_prefix() to add a prefix to each column in test_df as you load it, and then append the result to df:

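    # Assumes pandas is imported as pd, df starts as pd.DataFrame(),
    # and FileList_xlsx is the list of Excel files from the question
    for n, f in enumerate(FileList_xlsx, start=1):
        test_df = pd.read_excel(f)
        # Prefix every column with its file number so names never collide
        test_df = test_df.add_prefix(f'File{n}.')
        # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
        df = pd.concat([df, test_df], ignore_index=True, sort=False)
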
This will give you a unique column for each column you load. Empty cells will be np.NaN.

File1.column1 | File1.column2 | File2.column1 | File2.column2...
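
To see the prefixing approach on the three example files without needing real workbooks, here is a minimal sketch. The in-memory frames are hypothetical stand-ins for what pd.read_excel would return, and the "NA" fill marker is only an assumption about what the bonus request means; it is not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-ins for pd.read_excel(f) on File1.xlsx, File2.xlsx, File3.xlsx
frames = [
    pd.DataFrame({"column1": ["File 1", "File 1"], "column2": ["File 1", "File 1"]}),
    pd.DataFrame({"column2": ["File 2", "File 2"], "column3": ["File 2", "File 2"]}),
    pd.DataFrame({"column1": ["File 3", "File 3"], "column3": ["File 3", "File 3"]}),
]

# Prefix each frame's columns with its file number so no two frames share a name
prefixed = [frame.add_prefix(f"File{n}.") for n, frame in enumerate(frames, start=1)]

# Stack all frames in one call; columns missing from a frame become NaN
df = pd.concat(prefixed, ignore_index=True, sort=False)

print(list(df.columns))       # one column per (file, header) pair
print(df.isna().sum().sum())  # total number of empty (NaN) cells

# Optional, for the bonus: replace NaN with a visible marker (assumed to be "NA")
df_filled = df.fillna("NA")
print(df_filled)
```

Collecting the prefixed frames in a list and calling pd.concat once also avoids re-copying the growing dataframe on every loop iteration, which can matter when the folder holds many files.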