
Creating a dataframe from a list of Excel files that includes all columns and data, even if a column name is duplicated in other Excel files - Python/Pandas

Goal: Create a dataframe from all Excel files in a folder, keeping every column from every file as its own column even when a column header duplicates one from a previously read Excel file. See the tables below for illustration.

Problem: Let's say File1.xlsx has column1 and column2 as column headers, and File2.xlsx has column2 and column3. The code I currently have "combines" column2 from File1.xlsx and File2.xlsx into a single column. I was wondering if it's possible to create a dataframe that keeps each file's columns separate. The tables below demonstrate what I am looking for.

File1.xlsx: column1, column2

File2.xlsx: column2, column3

File3.xlsx: column1, column3

Note: All the Excel files are in the folder in alphabetical/numerical order.
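Because the order in which the files are read decides the order of the resulting columns, it may help to sort the file list explicitly rather than rely on the order the operating system returns. A minimal sketch, assuming a hypothetical helper name `list_excel_files` and that the files use the `.xlsx` extension:

```python
from pathlib import Path


def list_excel_files(folder):
    """Return the .xlsx files in `folder`, sorted by file name.

    Hypothetical helper for illustration; it is not part of the original code.
    """
    folder = Path(folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".xlsx"
    )


if __name__ == "__main__":
    # Lists the Excel files in the current directory, if there are any.
    for path in list_excel_files("."):
        print(path.name)
```

The sort is plain lexicographic, so a name such as File10.xlsx would come before File2.xlsx; zero-padded numbers avoid that.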

import os
import pandas as pd

### Reading only the Excel files in the folder
files = sorted(os.listdir("."))  # assumption: the Excel files are in the current directory
FileList_xlsx = [f for f in files if f.endswith(".xlsx")]
print(FileList_xlsx)  # Prints list of Excel files in the folder

### Collecting one dataframe per file
frames = []

### Read Excel files into Python
for f in FileList_xlsx:
    test_df = pd.read_excel(f)
    frames.append(test_df)

### DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df = pd.concat(frames, ignore_index=True, sort=False)

### The code above gives only 3 columns in the dataframe (as shown in the example below),
### when I want 6 columns, even if they're duplicated in another Excel file in the folder
### I'm reading from.

What I'm getting:

column1 | column2 | column3
File 1  | File 1  |
File 1  | File 1  |
        | File 2  | File 2
        | File 2  | File 2
File 3  |         | File 3
File 3  |         | File 3
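The three-column result can be reproduced without any Excel files. The following is a minimal sketch that uses small in-memory dataframes as hypothetical stand-ins for File1.xlsx, File2.xlsx and File3.xlsx; it shows that row-wise concatenation aligns columns by name, which is why the duplicated headers end up sharing a column:

```python
import pandas as pd

# Hypothetical stand-ins for the three Excel files, two rows each.
file1 = pd.DataFrame({"column1": ["File 1"] * 2, "column2": ["File 1"] * 2})
file2 = pd.DataFrame({"column2": ["File 2"] * 2, "column3": ["File 2"] * 2})
file3 = pd.DataFrame({"column1": ["File 3"] * 2, "column3": ["File 3"] * 2})

# Stacking rows matches columns by label, so equal names share one column.
combined = pd.concat([file1, file2, file3], ignore_index=True, sort=False)

print(combined.columns.tolist())
print(combined.shape)
```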

What I want:

column1 | column2 | column2 | column3 | column1 | column3
File 1  | File 1  |         |         |         |
File 1  | File 1  |         |         |         |
        |         | File 2  | File 2  |         |
        |         | File 2  | File 2  |         |
        |         |         |         | File 3  | File 3
        |         |         |         | File 3  | File 3

Tip: To keep things simple, each data cell above is labeled with the file it is supposed to come from.

Bonus: If you can also help with an option for populating the remaining empty cells with NAs, that would be useful too!

Let me know if you're having trouble understanding my question, and I'll do my best to clarify.

You could use df.add_prefix() to add a prefix to each column in test_df as you load it, and then append the prefixed frame to the combined dataframe:

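frames = []
for n, f in enumerate(FileList_xlsx, start=1):
    test_df = pd.read_excel(f)
    # Prefix every column with its file number so the names never collide
    test_df = test_df.add_prefix(f'File{n}.')
    frames.append(test_df)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
df = pd.concat(frames, ignore_index=True, sort=False)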

This will give you a unique column for each column you load. Empty cells will be np.NaN.

File1.column1 | File1.column2 | File2.column2 | File2.column3 | File3.column1 | File3.column3
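To check the idea without touching the file system, here is a minimal sketch of the prefix approach on in-memory dataframes. The helper name `combine_with_prefixes` and the sample frames are assumptions for illustration; with real files the frames would come from pd.read_excel.

```python
import pandas as pd


def combine_with_prefixes(frames):
    """Stack frames row-wise after prefixing each one's columns with 'File<n>.'.

    Hypothetical helper for illustration; n is the 1-based position in `frames`.
    """
    prefixed = [
        frame.add_prefix(f"File{n}.")
        for n, frame in enumerate(frames, start=1)
    ]
    return pd.concat(prefixed, ignore_index=True, sort=False)


# Hypothetical stand-ins for File1.xlsx, File2.xlsx and File3.xlsx.
file1 = pd.DataFrame({"column1": ["File 1"] * 2, "column2": ["File 1"] * 2})
file2 = pd.DataFrame({"column2": ["File 2"] * 2, "column3": ["File 2"] * 2})
file3 = pd.DataFrame({"column1": ["File 3"] * 2, "column3": ["File 3"] * 2})

combined = combine_with_prefixes([file1, file2, file3])
print(combined.columns.tolist())

# Cells that no file supplied are NaN; fillna can swap in a literal "NA" string.
filled = combined.fillna("NA")
print(filled)
```

Leaving the gaps as NaN keeps them recognizable as missing values to pandas (isna, dropna and so on), so replacing them with the string "NA" is probably best done only for display or export.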


