
Multiple files with different columns using pandas

I have a large number of Excel files with different columns.

For example:

File 1:

Name | sale | Tips
-------------
sam  |  9   | 7
cham |  2   | 2

File 2:

Name | sale | Items
-------------------
mini |  6    | Tshirt
Lary |  3    | Hat

Output:

Name |  sale | Items
--------------------
sam  |  9    | NaN
cham |  2    | NaN
mini |  6    | Tshirt
Lary |  3    | Hat

I have 500 files to combine into one data set.

This code works to an extent, but only when all the columns are the same.

import pandas as pd
import glob,os
import numpy as np


inputFile = 'C:/Users/Desktop/test'

all_workbooks =glob.glob(os.path.join(inputFile,'*.xlsx'))

column_list = []
for files in all_workbooks:
    
    data= pd.read_excel(files,header =0,sheet_name='sheet1')
    column_list.append(data)
    stack_np = np.vstack(column_list)
    newData = pd.DataFrame(stack_np,columns=['Name','Sale'])

print(newData)

This code works if I have the same columns in all the files.

Can anyone help me with a solution for files whose columns differ or are in a different order?

You need to collect the dataframes and concatenate them after the loop. pd.concat aligns columns by name and fills missing values with NaN:

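all_dfs = []
wanted_columns = ['Name', 'sale', 'Items']
for files in all_workbooks:
    data = pd.read_excel(files, header=0, sheet_name='sheet1')
    # reindex keeps only the wanted columns and adds any missing one as NaN
    # (data[wanted_columns] would raise a KeyError when a column is absent);
    # skip this line to use all columns
    data = data.reindex(columns=wanted_columns)
    all_dfs.append(data)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
master_df = pd.concat(all_dfs, ignore_index=True)
del all_dfs, data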
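
To see the column alignment without any Excel files on disk, here is a minimal sketch that uses two small in-memory dataframes built from the example data in the question. The names file1, file2 and combined are placeholders for illustration only; in the real loop they would come from pd.read_excel.

```python
import pandas as pd

# Stand-ins for two workbooks with different columns (values from the question)
file1 = pd.DataFrame({'Name': ['sam', 'cham'], 'sale': [9, 2], 'Tips': [7, 2]})
file2 = pd.DataFrame({'Name': ['mini', 'Lary'], 'sale': [6, 3], 'Items': ['Tshirt', 'Hat']})

wanted_columns = ['Name', 'sale', 'Items']

# concat matches columns by name; a column missing from one frame is filled with NaN
combined = pd.concat([file1, file2], ignore_index=True)

# keep only the wanted columns, in the wanted order
master_df = combined.reindex(columns=wanted_columns)
print(master_df)
```

Selecting the columns once after the concat, rather than per file, is a design choice: it avoids repeating the selection 500 times, at the cost of holding the unwanted columns in memory until the end.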


 