Appending all .xls files in a folder into one .csv file using Python

I have a folder with multiple .xls files that I want to combine (append) into one CSV file.

I have a script ready for it, but the issue I am facing is that the output does not match the layout of the input: if a column name differs in any of the files, the output gets a new column, e.g.:

 Input1                      Input2

col1  col2   col3            col1 col2 Col3
a1    a2     a3              b1   b2   b3

Output shown                    Output Expected

col1 col2 col3 Col3             col1 col2 col3
a1   a2   a3   NAN              a1   a2   a3
b1   b2   NAN  b3               b1   b2   b3
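
The behaviour can be reproduced without any files. Below is a minimal sketch, assuming pandas is installed; the frames `input1` and `input2` are made-up stand-ins for the two spreadsheets above. Concatenation matches columns by label, and labels are case-sensitive, so `col3` and `Col3` end up as separate columns:

```python
import pandas as pd

# Hypothetical stand-ins for the two input files shown above.
input1 = pd.DataFrame({"col1": ["a1"], "col2": ["a2"], "col3": ["a3"]})
input2 = pd.DataFrame({"col1": ["b1"], "col2": ["b2"], "Col3": ["b3"]})

# Columns are matched by label; "col3" and "Col3" are different labels,
# so each frame gets NaN in the column it does not have.
shown = pd.concat([input1, input2], ignore_index=True)
print(shown)
```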

My current script:

#******************************************************************************************#
#*                        IMPORTING LIBRARIES                                             *#
#******************************************************************************************#

import os
import pandas as pd
import time 

start_time = time.time()

#******************************************************************************************#
#             MERGING ALL THE FILES INTO ONE DATAFRAME                                    *#
#******************************************************************************************# 

input_path = os.getcwd()
files = os.listdir(input_path)
files

#******************************************************************************************#
#                        PICKING OUT ALL THE .XLS FILES                                   *#
#******************************************************************************************# 

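# f[-3:] is always three characters long, so it can never equal '.xlsx' or '.csv';
# match on the full extension instead (read_excel cannot read .csv files anyway)
files_xls = [f for f in files if f.lower().endswith(('.xls', '.xlsx'))]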
files_xls

#******************************************************************************************#
#                    APPENDING THE FILES INTO ONE DATAFRAME                               *#
#******************************************************************************************# 

# Collect each file's data, then concatenate once
# (DataFrame.append was removed in pandas 2.0)
frames = []

for f in files_xls:
    data = pd.read_excel(f)
    frames.append(data)

master = pd.concat(frames, ignore_index=True)

#******************************************************************************************#
#                   EXPORTING THE FILE INTO INTERIM LOCATION                              *#
#******************************************************************************************# 

# Join with os.path.join so the file lands inside input_path, not next to it
master.to_csv(os.path.join(input_path, 'master.csv'))


# Printing time taken in seconds
print("--- %s seconds ---" % (time.time() - start_time))    

In case you need the input, you can find it here (download any file):

https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009     

Try loading the files without a header and applying a custom one. It should look something along the lines of this:

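frames = []
for f in files_xls:
    # header=None ignores each file's own column names;
    # skiprows=1 drops the original header row so it is not read as data
    data = pd.read_excel(f, header=None, skiprows=1)
    frames.append(data)
master = pd.concat(frames, ignore_index=True)
master.columns = ['col1', 'col2', 'col3']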
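
If the column names differ only in case or stray whitespace, another option is to normalise the labels before concatenating instead of discarding the headers. This is a different technique from the one above, shown only as a sketch: the helper name `normalise_columns` and the sample frames are hypothetical, and pandas is assumed to be installed.

```python
import pandas as pd


def normalise_columns(frame):
    """Return a copy of frame with stripped, lower-cased column labels."""
    out = frame.copy()
    out.columns = [str(c).strip().lower() for c in out.columns]
    return out


# Hypothetical stand-ins for the data read from two spreadsheets.
input1 = pd.DataFrame({"col1": ["a1"], "col2": ["a2"], "col3": ["a3"]})
input2 = pd.DataFrame({"col1": ["b1"], "col2": ["b2"], "Col3 ": ["b3"]})

# After normalisation both frames share the same labels, so concat lines them up.
frames = [normalise_columns(f) for f in (input1, input2)]
master = pd.concat(frames, ignore_index=True)
print(master)
```

In the loop over the files, `normalise_columns(pd.read_excel(f))` would take the place of the bare `pd.read_excel(f)` call. Unlike assigning names by position, this still relies on every file using the same column names, but it does not depend on the columns appearing in the same order.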
