I have a folder with multiple .xls files that I want to combine (append) into one CSV file.
I have a script for this, but the output columns do not line up with the input: if a column name differs in any of the files (for example Col3 instead of col3), the output gets an extra column. For example:
Input1                        Input2
col1  col2  col3              col1  col2  Col3
a1    a2    a3                b1    b2    b3

Output shown                  Output expected
col1  col2  col3  Col3        col1  col2  col3
a1    a2    a3    NAN         a1    a2    a3
b1    b2    NAN   b3          b1    b2    b3
My current script:
#******************************************************************************************#
#* IMPORTING LIBRARIES *#
#******************************************************************************************#
import os as os
import pandas as pd
import time
start_time = time.time()
#******************************************************************************************#
# MERGING ALL THE FILES INTO ONE DATAFRAME *#
#******************************************************************************************#
input_path = os.getcwd()
files = os.listdir(input_path)
files
#******************************************************************************************#
# PICKING OUT ALL THE .XLS FILES *#
#******************************************************************************************#
# Match on the full extension; a 3-character slice can never equal '.xlsx' or '.csv'
files_xls = [f for f in files if f.lower().endswith(('.xls', '.xlsx'))]
files_xls
#******************************************************************************************#
# APPENDING THE FILES INTO ONE DATAFRAME *#
#******************************************************************************************#
#Initializing one empty dataframe
master = pd.DataFrame()
for f in files_xls:
    data = pd.read_excel(f)
    master = master.append(data, ignore_index=True)
#******************************************************************************************#
# EXPORTING THE FILE INTO INTERIM LOCATION *#
#******************************************************************************************#
master.to_csv(os.path.join(input_path, 'master.csv'))
# Printing time taken in seconds
print("--- %s seconds ---" % (time.time() - start_time))
In case you need the input files, you can find them here (download any file):
https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009
Try loading the files without a header row and then applying a custom one. It should look something along the lines of this:
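for f in files_xls:
    # Skip each file's own header row so it is not read in as a data row
    data = pd.read_excel(f, header=None, skiprows=1)
    master = master.append(data, ignore_index=True)
# Apply one consistent set of column names to the combined frame
master.columns = ['col1', 'col2', 'col3']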
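If the column names differ only in case or stray whitespace, as in the col3 / Col3 example from the question, another option is to normalize the names before combining instead of discarding the headers. The following is a minimal sketch under that assumption, not a definitive fix. The helper names normalize_columns and combine_frames are hypothetical, and the two small frames stand in for the real files. It uses pd.concat rather than DataFrame.append, since append was removed in pandas 2.0.

```python
import pandas as pd


def normalize_columns(df):
    """Return a copy whose column names are stripped and lower-cased."""
    out = df.copy()
    out.columns = [str(c).strip().lower() for c in out.columns]
    return out


def combine_frames(frames):
    """Stack DataFrames row-wise after normalizing their column names."""
    return pd.concat([normalize_columns(f) for f in frames], ignore_index=True)


# Hypothetical stand-ins for the two input files in the question.
input1 = pd.DataFrame({"col1": ["a1"], "col2": ["a2"], "col3": ["a3"]})
input2 = pd.DataFrame({"col1": ["b1"], "col2": ["b2"], "Col3": ["b3"]})

master = combine_frames([input1, input2])
print(master)
```

With real files, the list passed to combine_frames could be built as [pd.read_excel(f) for f in files_xls]. This only helps when the names match after lower-casing and stripping; columns that are genuinely named differently would still need an explicit rename or the custom header approach above.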