简体   繁体   中英

Python script to read multiple excel files in one directory and convert them to .csv files in another directory

I am relatively new to python and Stackoverflow but hoping anyone can shed some light of my current problem. I have a python script that takes excel files (.xls, and .xlsx) from one directory and converts them to .csv files to another directory. It works perfectly fine on my sample excel files (consisted of 4 columns and 1 row for the purpose of testing), but when I try to run my script against a different directory that has excel files (alot larger in file size) I am getting an assertion error. I have attached my code and the error. Looking forward to have some guidance on this problem. Thanks!

import os
import pandas as pd

source = "C:/.../TestFolder"
output = "C:/.../OutputCSV"

dir_list = os.listdir(source)

os.chdir(source)

for i in range(len(dir_list)):
    filename = dir_list[i]
    book = pd.ExcelFile(filename)

    #writing to csv
    if filename.endswith('.xlsx') or filename.endswith('.xls'):
        for i in range(len(book.sheet_names)):
            df = pd.read_excel(book, book.sheet_names[i])

            os.chdir(output)

            new_name = filename.split('.')[0] + str(book.sheet_names[i])+'.csv'
            df.to_csv(new_name, index = False)

        os.chdir(source)

print "New files: ", os.listdir(output)

我在终端中执行此脚本时收到的错误消息显示如下:

Since you use Windows, consider the Jet/ACE SQL engine (Windows .dll files) to query Excel workbooks and export to CSV files, bypassing needs to load/export with pandas dataframes.

Specifically, use pyodbc to make the ODBC connection to Excel files, iterate through each sheet and export to csv files using SELECT * INTO ... SQL action query. The openpyxl module is used to retrieve sheet names. Below script does not rely on relative paths so can be run from anywhere. It is assumed each Excel file has complete header columns (no missing cells in used range of top row).

import os
import pyodbc
from openpyxl import load_workbook

source = "C:/Path/To/TestFolder"
output = "C:/Path/To/OutputCSV"

dir_list = os.listdir(source)

for xlfile in dir_list:
    strfile = os.path.join(source, xlfile)

    if strfile.endswith('.xlsx') or strfile.endswith('.xls'):
        # CONNECT TO WORKBOOK
        conn = pyodbc.connect(r'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};' + \
                               'DBQ={};'.format(strfile), autocommit=True)
        # RETRIEVE WORKBOOK SHEETS
        sheets = load_workbook(filename = strfile, use_iterators = True).get_sheet_names()

        # ITERATIVELY EXPORT SHEETS TO CSV IN OUTPUT FOLDER
        for s in sheets:
            outfile = os.path.join(output, '{0}_{1}.csv'.format(xlfile.split('.')[0], s))
            if os.path.exists(outfile): os.remove(outfile)

            strSQL = " SELECT * " + \
                     " INTO [text;HDR=Yes;Database={0};CharacterSet=65001].[{1}]" + \
                     " FROM [{2}$]"            
            conn.execute(strSQL.format(output, os.path.basename(outfile, s))
        conn.close()

**Note: this process creates a schema.ini file that concatenates with each iteration. Can be deleted.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM