
MemoryError while converting txt to xlsx

My code is

    import csv
    import openpyxl

    import sys


    def convert(input_path, output_path):
        """
        Read a csv file (with no quoting), and save its contents in an excel file.
        """
        wb = openpyxl.Workbook()
        ws = wb.worksheets[0]

        with open(input_path) as f:
            reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
            for row_index, row in enumerate(reader, 1):
                for col_index, value in enumerate(row, 1):
                    ws.cell(row=row_index, column=col_index).value = value
        print 'hello world'

        wb.save(output_path)

        print 'hello world2'


    def main():
        try:
            input_path, output_path = sys.argv[1:]
        except ValueError:
            print 'Usage: python %s input_path output_path' % (sys.argv[0],)
        else:
            convert(input_path, output_path)


    if __name__ == '__main__':
        main()

This code works, except for some input files. I couldn't find what the difference is between the input txt that causes this problem and input txt that doesn't.

My first guess was encoding. I tried changing the encoding of the input file to UTF-8 and UTF-8 with BOM. But this failed.

My second guess was that it simply uses too much memory. But my computer has an SSD and 32 GB of RAM.

So perhaps this code is not fully utilizing the capacity of this RAM?

How can I fix this?

Edit: I added the lines print 'hello world' and print 'hello world2' to check whether everything before 'hello world' runs correctly.

I checked: the code prints 'hello world', but not 'hello world2'.

So it seems likely that wb.save(output_path) is causing the problem.

openpyxl has optimised modes for reading and writing large files. wb = Workbook(write_only=True) will enable this.

I'd also recommend that you install lxml for speed. This is all covered in the documentation.
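
As a rough illustration of the write-only mode mentioned above, here is a minimal sketch of the question's convert() function rewritten to stream rows. It assumes Python 3 and a tab-delimited input with no quoting, as in the question; it is one way to apply the suggestion, not necessarily the exact code the answerer had in mind.

```python
import csv

from openpyxl import Workbook


def convert(input_path, output_path):
    """Stream a tab-delimited text file into an xlsx file, row by row."""
    # write_only=True makes openpyxl stream rows out instead of
    # keeping a cell object in memory for every value.
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()

    with open(input_path, newline='') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            # A write-only sheet has no ws.cell(); whole rows are appended.
            ws.append(row)

    wb.save(output_path)
```

It would be called the same way as the original, for example convert('input.txt', 'output.xlsx'), where both file names are placeholders. All values are still written as text, as in the original code.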

Below are three alternatives:

RANGE FOR LOOP

Possibly the two enumerate() calls have a memory footprint, as indexing must occur in a nested for loop. Consider reading the csv.reader content into a list (which is subscriptable) and using range(). Admittedly this may not be more efficient: the list holds the whole file in memory, and in Python 2 each range() call builds its own list as well (xrange is the lazy version), whereas in Python 3 range() is lazy.

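with open(input_path) as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)

    # Read every row into a list so it can be indexed
    row = []
    for data in reader:
        row.append(data)

# openpyxl rows and columns are 1-based, so shift the 0-based indices
for i in range(len(row)):
    for j in range(len(row[i])):
        ws.cell(row=i + 1, column=j + 1).value = row[i][j]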
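
As a rough check of how much memory the loop helpers themselves take, the following sketch (assuming Python 3 on CPython) compares the size of a range object and an enumerate object with the size of a fully built list. The variable names are made up for the illustration.

```python
import sys

n = 10**6

# range() and enumerate() are lazy in Python 3: the objects stay small
# no matter how many items they will eventually yield.
lazy_range_size = sys.getsizeof(range(n))
enumerate_size = sys.getsizeof(enumerate([]))

# A real list needs at least one pointer per element
# (getsizeof does not even count the elements themselves).
full_list_size = sys.getsizeof(list(range(n)))

print("range object:     %d bytes" % lazy_range_size)
print("enumerate object: %d bytes" % enumerate_size)
print("list of %d items: %d bytes" % (n, full_list_size))
```

If the numbers come out as expected, the helpers are negligible and the memory goes to the data itself: the list of rows and the cell objects that openpyxl creates.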

OPTIMIZED WRITER

OpenPyXL even warns that scrolling through cells, even without assigning values, creates them in memory. As a solution, you can use the Optimized Writer with the row list built above from csv.reader. This route appends entire rows to a write-only workbook instance:

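from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet()

# Append each row in one call; write-only sheets have no per-cell access
for irow in row:
    ws.append(['%s' % j for j in irow])

# Raw string so the backslashes are not treated as escapes
wb.save(r'C:\Path\To\Outputfile.xlsx')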

WIN32COM LIBRARY

Finally, consider using the win32com library (part of the third-party pywin32 package, not the standard library), where you open the csv in Excel and save it as an xlsx or xls workbook. Note that this works only on Windows with Excel installed.

import win32com.client as win32

excel = win32.Dispatch('Excel.Application')

# OPEN CSV DIRECTLY INSIDE EXCEL
wb = excel.Workbooks.Open(input_path)
excel.Visible = False
outxl=r'C:\Path\To\Outputfile.xlsx'

# SAVE EXCEL AS xlOpenXMLWorkbook TYPE (51)
wb.SaveAs(outxl, FileFormat=51)
wb.Close(False)
excel.Quit()

Here are a few points to consider:

  1. Check the /tmp folder, the default location where temporary files are created.
  2. Your code may be using up all the space in that folder. Either increase its size or change the temp file path when creating the workbook.
  3. I used the in-memory mode for my task and it worked.
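
The first two points can be checked with a short standard-library sketch like the one below, which reports how much free space the default temporary directory has. It assumes Python 3; the helper name temp_dir_free_bytes is made up for the illustration.

```python
import shutil
import tempfile


def temp_dir_free_bytes():
    """Return the default temp directory and the free bytes on its volume."""
    tmp = tempfile.gettempdir()      # /tmp on most Linux systems
    usage = shutil.disk_usage(tmp)   # named tuple: total, used, free
    return tmp, usage.free


tmp, free = temp_dir_free_bytes()
print("%s has %.1f GiB free" % (tmp, free / 2**30))
```

If the free space is small compared with the input file, XlsxWriter's tmpdir constructor option can point the temporary files at another directory, for example Workbook(path, {'tmpdir': other_dir}), where path and other_dir are placeholders.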

Below is my code:

#!/usr/bin/python
import os
import sys
import traceback
from xlsxwriter.workbook import Workbook


fileNames = sys.argv[1]

try:
    # in_memory keeps the workbook data in RAM instead of temp files
    workbook = Workbook(os.path.splitext(fileNames)[0] + '.xlsx', {'in_memory': True})
    worksheet = workbook.add_worksheet()
    workbook.use_zip64()
    with open(fileNames, mode='r') as f:
        # XlsxWriter rows and columns are zero-indexed
        for rowCnt, line in enumerate(f):
            # Fields are separated by the \001 control character
            row = line.split("\001")
            for j in range(len(row)):
                worksheet.write(rowCnt, j, row[j].strip())
    workbook.close()
    print('success')
except Exception:
    # Show what went wrong instead of silently reporting failure
    traceback.print_exc()
    print('failure')

