
MemoryError while converting txt to xlsx


My code is:

    import csv
    import openpyxl

    import sys


    def convert(input_path, output_path):
        """
        Read a csv file (with no quoting), and save its contents in an excel file.
        """
        wb = openpyxl.Workbook()
        ws = wb.worksheets[0]

        with open(input_path) as f:
            reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
            for row_index, row in enumerate(reader, 1):
                for col_index, value in enumerate(row, 1):
                    ws.cell(row=row_index, column=col_index).value = value
        print 'hello world'

        wb.save(output_path)

        print 'hello world2'


    def main():
        try:
            input_path, output_path = sys.argv[1:]
        except ValueError:
            print 'Usage: python %s input_path output_path' % (sys.argv[0],)
        else:
            convert(input_path, output_path)


    if __name__ == '__main__':
        main()

This code works, except for some input files. I couldn't find what the difference is between the input txt files that cause this problem and those that don't.

My first guess was encoding. I tried changing the encoding of the input file to UTF-8 and to UTF-8 with BOM, but it still failed.

My second guess was that it literally used too much memory. But my computer has an SSD and 32 GB of RAM.

So perhaps this code is not fully utilizing the capacity of this RAM?

How can I fix this?

Edit: I added the lines print 'hello world' and print 'hello world2' to check whether everything before 'hello world' runs correctly.

I checked that the code prints 'hello world', but not 'hello world2'.

So it really seems likely that wb.save(output_path) is causing the problem.

openpyxl has optimised modes for reading and writing large files. wb = Workbook(write_only=True) will enable this.

I'd also recommend that you install lxml for speed. This is all covered in the documentation.
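As a rough illustration of the write-only mode mentioned above, here is a minimal sketch that streams the tab-delimited file into the workbook one row at a time. It assumes Python 3 and an installed openpyxl; the names iter_rows and convert_streaming are hypothetical helpers made up for this example, not part of openpyxl.

```python
import csv


def iter_rows(input_path, delimiter='\t'):
    """Yield one row (a list of strings) at a time from a delimited text file."""
    with open(input_path, newline='') as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


def convert_streaming(input_path, output_path):
    """Copy the rows into an xlsx file using openpyxl's write-only mode."""
    # Imported here so iter_rows() stays usable where openpyxl is not installed.
    from openpyxl import Workbook

    wb = Workbook(write_only=True)  # write-only mode: rows are streamed out
    ws = wb.create_sheet()          # a write-only workbook starts with no sheet
    for row in iter_rows(input_path):
        ws.append(row)              # rows can only be appended, in order
    wb.save(output_path)
```

Because the rows are never collected into a list, only the row currently being read needs to be held by this code. Calling convert_streaming('input.txt', 'output.xlsx') would mirror the convert() function from the question; both file names are placeholders.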

Below are three alternatives:

RANGE FOR LOOP

Possibly, the two enumerate() calls carry a memory footprint, since indexing must occur in a nested for loop. Consider reading the csv.reader content into a list (which is subscriptable) and using range(). Admittedly, even this may not be efficient: in Python 2, each range() call builds its own list in memory (unlike xrange). In Python 3, range() is lazy, but the list of rows still holds the whole file in memory.

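    import csv

    with open(input_path) as f:
        # Tab-delimited and unquoted, as in the question's input files
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        # Load every row into a list so it can be indexed
        row = list(reader)

    # openpyxl cell indices are 1-based, so offset both counters
    for i in range(len(row)):
        for j in range(len(row[i])):
            ws.cell(row=i + 1, column=j + 1).value = row[i][j]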
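To check how much memory range() and enumerate() themselves take, a quick measurement along these lines may help. This is only a sketch, assuming Python 3 on CPython; sys.getsizeof reports the size of the container object alone, not of the integers it refers to.

```python
import sys

n = 10**6

lazy = range(n)            # Python 3: computes values on demand
materialized = list(lazy)  # a real list holding n references

range_size = sys.getsizeof(lazy)
list_size = sys.getsizeof(materialized)
print('range object:', range_size, 'bytes')
print('list object: ', list_size, 'bytes')

# enumerate() is lazy as well: it yields one (index, item) pair at a time
pairs = enumerate(['a', 'b', 'c'], 1)
first_pair = next(pairs)
print(first_pair)
```

On Python 3 the range object should come out far smaller than the list, which suggests that the cells created in the worksheet, rather than the loop counters, are where the memory goes.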

OPTIMIZED WRITER

OpenPyXL even warns that merely scrolling through cells, even without assigning values, creates them and keeps them in memory. As a solution, you can use the Optimized Writer with the row list built from csv.reader above. This route appends entire rows to a write-only workbook instance:

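    from openpyxl import Workbook

    wb = Workbook(write_only=True)
    ws = wb.create_sheet()

    # Append each row of the `row` list built above, converting values to strings
    for irow in row:
        ws.append(['%s' % value for value in irow])

    # Raw string so the backslashes in the Windows path are taken literally
    wb.save(r'C:\Path\To\Outputfile.xlsx')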

WIN32COM LIBRARY

Finally, consider using the win32com library (part of the third-party pywin32 package, not built in), where you open the csv in Excel and save it as an xlsx or xls workbook. Do note this package is only for Windows installations of Python, and Excel itself must be installed.

    import os
    import win32com.client as win32

    excel = win32.Dispatch('Excel.Application')
    excel.Visible = False

    # OPEN CSV DIRECTLY INSIDE EXCEL
    # (absolute path, since Excel does not use Python's working directory)
    wb = excel.Workbooks.Open(os.path.abspath(input_path))
    outxl = r'C:\Path\To\Outputfile.xlsx'

    # SAVE EXCEL AS xlOpenXMLWorkbook TYPE (51)
    wb.SaveAs(outxl, FileFormat=51)
    wb.Close(False)
    excel.Quit()

Here are a few points you can consider:

  1. Check the /tmp folder, the default folder where temporary files are created;
  2. Your code is likely using up all the space in that folder. Either increase the folder's size or change the temporary-file path when creating the workbook;
  3. I used the in_memory option for my task and it worked.

Below is my code:

    #!/usr/bin/python
    import os
    import sys
    from xlsxwriter.workbook import Workbook

    file_name = sys.argv[1]
    output_name = os.path.splitext(file_name)[0] + '.xlsx'

    try:
        # in_memory=True assembles the file in RAM instead of using temp files
        workbook = Workbook(output_name, {'in_memory': True})
        worksheet = workbook.add_worksheet()
        workbook.use_zip64()  # allow output files larger than 4 GB
        with open(file_name, mode='r') as f:
            # XlsxWriter rows are 0-based; data starts at index 1,
            # so the first row of the sheet is left empty
            for row_index, line in enumerate(f, 1):
                fields = line.split("\001")  # fields are separated by \x01
                for col_index, field in enumerate(fields):
                    worksheet.write(row_index, col_index, field.strip())
        workbook.close()
        print('success')
    except ValueError:
        print('failure')