
Iterating xlsx files and removing unicode python openpyxl

I'm trying to convert all of the Excel files on my computer to CSV files (sheet by sheet). Some of the .xlsx files are massive (over 100 MB). I'm still having a couple of issues:
1. My function to remove non-unicode characters is very slow.
2. I'm not sure I'm using openpyxl's iteration properly: I'm still using a lot of memory, and I'm afraid that if I really let this thing run, it'll hit a memory error.
Also, I'm looking for any coding help in general, as I'm still very new to programming.

import csv
from formic import FileSet
from openpyxl import load_workbook
import re
from os.path import basename
import os
import string


def uclean(s):  # Clean out non-printable chars for csv.writer - SLOW
    try:
        return ''.join(char for char in s if char in string.printable).strip()
    except TypeError:  # cell.value may be None or a number
        return ''

def fclean(s):  # Keep only filename-safe (word) characters
    return ''.join(c for c in s if re.match(r'\w', c))

xlsx_files = FileSet(directory='C:\\', include='**\\*.xlsx')  # every .xlsx on the drive
for filename in xlsx_files:
    wb = load_workbook(filename, read_only=True)  # This is still using > 600 MB
    bf = os.path.splitext(basename(filename))[0]
    for sheet in wb.worksheets:
        sn = fclean(sheet.title)
        f = bf + '_' + sn + '.csv'
        if not os.path.exists(f):
            with open(f, 'w', newline='') as outf:
                out_writer = csv.writer(outf)
                for row in sheet.iter_rows():
                    out_writer.writerow([uclean(cell.value) for cell in row])
    wb.close()  # read-only mode keeps the file handle open until closed

Using encode will be a lot faster:

#lines is some French text
In [80]: %timeit [s.encode('ascii', errors='ignore').strip() for s in lines]
10000 loops, best of 3: 15.3 µs per loop

In [81]: %timeit [uclean(s) for s in lines]                          
1000 loops, best of 3: 522 µs per loop
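To make the comparison concrete, here is a minimal sketch of the two approaches side by side. The name uclean_fast is made up for illustration; it assumes cell values may also be None or numeric, as they can be when read from a spreadsheet:

```python
import string

def uclean(s):
    # Original approach: per-character membership test -- slow
    try:
        return ''.join(char for char in s if char in string.printable).strip()
    except TypeError:  # None or a number
        return ''

def uclean_fast(s):
    # Encode to ASCII, silently dropping unencodable characters,
    # in a single C-level pass instead of a Python-level loop.
    try:
        return s.encode('ascii', errors='ignore').decode('ascii').strip()
    except AttributeError:
        # cell.value may be None, an int, a float, or a date
        return '' if s is None else str(s)
```

Both drop the accented character the same way, e.g. uclean_fast('café au lait') gives 'caf au lait', but the encode version does it in one call rather than testing each character against string.printable.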

As for your openpyxl question, I'll have to get back to you -- the only thing I can think of right now is that it might be possible to load just one worksheet at a time rather than the whole workbook. Keep in mind that since wb is local to the loop, it's replaced with a new object on each iteration, so it's not as if you're going to use an additional 600 MB of memory for each file.

Read-only mode really does read cells one at a time, so memory use is minimal. However, given that you want to convert all the text to ASCII, I wonder whether the reason is that there is a lot of text in the Excel files. Excel employs an optimisation where it stores all strings in one big list which cells reference. If you have a lot of unique strings, it is possible that these are the root of any memory issues, as we have to keep them all in memory in order to be able to read them.
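The cell-at-a-time streaming can be sketched like this. This is a self-contained demo: the temp-file path and the workbook contents are made up for illustration, and iter_rows(values_only=True) yields plain value tuples instead of cell objects:

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

path = os.path.join(tempfile.gettempdir(), 'demo_stream.xlsx')

# Build a tiny workbook to stream from
wb = Workbook()
ws = wb.active
ws.append(['name', 'qty'])
ws.append(['widget', 3])
wb.save(path)

# read_only=True streams rows lazily rather than building the full cell grid
wb = load_workbook(path, read_only=True)
rows = []
for sheet in wb.worksheets:
    for row in sheet.iter_rows(values_only=True):
        rows.append(row)
wb.close()  # read-only workbooks keep the file handle open until closed

print(rows)
```

Only the shared-string table is held in memory up front; the row data itself is parsed on demand as you iterate.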

Regarding conversion: you can probably use a wrapper to save as UTF-8, and so remove the need for any inline encoding work whatsoever.
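One possible sketch of that wrapper idea in Python 3, where open() accepts an encoding directly (the sample rows and output path are made up for illustration). With UTF-8 output there is nothing to strip, since UTF-8 can encode any str; only None needs mapping to an empty cell:

```python
import csv
import os
import tempfile

# Stand-in for rows pulled from a worksheet; values may be str, numbers, or None
rows = [['café', 'naïve', 42], [None, 'plain', 3.14]]

path = os.path.join(tempfile.gettempdir(), 'demo_utf8.csv')
# newline='' is required by the csv module so it controls line endings itself
with open(path, 'w', newline='', encoding='utf-8') as outf:
    writer = csv.writer(outf)
    for row in rows:
        writer.writerow(['' if v is None else v for v in row])
```

This keeps the accented characters intact in the output instead of discarding them, which is usually what you want unless a downstream consumer genuinely requires ASCII.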
