Iterating over xlsx files and removing unicode (Python, openpyxl)

Question

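I'm trying to convert all of the Excel files on my computer to CSV files (sheet by sheet). Some of the .xlsx files are huge (over 100 MB). I still have a couple of problems:
1. My function for removing non-unicode characters is very slow.
2. I'm not sure I'm using openpyxl's iteration correctly: I'm still using a lot of memory, and I fear that if I really let this thing run it will hit a memory error.
Also, since I'm pretty new to coding in general, any general coding advice is welcome.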

import csv
from formic import FileSet
from openpyxl import load_workbook
import re
from os.path import basename
import os
import string


def uclean(s): # Clean out non-unicode chars for csv.writer - SLOW
    try:
        return ''.join(char for char in s if char in string.printable).strip()
    except:
        return ''

def fclean(s): # Clean out non-filename-safe chars
    return ''.join([c for c in s if re.match(r'\w', c)])

xlsx_files = FileSet(directory='C:\\', include='**\\*.xlsx') # the whole computer's excel files
for filename in xlsx_files:
    wb = load_workbook(filename, use_iterators=True, read_only=True)  # This is still using > 600 MBs
    for sheet in wb.worksheets:
        i = wb.worksheets.index(sheet)
        bf = os.path.splitext(
            basename(filename))[0]
        sn = fclean(str(wb.get_sheet_names()[i]))
        f = bf + '_' + sn + '.csv'
        if not os.path.exists(f):
            with open(f, 'wb') as outf:
                out_writer = csv.writer(outf)
                for row in sheet.iter_rows():
                    out_writer.writerow([uclean(cell.value) for cell in row])

Answer 1

Using encode will be a lot faster:

#lines is some French text
In [80]: %timeit [s.encode('ascii', errors='ignore').strip() for s in lines]
10000 loops, best of 3: 15.3 µs per loop

In [81]: %timeit [uclean(s) for s in lines]                          
1000 loops, best of 3: 522 µs per loop
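
A minimal sketch of how uclean from the question could be rebuilt around encode, assuming Python 3 (the question's code looks like Python 2). Two choices here are assumptions, not part of the original: None becomes an empty string, and non-string cell values such as numbers are passed through str() rather than being discarded. Note also that encode('ascii', errors='ignore') keeps ASCII control characters, which the string.printable filter would have dropped.

```python
def uclean(value):
    """Return value as stripped ASCII text, silently dropping non-ASCII characters."""
    if value is None:
        # Empty cells come back as None; write them as empty fields (assumption).
        return ''
    # Numbers, dates and so on are converted to text instead of being discarded (assumption).
    text = value if isinstance(value, str) else str(value)
    # encode() drops anything outside ASCII; decode() turns the bytes back into str.
    return text.encode('ascii', errors='ignore').decode('ascii').strip()
```

It can be used as a drop-in replacement inside the existing list comprehension, for example `[uclean(cell.value) for cell in row]`.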

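As for your openpyxl question, I'll have to get back to you. The only thing I can think of right now is that it may be possible to load just one worksheet at a time rather than the whole workbook. Bear in mind that since wb is local to the loop, it is replaced with a new object on every iteration, so it's not as if you will use an extra 600 MB of memory for each file.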
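
One way to keep only a single workbook alive at a time is sketched below, assuming Python 3 and a reasonably recent openpyxl (one where iter_rows accepts values_only and read-only workbooks have close()). The helper names csv_name and workbook_to_csv and the out_dir parameter are hypothetical; the file naming follows the question's base name plus cleaned sheet title.

```python
import csv
import os
import re


def fclean(name):
    """Keep only word characters so a sheet title is safe inside a file name."""
    return re.sub(r'\W', '', name)


def csv_name(xlsx_path, sheet_title):
    """Build '<workbook base name>_<cleaned sheet title>.csv'."""
    base = os.path.splitext(os.path.basename(xlsx_path))[0]
    return base + '_' + fclean(sheet_title) + '.csv'


def workbook_to_csv(xlsx_path, out_dir='.'):
    """Stream every sheet of one workbook to its own CSV file, then release the workbook."""
    # Third-party dependency, imported here so the helpers above work without it.
    from openpyxl import load_workbook

    wb = load_workbook(xlsx_path, read_only=True)
    try:
        for sheet in wb.worksheets:
            target = os.path.join(out_dir, csv_name(xlsx_path, sheet.title))
            if os.path.exists(target):
                continue  # already converted on an earlier run
            with open(target, 'w', newline='', encoding='utf-8') as outf:
                writer = csv.writer(outf)
                # values_only=True yields plain tuples of values instead of cell objects.
                for row in sheet.iter_rows(values_only=True):
                    writer.writerow(['' if v is None else v for v in row])
    finally:
        # Read-only workbooks keep the underlying file open until closed.
        wb.close()
```

Using sheet.title directly avoids the index lookup through wb.worksheets.index(sheet), and closing the workbook in a finally block releases the file handle even if one sheet fails.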

Answer 2

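Read-only mode really does read one cell at a time, so memory use is minimal. However, given that you want to convert all text to ascii, I wonder whether the cause is that the Excel files contain a lot of text. Excel uses an optimisation in which all strings are stored in one big list that cells refer to. If you have a lot of unique strings, they may well be the source of any memory issues, because we have to keep them in memory in order to read them.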
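
A quick way to check whether the shared string list is what makes a file heavy is to look at its size before loading the workbook. The sketch below uses only the standard library and assumes the string table lives at its conventional location inside the .xlsx archive, xl/sharedStrings.xml; the function name shared_strings_size is hypothetical.

```python
import zipfile


def shared_strings_size(xlsx_path):
    """Return the uncompressed size in bytes of the shared string table, or 0 if absent."""
    # An .xlsx file is a zip archive; only the archive index is read here, not the data.
    with zipfile.ZipFile(xlsx_path) as zf:
        try:
            return zf.getinfo('xl/sharedStrings.xml').file_size
        except KeyError:
            # No shared string part at the conventional path.
            return 0
```

A result in the hundreds of megabytes would suggest that the strings, rather than the cell iteration, account for most of the memory.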

Regarding the conversion: you can probably use a wrapper to save as UTF-8, which would let you drop the inline encoding altogether.
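
In Python 3 the "wrapper" can simply be the encoding argument of open(), so the text reaches the CSV file unchanged and no per-cell cleaning is needed. This is a sketch under that assumption (the question's code looks like Python 2, where a codecs-based wrapper would be needed instead); write_rows_utf8 is a hypothetical helper name.

```python
import csv


def write_rows_utf8(path, rows):
    """Write an iterable of rows to path as UTF-8 encoded CSV."""
    # newline='' lets the csv module control line endings, as its documentation recommends.
    with open(path, 'w', newline='', encoding='utf-8') as outf:
        writer = csv.writer(outf)
        for row in rows:
            # Empty cells (None) become empty fields; everything else is written as-is.
            writer.writerow(['' if value is None else value for value in row])
```

If the CSV files are meant to be reopened in Excel, encoding='utf-8-sig' may be the better choice, since the leading byte order mark helps Excel detect the encoding.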
