Iterating over columns in a read-only workbook in openpyxl

Question

I have a somewhat large .xlsx file: 19 columns, 5185 rows. I want to open the file, read all the values in one column, do some stuff to those values, and then create a new column in the same workbook and write out the modified values. Thus, I need to be able to both read and write in the same file.

My original code did this:

def readExcel(doc):
    wb = load_workbook(generalpath + exppath + doc)
    ws = wb["Sheet1"]

    # iterate through the columns to find the correct one
    for col in ws.iter_cols(min_row=1, max_row=1):
        for mycell in col:
            if mycell.value == "PerceivedSound.RESP":
                origCol = mycell.column

    # get the column letter for the first empty column to output the new values
    newCol = utils.get_column_letter(ws.max_column+1)

    # iterate through the rows to get the value from the original column,
    # do something to that value, and output it in the new column
    for myrow in range(2, ws.max_row+1):
        myrow = str(myrow)
        # do some stuff to make the new value
        cleanedResp = doStuff(ws[origCol + myrow].value)
        ws[newCol + myrow] = cleanedResp

    wb.save(doc)

However, Python threw a memory error after row 3853 because the workbook was too big. The openpyxl docs said to use Read-only mode ( https://openpyxl.readthedocs.io/en/latest/optimized.html ) to handle big workbooks. I'm now trying to use that; however, there seems to be no way to iterate through the columns when I add the read_only=True param:

def readExcel(doc):
    wb = load_workbook(generalpath + exppath + doc, read_only=True)
    ws = wb["Sheet1"]

    for col in ws.iter_cols(min_row=1, max_row=1):
        #etc.

Python throws this error: AttributeError: 'ReadOnlyWorksheet' object has no attribute 'iter_cols'

If I change the final line in the above snippet to:

for col in ws.columns:

Python throws a similar error: AttributeError: 'ReadOnlyWorksheet' object has no attribute 'columns'

Iterating over rows is fine (and is included in the documentation I linked above):

for col in ws.rows:

(no error)

This question asks about the AttributeError, but the solution is to remove Read-only mode, which doesn't work for me because openpyxl won't read my entire workbook outside Read-only mode.

So: how do I iterate through columns in a large workbook?

And I haven't yet encountered this, but I will once I can iterate through the columns: how do I both read and write the same workbook, if said workbook is large?

Thanks!

Answer 1

If the worksheet has only around 100,000 cells then you shouldn't have any memory problems. You should probably investigate this further.

iter_cols() is not available in read-only mode because it requires constant and very inefficient reparsing of the underlying XML file. It is, however, relatively easy to convert rows into columns from iter_rows() using zip.

def _iter_cols(self, min_col=None, max_col=None, min_row=None,
               max_row=None, values_only=False):
    yield from zip(*self.iter_rows(
        min_row=min_row, max_row=max_row,
        min_col=min_col, max_col=max_col, values_only=values_only))

import types
for sheet in workbook:
    sheet.iter_cols = types.MethodType(_iter_cols, sheet)
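To see why the zip trick works, here is a minimal sketch on plain tuples standing in for what iter_rows(values_only=True) would yield. The sample data and the helper name rows_to_columns are illustrative assumptions, not part of openpyxl.

```python
def rows_to_columns(rows):
    """Transpose an iterable of equal-length rows into a list of column tuples."""
    # zip(*rows) groups the i-th item of every row, so each result is column i.
    # Unpacking with * consumes every row before zip starts producing output.
    return list(zip(*rows))


# Stand-in for the tuples a read-only worksheet would yield row by row.
sample_rows = [
    ("id", "PerceivedSound.RESP"),
    (1, "loud"),
    (2, "quiet"),
]

columns = rows_to_columns(sample_rows)
print(columns[1])
```

Because the rows are unpacked up front, all rows in the requested range are read before the first column is produced, so this trades streaming for convenience. zip also stops at the shortest row, so ragged rows would be silently truncated.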

Answer 2

According to the documentation, ReadOnly mode only supports row-based reads (column reads are not implemented). But that's not hard to solve:

wb2 = Workbook(write_only=True)
ws2 = wb2.create_sheet()

# find what column I need
colcounter = 0
for row in ws.rows:
    for cell in row:
        if cell.value == "PerceivedSound.RESP":
            break
        colcounter += 1
    
    # cells are apparently linked to the parent workbook meta
    # this will retain only values; you'll need custom
    # row constructor if you want to retain more

    row2 = [cell.value for cell in row]
    ws2.append(row2) # preserve the first row in the new file
    break # stop after first row

# start at row 2: the header row was already copied above
for row in ws.iter_rows(min_row=2):
    row2 = [cell.value for cell in row]
    row2.append(doStuff(row2[colcounter]))
    ws2.append(row2) # write a new row to the new wb
    
wb2.save('newfile.xlsx')
wb.close()
wb2.close()

# copy `newfile.xlsx` to `generalpath + exppath + doc`
# Either using os.system,subprocess.popen, or shutil.copy2()
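The row-by-row logic above can also be pulled out into a small generator, which keeps it testable without a workbook. This is only a sketch: with_derived_column, the "CleanedResp" header, and the clean function are hypothetical names standing in for the asker's doStuff and column choice.

```python
def with_derived_column(rows, source_header, new_header, transform):
    """Yield each row as a list with one extra, derived column appended.

    rows is any iterable of row tuples or lists whose first row is the header.
    """
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return  # nothing to process
    header = list(header)
    col_index = header.index(source_header)  # ValueError if the header is missing
    yield header + [new_header]
    for row in rows:
        values = list(row)
        yield values + [transform(values[col_index])]


def clean(value):
    # Hypothetical stand-in for doStuff.
    return value.strip().lower()


sample_rows = [
    ("id", "PerceivedSound.RESP"),
    (1, " Loud "),
    (2, "quiet"),
]

result = list(
    with_derived_column(sample_rows, "PerceivedSound.RESP", "CleanedResp", clean)
)
print(result)
```

With openpyxl, the rows argument could be ws.iter_rows(values_only=True) from the read-only worksheet, and each yielded list could be passed to ws2.append() on the write-only worksheet, so only one row is held in memory at a time.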

You will not be able to write to the same workbook, but as shown above you can open a new workbook (in write-only mode), write to it, and overwrite the old file using an OS copy.
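The final overwrite step might look like the following sketch, which uses os.replace from the standard library rather than a shell command. The file names are placeholders, and the demonstration uses throwaway text files in a temporary directory instead of real workbooks.

```python
import os
import tempfile


def replace_file(new_path, original_path):
    """Move new_path over original_path, replacing it if it already exists."""
    # os.replace overwrites the destination. It can fail if the two paths
    # are on different filesystems, so keep both files in the same directory.
    os.replace(new_path, original_path)


# Throwaway files standing in for the original and the rewritten workbook.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.xlsx")
    new = os.path.join(tmp, "newfile.xlsx")
    with open(original, "w") as f:
        f.write("old contents")
    with open(new, "w") as f:
        f.write("new contents")

    replace_file(new, original)

    with open(original) as f:
        final_contents = f.read()
    new_still_exists = os.path.exists(new)

print(final_contents, new_still_exists)
```

The read-only workbook keeps the source file open, so it should be closed with wb.close() before the replacement; on Windows in particular, replacing a file that is still open is likely to fail.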
