Openpyxl optimized write-only mode significantly increases the disk usage of the Excel file

Question

A large production program was exporting lots of data into a spreadsheet using Openpyxl. It was very slow for large systems (e.g. 4 hours). I realized that I could use Openpyxl's optimized write-only mode to cut the export time dramatically (~7 seconds). As far as I can tell, this was done correctly: the spreadsheets contain the same data and, according to LibreOffice, have exactly the same number of cells.

The problem lies in the disk usage of the Excel file. The older, slower method produced a file of ~4MB, while the new optimized mode produces one of ~8MB.

What I have looked into

A possible difference in how string references are shared when there are lots of duplicate strings (which are prevalent in my data). I found no difference in file size between the two methods, whether the cells held a large number of identical strings or a large number of unique strings (code used below).

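from openpyxl import Workbook

# Write-only (optimized) mode: each row is appended as a list of values.
wb = Workbook(write_only=True)
ws = wb.create_sheet(title='mem')

# 10,000 rows x 200 columns, every cell holding the same string.
for irow in range(10000):
    ws.append(['hi'] * 200)
wb.save('opt.xlsx')

#####################################################

# Default mode: each value is assigned through a cell object.
wb2 = Workbook()
sheet = wb2.active
sheet.title = "mem2"

# Same 10,000 x 200 grid; rows and columns are 1-based here.
for irow in range(1, 10001):
    for column in range(1, 201):
        cell = sheet.cell(row=irow, column=column)
        cell.value = 'hi'
wb2.save('nonopt.xlsx')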

Both methods produced a spreadsheet of the same size.

Opening the optimized spreadsheet in LibreOffice and then saving it in XML format cuts its size to almost match the non-optimized spreadsheet.

Answers I'm looking for

Since re-saving the spreadsheet reduces its size, my guess is that the file contains either some wasteful metadata or many empty cells that LibreOffice automatically removes. I cannot see how either would be produced simply by switching to write-only mode and storing the values in per-row lists instead of cell objects. So I am looking for:

How to test my hypotheses about the disk usage, as I am not sure how to go about checking them.
Other possible reasons why the disk usage is larger.
Whether I tested the string references incorrectly.

If the code itself turns out to be really needed, I can attempt to put together a small demo of it, but currently it is heavily intermingled with code I cannot share. I cannot share the spreadsheets either. Because of this, I don't expect the answers to pin down the issue with 100% certainty, but they may lead me towards confirming it and updating the post.

Thank you.

Answer 1

As of version 2.6, openpyxl uses inline strings for everything, because this allows worksheets to be streamed, which is faster and uses less memory. It also avoids the need to manage duplicate strings. The resulting XML is perfectly valid but somewhat bloated. MS Excel and OpenOffice have optimised libraries for strings, but this is entirely optional. The output file size is not really relevant, but it's worth noting that the file format is optimised for numbers; things like strings and dates are definitely second-class citizens.
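One way to check which kind of string storage a given file uses is to look at the `t` attribute on the cell (`c`) elements of the worksheet XML: in SpreadsheetML, `t="inlineStr"` marks an inline string and `t="s"` marks an index into the shared string table. The sketch below uses only the standard library. The function name `count_cell_types` and the default part name `xl/worksheets/sheet1.xml` are assumptions for illustration; the worksheet part may be named differently in a particular file.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by SpreadsheetML worksheet parts.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def count_cell_types(xlsx_path, sheet_part="xl/worksheets/sheet1.xml"):
    """Count cells in one worksheet part, grouped by their `t` attribute.

    `sheet_part` is an assumed default; pass the real part name if it differs.
    """
    counts = Counter()
    with zipfile.ZipFile(xlsx_path) as archive:
        with archive.open(sheet_part) as stream:
            # Stream the XML so large sheets are not loaded all at once.
            for _event, elem in ET.iterparse(stream):
                if elem.tag == MAIN_NS + "c":
                    # A missing `t` attribute means the default type, "n" (number).
                    counts[elem.get("t", "n")] += 1
                elif elem.tag == MAIN_NS + "row":
                    elem.clear()  # drop the finished row's children
    return counts
```

Running this on both `opt.xlsx` and `nonopt.xlsx` from the question and comparing the `"inlineStr"` and `"s"` counts would show whether the two files store their strings differently.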

Answer 2

After viewing the XML files for the spreadsheet (an xlsx file is a zip archive containing XML files), I found that many empty strings were being included, where they could normally just be omitted.
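The inspection described above can be sketched with the standard library alone: list the archive members with their sizes, then count how many cell elements in a worksheet carry no content. This is an illustrative sketch, not the exact procedure used in the answer; the helper names `member_sizes` and `count_empty_cells` and the default part name `xl/worksheets/sheet1.xml` are assumptions.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def member_sizes(xlsx_path):
    """Return (name, compressed_bytes, uncompressed_bytes), largest first."""
    with zipfile.ZipFile(xlsx_path) as archive:
        infos = archive.infolist()
    rows = [(i.filename, i.compress_size, i.file_size) for i in infos]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def count_empty_cells(xlsx_path, sheet_part="xl/worksheets/sheet1.xml"):
    """Return (total_cells, cells_with_no_text_content) for one worksheet part."""
    total = 0
    empty = 0
    with zipfile.ZipFile(xlsx_path) as archive, archive.open(sheet_part) as stream:
        for _event, elem in ET.iterparse(stream):
            if elem.tag == MAIN_NS + "c":
                total += 1
                # A cell with no value, formula, or inline text has no text at all.
                if not "".join(elem.itertext()):
                    empty += 1
            elif elem.tag == MAIN_NS + "row":
                elem.clear()  # free the finished row
    return total, empty
```

A large share of empty cells in the optimized file, and few or none in the non-optimized one, would point to padding values being written out as cells.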

After further digging I found that I was padding empty cells with '' instead of None. Fixing this fixed the issue.
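The effect of the padding value can be reproduced with a small experiment. The sketch below writes two write-only workbooks that differ only in whether the padding cells are `''` or `None`, and returns the two file sizes. The helper names, row counts, and column counts are made up for illustration, and the exact sizes will depend on the openpyxl version.

```python
import os
import tempfile

from openpyxl import Workbook


def write_padded(path, pad, rows=500, filled=5, padding=45):
    """Write a write-only workbook whose rows end in `padding` copies of `pad`.

    All sizes here are arbitrary illustration values.
    """
    wb = Workbook(write_only=True)
    ws = wb.create_sheet(title="demo")
    for _ in range(rows):
        ws.append(["hi"] * filled + [pad] * padding)
    wb.save(path)
    return os.path.getsize(path)


def compare_padding():
    """Return (size_with_empty_strings, size_with_none) in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        size_empty_string = write_padded(os.path.join(tmp, "pad_str.xlsx"), "")
        size_none = write_padded(os.path.join(tmp, "pad_none.xlsx"), None)
    return size_empty_string, size_none


if __name__ == "__main__":
    with_str, with_none = compare_padding()
    print(f"padded with '': {with_str} bytes")
    print(f"padded with None: {with_none} bytes")
```

If the file padded with `''` comes out larger, that matches the explanation above: empty strings are still written as cells, while `None` values are skipped.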

Answer 1
1 2019-09-07 11:35:07

Answer 2
0 Accepted 2019-09-11 21:29:13
