
Openpyxl optimized write-only mode significantly increases the disk usage of the Excel file

A large production program was exporting lots of data into a spreadsheet using Openpyxl. It was very slow for large systems (e.g. 4 hours). I realized that I could use Openpyxl's optimized write-only mode to cut the run time significantly (to ~7 seconds). As far as I can tell, this was done correctly: the spreadsheets contain the same data and, according to LibreOffice, have exactly the same number of cells.

The problem lies in the disk usage of the Excel file. The older, slower method used ~4MB of disk space, while the new optimized mode uses ~8MB.

What I have looked into

  • Possible difference in how string references are shared when there are lots of duplicate strings (which are prevalent). Comparing the two methods, I found no difference in disk usage for either a large number of identical strings or a large number of unique strings. (code used below)
from openpyxl import Workbook

# Write-only (optimized) mode: rows are appended as lists and streamed to disk.
wb = Workbook(write_only=True)
ws = wb.create_sheet(title='mem')

for irow in range(10000):
    ws.append(['hi'] * 200)
wb.save('opt.xlsx')

# Normal mode: the same 10000 x 200 grid, filled cell by cell.
wb2 = Workbook()
sheet = wb2.active
sheet.title = "mem2"

for irow in range(1, 10001):
    for column in range(1, 201):
        cell = sheet.cell(row=irow, column=column)
        cell.value = 'hi'
wb2.save('nonopt.xlsx')

Both scripts produced a spreadsheet of the same size.

  • Opening the spreadsheet in LibreOffice, then saving it in XML format, reduces the file size of the optimized spreadsheet to almost match that of the non-optimized spreadsheet.

Answers I'm looking for

Since re-saving the spreadsheet reduces its size, my guess is that there is either some wasteful metadata or many empty cells that LibreOffice automatically removes. I cannot understand how either would be produced simply by switching to write-only mode and storing the values for each row in lists instead of in cell objects. So I am looking for:

  • How to test these hypotheses about the disk usage, as I am not sure how to go about checking them.

  • Other possible reasons why the disk usage is larger.

  • Whether I tested the string references incorrectly.

If it turns out that the code itself is really needed, I can attempt to put together a small demo of it, but currently it is very intermingled with code I cannot share. I cannot share the spreadsheets either. Because of this, I don't expect the answers to pin down the issue with 100% certainty, but they may lead me towards confirming it and updating the post.

Thank you.

As of version 2.6, openpyxl uses inline strings for everything, because this allows worksheets to be streamed, which is faster and uses less memory. The resulting XML is perfectly valid, if somewhat bloated, and it avoids the need to manage duplicate strings. MS Excel and OpenOffice have optimised libraries for strings, but this is entirely optional. The output file size is not really relevant, but it's worth noting that the file format is optimised for numbers, and things like strings and dates are definitely second-class citizens.

After viewing the XML files for the spreadsheet (an xlsx file is a zip archive of XML files), I found that many empty strings were being included where they could normally just be omitted.
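One way to run this kind of inspection without unzipping by hand is sketched below, using only the standard library. The function name `summarize_xlsx` and the notion of a "valueless" cell (a `<c>` element with no text anywhere inside it) are my own for illustration; the sketch also assumes the worksheet parts live under `xl/worksheets/`, which is the usual layout.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def summarize_xlsx(path_or_file):
    """Report per-member sizes and, for worksheet parts, cell counts.

    Hypothetical helper: a cell is counted as "valueless" when its <c>
    element contains no text at all (no <v>, no inline string text).
    """
    report = {}
    with zipfile.ZipFile(path_or_file) as zf:
        for info in zf.infolist():
            name = info.filename
            entry = {
                "compressed": info.compress_size,
                "uncompressed": info.file_size,
            }
            if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                cells = valueless = 0
                with zf.open(info) as fh:
                    # Stream the XML so large sheets are not loaded at once.
                    for _event, el in ET.iterparse(fh):
                        if el.tag == NS + "c":
                            cells += 1
                            if not "".join(el.itertext()).strip():
                                valueless += 1
                        elif el.tag == NS + "row":
                            el.clear()  # free the finished row's cells
                entry["cells"] = cells
                entry["valueless_cells"] = valueless
            report[name] = entry
    return report
```

Calling `summarize_xlsx('opt.xlsx')` and `summarize_xlsx('nonopt.xlsx')` and comparing the `cells` and `valueless_cells` counts for each sheet should show whether one file carries cell elements that the other omits.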

After further digging, I found that I was padding empty cells with '' instead of None. Fixing this resolved the issue.
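A minimal sketch of that fix, assuming rows are built as plain lists before being appended. The helper names `blank_to_none` and `save_write_only` are hypothetical and not part of openpyxl; only `Workbook(write_only=True)`, `create_sheet`, `append` and `save` are the library's own calls, as used in the question.

```python
def blank_to_none(row):
    """Replace empty-string padding with None so no cell is written for it."""
    return [None if isinstance(v, str) and v == "" else v for v in row]


def save_write_only(path, rows):
    """Write rows in write-only mode, normalizing blank padding first."""
    # Imported here so blank_to_none can be used without openpyxl installed.
    from openpyxl import Workbook

    wb = Workbook(write_only=True)
    ws = wb.create_sheet(title="mem")
    for row in rows:
        ws.append(blank_to_none(row))
    wb.save(path)
```

For example, `save_write_only('opt.xlsx', [['hi', '', 3], ['', '', 'x']])` should write only the three non-blank cells; the blank positions are left out of the sheet rather than stored as empty strings.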
