
openpyxl performance in read-only mode

I have a question about the performance of openpyxl when reading files.

I am trying to read the same xlsx file using a ProcessPoolExecutor; a single file may have 500,000 to 800,000 rows.

In read-only mode, calling sheet.iter_rows() without ProcessPoolExecutor and reading the entire worksheet, it takes about 1 s to process 10,000 rows of data. But when I set the min_row and max_row parameters and use ProcessPoolExecutor, the situation is different:

totalRows: 200,000
1 ~ 10000 take 1.03s
10001 ~ 20000 take 1.73s
20001 ~ 30000 take 2.41s
30001 ~ 40000 take 3.27s
40001 ~ 50000 take 4.06s
50001 ~ 60000 take 4.85s
60001 ~ 70000 take 5.93s
70001 ~ 80000 take 6.64s
80001 ~ 90000 take 7.72s
90001 ~ 100000 take 8.18s
100001 ~ 110000 take 9.42s
110001 ~ 120000 take 10.04s
120001 ~ 130000 take 10.61s
130001 ~ 140000 take 11.17s
140001 ~ 150000 take 11.52s
150001 ~ 160000 take 12.48s
160001 ~ 170000 take 12.52s
170001 ~ 180000 take 13.01s
180001 ~ 190000 take 13.25s
190001 ~ 200000 take 13.46s
total: take 33.54s

Obviously, looking at each process individually, the time consumed is indeed less. But the overall time consumption has increased, and the further back the range, the more time each process consumes. Reading all 200,000 rows with a single process takes only about 20 s.

I'm not very familiar with iterators and haven't looked closely at the source code of openpyxl. Judging from the timings, even when a range is set, the iterator still seems to have to start processing from row 1. I don't know whether that is actually the case.
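That hypothesis can be checked directly. The sketch below (file name and sizes are made up for the demo; the real file in the question has 200,000 rows) builds a small workbook and times `iter_rows()` over the first and last 1,000-row windows. In read-only mode openpyxl streams the sheet XML front to back, so reaching a late window still requires scanning everything before it, and the second timing comes out noticeably larger:

```python
import openpyxl
from time import perf_counter

# Build a small throwaway workbook (sizes are illustrative only)
wb = openpyxl.Workbook(write_only=True)
ws = wb.create_sheet()
for i in range(1, 20_001):
    ws.append([i, f"row {i}"])
wb.save("demo.xlsx")


def timed_read(min_row, max_row):
    book = openpyxl.load_workbook("demo.xlsx", read_only=True)
    sheet = book.worksheets[0]
    t0 = perf_counter()
    values = [row[0].value
              for row in sheet.iter_rows(min_row=min_row, max_row=max_row)]
    book.close()
    return values, perf_counter() - t0


first, t_first = timed_read(1, 1_000)          # window at the start
last, t_last = timed_read(19_001, 20_000)      # window at the end
print(f"rows 1-1000: {t_first:.3f}s, rows 19001-20000: {t_last:.3f}s")
```

Both calls return exactly 1,000 rows, yet the second one has to parse 19,000 rows of XML to get there, which is precisely the pattern in the timing table above.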

I'm not a professional programmer; if you happen to have relevant experience, please keep it as simple as possible.

Code is here:

import openpyxl
from time import perf_counter
from concurrent.futures import ProcessPoolExecutor


def read(file, minRow, maxRow):
    start = perf_counter()
    # Each worker process opens its own read-only handle on the same file
    book = openpyxl.load_workbook(filename=file, read_only=True,
                                  keep_vba=False, data_only=True,
                                  keep_links=False)
    sheet = book.worksheets[0]
    val = [[cell.value for cell in row]
           for row in sheet.iter_rows(min_row=minRow, max_row=maxRow)]
    book.close()
    end = perf_counter()
    print(f'{minRow} ~ {maxRow}', 'take {0:.2f}s'.format(end - start))
    return val


def getRowRanges(file):
    # Not shown in the original post; assumed to split the 200,000-row
    # sheet into 10,000-row chunks: [(1, 10000), (10001, 20000), ...]
    return [(i, i + 9_999) for i in range(1, 200_001, 10_000)]


def parallel(file: str, rowRanges: list[tuple]):
    futures = []
    with ProcessPoolExecutor(max_workers=6) as pool:
        for minRow, maxRow in rowRanges:
            futures.append(pool.submit(read, file, minRow, maxRow))
    return futures


if __name__ == '__main__':
    file = '200000.xlsx'
    start = perf_counter()
    tasks = getRowRanges(file)
    parallel(file, tasks)
    end = perf_counter()
    print('total: take {0:.2f}s'.format(end - start))

Q :
"... a question about the performance ..."
"... please try to be as simple as possible ..."

A :
Having 6 Ferrari sport racing cars ( ~ max_workers = 6 )
does not provide a warranty to move 6 drivers ( ~ The Workload )
from start to the end
in 1 / 6 of the time.

That does not work,
even if we have a 6-lane-wide racing track ( which we have not ): as you have already reported, there is a bottleneck ( a 1-lane-wide-only bridge on the way from the start to the end of the race ).

Actually,
there are more performance-devastating bottlenecks ( The Bridge as the main performance blocker, and a few smaller, less blocking, nevertheless further performance-degrading bridges ), some avoidable, some not :

  • the file-I/O has been no faster than ~ 10k [rows/s] in a pure solo serial run,
    so never expect the same speed to appear "across" the same ( single, 1-lane ) bridge ( the shared file-I/O hardware interface ) for any next, concurrently running Ferrari competing for the same resource, which is already used by the first process reading from the file ( real hardware latencies matter, a lot ... the Devil is in the details )

  • another, avoidable, degradation comes with the expensive add-on costs paid for each and every list.append() . Here, try to choose a different object: avoid list-based storage altogether, pre-allocate a block-storage ( one-time-paid RAM-allocation costs ) with the advantage of a known size of the result-storage, and keep storing data on-the-fly, best in cache-line-respectful blocks rather than incrementally ( might be too technical, yet if performance is to get maxed up, these details matter )

  • the dual-iterator SLOC is nice for a workbook example, yet if performance is your focus, try to find another way, perhaps using an even simpler XLS-reader ( without as much machinery under the hood, such as a VBA interpreter et al. ), which can export the row-wise consumed cells into plain text that can get collected way, way faster than the as-is code does in a triplet-of-nested-iterators "syntax-sugared" SLOC
    [ [ ... for cell in row ] for row in sheet.iterator(...) ]

  • last come also the process-instantiation costs, which enter the revised Amdahl's Law, reformulated so that it takes into account also the overheads and the atomicity of (blocks of) work. For ( technically independent ) details see this and these - where interactive speedup-simulator calculators are often linked, to test the principal ceiling that any such parallelisation efforts will never be able to overcome.

  • Last, but by no means the least - The MEMORY: take your .xlsx file size, multiply it by ~ 50x, and next by 6 workers ~ that amount of physical memory is expected to be used ( see the docs: "Memory use is fairly high in comparison with other libraries and applications and is approximately 50 times the original file size, eg 2.5 GB for a 50 MB Excel file" - credit to @Charlie Clark ). If your system does not have that much physical RAM, the O/S starts to suffocate trying to allocate that much and goes into a RAM-swap-"thrashing" mode ( moving blocks of RAM to the disk-swap area and back, there and back again, as it interleaves the 6 workers going forward in a Virtual-Memory-managed address space simulated inside a small physical RAM at awfully high ( more than 5(!) orders of magnitude longer ) disk-I/O latencies ), trying to cross the already blocking performance bottleneck, yeah - The Bridge ... where the traffic jam is already at its max, as 6 workers try to do the very same - move some more data across the even more blocked bottleneck - all that with an awfully great, skyrocketing latency for doing so ( see the URL on latencies above ). A hint may, yet need not, save us; plus this and this may reduce, better straight prevent, further inefficiencies
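The list.append() point above can be sketched in a few lines. When the result size is known in advance, a one-time pre-allocation replaces the incremental growth of an appending loop (names here are illustrative, not from the original code):

```python
def collect_appending(source):
    out = []
    for v in source:           # grows the list incrementally,
        out.append(v)          # paying reallocation costs as it goes
    return out


def collect_preallocated(source, n):
    out = [None] * n           # one-time allocation of a known size
    for i, v in enumerate(source):
        out[i] = v
    return out


print(collect_preallocated(range(5), 5))  # [0, 1, 2, 3, 4]
```

Both produce the same result; the second pays its memory cost once up front, which matters most when `n` is in the hundreds of thousands.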
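On the "simpler reader" point: a .xlsx file is just zipped XML, so for purely numeric data one can stream the sheet XML with xml.etree.iterparse and skip openpyxl's cell machinery entirely. This is a rough sketch under narrow assumptions - it handles only inline numeric <v> values (no shared strings, dates, or formulas), and the demo file and sizes are made up:

```python
import zipfile
import xml.etree.ElementTree as ET

import openpyxl

# Make a small numeric-only demo file (illustrative sizes)
wb = openpyxl.Workbook(write_only=True)
ws = wb.create_sheet()
for i in range(1, 1_001):
    ws.append([i, i * 2.5])
wb.save("numbers.xlsx")

# Default namespace used inside SpreadsheetML worksheet XML
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

rows = []
with zipfile.ZipFile("numbers.xlsx") as z:
    with z.open("xl/worksheets/sheet1.xml") as f:
        for _event, elem in ET.iterparse(f):
            if elem.tag == NS + "row":
                # collect the <v> (value) children of this row
                rows.append([float(v.text) for v in elem.iter(NS + "v")])
                elem.clear()   # keep memory flat while streaming

print(len(rows), rows[0], rows[-1])
```

This drops all type coercion and styling that openpyxl performs per cell, which is exactly why it can be faster - and also why it is only safe when you know the sheet contains plain numbers.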
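The memory bullet above, in plain numbers, using the ~50x figure quoted there from the openpyxl docs:

```python
file_mb = 50                   # example file size from the docs quote
per_worker_mb = file_mb * 50   # ~50x expansion per open workbook
workers = 6
total_mb = per_worker_mb * workers
print(total_mb)                # 15000 -> ~15 GB of physical RAM demanded
```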

I believe I have the same problem as the OP.

The puzzling part is that once min_row and max_row are set on sheet.iter_rows() , concurrent execution does not apply any more, as if some sort of global lock were in effect.

The following code tries to dump the data from one single large sheet in an Excel file. The idea is to take advantage of min_row and max_row on sheet.iter_rows to lock down a reading window, and a ThreadPoolExecutor for concurrent execution.

from concurrent import futures
from pathlib import Path

from openpyxl import load_workbook

# Module-level paths (values are assumed; not shown in the original post)
_file = Path('source.xlsx')
_dst = Path('dst')


# artificially create a set of row index ranges,
# 10,000 rows per set till 1,000,000th row
# something like [(1, 10_000), (10_001, 20_000), .....]
def _ranges():
    _i = 1
    _n = 10_000
    while _i <= 1_000_000:
        yield _i, _i + _n - 1
        _i += _n


def write_to_file(file, mn, mx):
    print(f'write to file {mn}-{mx}')
    wb = load_workbook(file, read_only=True,
                       data_only=True, keep_links=False, keep_vba=False)
    sheet = wb[wb.sheetnames[0]]

    out_file = _dst / f"{mn}-{mx}.txt"
    row_count = 1
    with out_file.open('w', encoding='utf8') as f:

        rows = sheet.iter_rows(values_only=True, min_row=mn, max_row=mx)

        for row in rows:
            print(f'section {mn}-{mx} write {row_count}')
            f.write(' '.join([str(c).replace('\n', ' ') for c in row]) + '\n')
            row_count += 1


def main():
    fut = []
    with futures.ThreadPoolExecutor() as ex:
        for mn, mx in _ranges():
            fut.append(ex.submit(write_to_file, _file, mn, mx))

    futures.wait(fut)

All the write_to_file() calls do kick off at once.

Iteration over the rows, however, seems to behave in a strictly sequential fashion.

With a little change:

def write_to_file(file, mn, mx):
    print(f'write to file {mn}-{mx}')
    wb = load_workbook(file, read_only=True
                       , data_only=True, keep_links=False, keep_vba=False)
    sheet = wb[wb.sheetnames[0]]

    out_file = _dst / f"{mn}-{mx}.txt"
    row_count = 1
    with out_file.open('w', encoding='utf8') as f:

        rows = sheet.iter_rows(values_only=True)
                              # ^^^^^^^^^^^^^^^^^___min_row/max_row not set

        for row in rows:
            print(f'section {mn}-{mx} write {row_count}')
            f.write(' '.join([str(c).replace('\n', ' ') for c in row]) + '\n')
            row_count += 1

Section 20001-30000 writes first!

The chaotic effect of concurrent execution takes place.

But without min_row and max_row , there is no point in having concurrent execution at all.
