
How to speed up XBRL file parsing in python lxml?

I am trying to parse an XBRL file (1.35 GB) via arelle. During debugging I noticed that execution stalls on line ModelDocument.py:157 for more than 30 minutes. The Python process takes about 8 GB of RAM and its memory consumption slowly increases:

[screenshot: memory usage of the Python process]

It looks like Python parses the XML at 20-50 KB/s, which is extremely slow, especially if we take into account that lxml is C-optimized code. Note also that one core is loaded at 100%, so the CPU is doing some heavy work (but what exactly?)

Any ideas how XBRL parsing can be sped up?

System: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05)

Maybe my answer will be more relevant, in the long term, to developers of XBRL processors, but I would encourage taking a look at what makes an instance streaming-friendly, notably the following candidate recommendation by XBRL International:

https://specifications.xbrl.org/work-product-index-streaming-extensions-streaming-extensions-1.0.html

Producing and consuming large XBRL instances in a streaming fashion helps avoid the issue of being stuck at the parsing call: instead of loading and parsing the entire instance in bulk, streaming reduces pressure on memory, as the facts can be converted on the fly to the processor's internal memory structure.
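As a rough illustration of this streaming approach (not Arelle's actual implementation), the sketch below walks an XML document incrementally with `iterparse`, handling each fact-like element as soon as it is fully parsed and then clearing it so memory stays flat. It uses the stdlib `xml.etree.ElementTree`; `lxml.etree` exposes a compatible `iterparse` API. The tiny inline instance and its element names are made up for the example.

```python
import io
import xml.etree.ElementTree as ET  # lxml.etree offers the same iterparse() interface

# Hypothetical minimal XBRL-like instance; a real file would be gigabytes
# and would be opened from disk instead of an in-memory buffer.
sample = io.BytesIO(b"""<?xml version="1.0"?>
<xbrl xmlns="http://www.xbrl.org/2003/instance">
  <assets>100</assets>
  <liabilities>40</liabilities>
</xbrl>""")

fact_count = 0
for event, elem in ET.iterparse(sample, events=("end",)):
    # "end" fires once the element (a candidate fact) is fully parsed,
    # so it can be converted to an internal structure right away.
    if elem.tag != "{http://www.xbrl.org/2003/instance}xbrl":
        fact_count += 1
    # Release the element's children/text so the tree never grows large.
    elem.clear()

print(fact_count)
```

This prints `2` for the sample above. The key point is that peak memory is bounded by one element at a time rather than the whole document, which is what the streaming-extensions recommendation is designed to enable.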

In general, just streaming through 1-2 GB of data doing simple things takes much less than a minute. If it takes 30 minutes, there seems to be optimization potential in the processor's implementation. I do not think this is an issue only with Arelle, and I think that as more users open larger files, implementers will at some point start looking into this.
