
How to speed up XBRL file parsing in python lxml?

I am trying to parse an XBRL file (1.35 GB) via arelle. During debugging I noticed that execution stalls on line ModelDocument.py:157 for more than 30 minutes. The Python process takes about 8 GB of RAM and its memory consumption slowly increases:

[screenshot: memory usage of the Python process]

It looks like Python parses the XML at 20-50 KB/s, which is extremely slow, especially if we take into account that lxml is C-optimized code. Note also that one core is loaded at 100%, so the CPU is doing some heavy work (but what exactly?)

Any ideas how XBRL parsing can be sped up?

System: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05)

Maybe my answer will be more relevant, in the long term, to developers of XBRL processors, but I would encourage taking a look at what makes an instance streaming-friendly, notably the following candidate recommendation by XBRL International:

https://specifications.xbrl.org/work-product-index-streaming-extensions-streaming-extensions-1.0.html

Producing and consuming large XBRL instances in a streaming fashion helps avoid the issue of being stuck at the parsing call: instead of loading and parsing the entire instance in bulk, streaming reduces pressure on memory, as the facts can be converted on the fly to the processor's internal memory structure.
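As a rough illustration of this streaming approach (not Arelle's actual implementation), the sketch below walks an XML document incrementally with `iterparse`, handling each fact-like element as soon as it is fully parsed and then clearing it so memory stays flat. It uses the stdlib `xml.etree.ElementTree`; `lxml.etree` exposes a compatible `iterparse` API. The tiny inline instance and its element names are made up for the example.

```python
import io
import xml.etree.ElementTree as ET  # lxml.etree offers the same iterparse() interface

# Hypothetical minimal XBRL-like instance; a real file would be gigabytes
# and would be opened from disk instead of an in-memory buffer.
sample = io.BytesIO(b"""<?xml version="1.0"?>
<xbrl xmlns="http://www.xbrl.org/2003/instance">
  <assets>100</assets>
  <liabilities>40</liabilities>
</xbrl>""")

fact_count = 0
for event, elem in ET.iterparse(sample, events=("end",)):
    # "end" fires once the element (a candidate fact) is fully parsed,
    # so it can be converted to an internal structure right away.
    if elem.tag != "{http://www.xbrl.org/2003/instance}xbrl":
        fact_count += 1
    # Release the element's children/text so the tree never grows large.
    elem.clear()

print(fact_count)
```

This prints `2` for the sample above. The key point is that peak memory is bounded by one element at a time rather than the whole document, which is what the streaming-extensions recommendation is designed to enable.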

In general, just streaming through 1-2 GB of data doing simple things takes much less than a minute. If it takes 30 minutes, there seems to be optimization potential in the processor's implementation. I do not think this is an issue only with Arelle, and I think that as more users open larger files, implementers will at some point start looking into this.
