
How to split a large XBRL file?

I have an XBRL file which is ~50 GB. When I try to open it via Arelle I get a MemoryError. Is there a way to split an XBRL file into smaller pieces? Does the XBRL specification support this?

There is not an easy or standard way to split an XBRL file into smaller pieces, although there are ways it can be done. You could copy batches of facts into separate files, but when doing so, you'd need to make sure that you also copy the context and unit definitions referenced by those facts. This is made trickier by the fact that the contexts and units may appear before or after the facts that reference them, so you'd probably need to do it in multiple streaming parses.
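For illustration, here is a rough sketch of that multi-pass approach in Python with lxml. It is not production code: it assumes a flat instance (facts sitting directly under the root, with no tuples or footnotes), and the input file name and batch size are made up.

```python
# Rough sketch of a two-pass split with lxml's streaming parser.
# Assumptions (not from the original question): a flat instance with
# facts directly under <xbrli:xbrl> (no tuples, no footnotes), an
# illustrative input file name, and an arbitrary batch size.
from lxml import etree

XBRLI = "{http://www.xbrl.org/2003/instance}"
LINK = "{http://www.xbrl.org/2003/linkbase}"
SRC, BATCH = "huge-instance.xbrl", 100_000

def stream_children(path):
    """Yield each direct child of the root element, freeing memory as we go."""
    root = None
    for event, el in etree.iterparse(path, events=("start", "end")):
        if event == "start":
            if root is None:
                root = el          # remember the root element
        elif el.getparent() is root:
            yield el
            el.clear()             # free the subtree we just handled
            while el.getprevious() is not None:
                del root[0]        # drop already-processed siblings

# Pass 1: stash every context/unit definition (normally tiny next to the
# facts) plus the schemaRef, so pass 2 can copy only what it needs.
contexts, units, schema_refs = {}, {}, []
for el in stream_children(SRC):
    if el.tag == XBRLI + "context":
        contexts[el.get("id")] = etree.tostring(el)
    elif el.tag == XBRLI + "unit":
        units[el.get("id")] = etree.tostring(el)
    elif el.tag == LINK + "schemaRef":
        schema_refs.append(etree.tostring(el))

def flush(n, facts, ctx_refs, unit_refs):
    """Write one batch of facts plus only the contexts/units it references."""
    with open(f"part-{n:05d}.xbrl", "wb") as out:
        out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write(b'<xbrli:xbrl xmlns:xbrli='
                  b'"http://www.xbrl.org/2003/instance">\n')
        out.writelines(schema_refs)
        out.writelines(contexts[c] for c in sorted(ctx_refs) if c in contexts)
        out.writelines(units[u] for u in sorted(unit_refs) if u in units)
        out.writelines(facts)
        out.write(b"</xbrli:xbrl>\n")

# Pass 2: stream the facts, batch them, and emit one small instance per batch.
n, facts, ctx_refs, unit_refs = 0, [], set(), set()
for el in stream_children(SRC):
    if el.tag.startswith(XBRLI) or el.tag == LINK + "schemaRef":
        continue                   # contexts/units/schemaRef: done in pass 1
    if el.get("contextRef"):
        ctx_refs.add(el.get("contextRef"))
    if el.get("unitRef"):
        unit_refs.add(el.get("unitRef"))
    facts.append(etree.tostring(el))
    if len(facts) >= BATCH:
        flush(n, facts, ctx_refs, unit_refs)
        n, facts, ctx_refs, unit_refs = n + 1, [], set(), set()
if facts:
    flush(n, facts, ctx_refs, unit_refs)
```

Each output file then contains only the contexts and units that its own batch of facts references, so it should open independently in a processor.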

If you are generating the data yourself, I'd recommend looking at xBRL-CSV. This is a new specification suited to representing large, record-based XBRL datasets in a much more compact form. I believe that there is initial support for it in Arelle.
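To give an idea of what this looks like, an xBRL-CSV report is a small JSON metadata file plus ordinary CSV files holding the facts. The sketch below is only loosely based on the specification, and the taxonomy, concept, and file names are invented; check the xBRL-CSV specification for the exact syntax.

```json
{
  "documentInfo": {
    "documentType": "https://xbrl.org/2021/xbrl-csv",
    "taxonomy": ["example-taxonomy.xsd"]
  },
  "tableTemplates": {
    "revenue_template": {
      "dimensions": {
        "entity": "$entity",
        "period": "$period"
      },
      "columns": {
        "entity": {},
        "period": {},
        "revenue": {
          "dimensions": {
            "concept": "eg:Revenue",
            "unit": "iso4217:USD"
          }
        }
      }
    }
  },
  "tables": {
    "revenue": {
      "template": "revenue_template",
      "url": "revenue.csv"
    }
  }
}
```

The bulk of the data then lives in revenue.csv as plain rows, one record per line, which is far more compact than the equivalent XML and trivially splittable.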

Let me first give a general comment from a database perspective (agnostic to XBRL).

When dealing with large amounts of data, it is common practice in data management to split the input into multiple, smaller files (up to hundreds of MB each) located in the same directory. This is what is typically done for large datasets, with file names carrying increasing integers. It has practical benefits, such as making it much easier to copy the dataset to other locations.

I am not sure, however, whether there is yet a public standard for splitting XBRL instances in this way, even though this would be relatively straightforward for an engine developer to design and implement: just partition the facts and write one partition to each file, together with only the contexts and units in its transitive closure. This is really a matter of standardizing the way it is done.

Very large files (50 GB and more), however, can still generally be read with limited memory (say, 16 GB or even less) for queries that are streaming-friendly, such as filtering, projecting, counting, or converting to another format.
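To make this concrete, here is a minimal sketch of such a streaming query in Python with lxml: it counts facts per element name in a single pass, so its memory use is bounded by the number of distinct names rather than by the file size (the input file name is, again, made up).

```python
# Minimal streaming-aggregation sketch: count facts per (namespaced)
# element name. "huge-instance.xbrl" is an illustrative file name, and
# "has a contextRef attribute" is a crude test for fact-ness.
from collections import Counter

from lxml import etree

counts = Counter()
for _, el in etree.iterparse("huge-instance.xbrl", events=("end",)):
    if el.get("contextRef") is not None:
        counts[el.tag] += 1
    el.clear()                               # free the parsed subtree
    while el.getprevious() is not None:      # and any finished siblings
        del el.getparent()[0]

print(counts.most_common(10))
```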

In the case of XBRL, the trick is to structure the file in such a way that it can be read in a streaming fashion, as pdw mentions. I recommend looking at the following official document by XBRL International [1], which is now a Candidate Recommendation and which explains how to create XBRL instances that can be read in a streaming fashion:

[1] https://specifications.xbrl.org/work-product-index-streaming-extensions-streaming-extensions-1.0.html
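To give a flavor of it, a streamable instance announces itself with a processing instruction near the top of the document, before the root element. The sketch below uses the pseudo-attribute names described in the Candidate Recommendation, but please verify the exact syntax against [1]; the schema reference is invented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hedged sketch of a streamable instance header; the pseudo-attribute
     names come from the Streaming Extensions 1.0 candidate recommendation,
     so double-check them against [1]. The buffer pseudo-attributes roughly
     indicate how many contexts/units a streaming processor must keep
     around at any point in time. -->
<?xbrl-streamable-instance version="1.0"
                           contextBuffer="1" unitBuffer="1"?>
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:link="http://www.xbrl.org/2003/linkbase"
            xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:schemaRef xlink:type="simple" xlink:href="example-taxonomy.xsd"/>
  <!-- contexts and units appear before, and close to, the facts that
       reference them, so the processor can discard them as it moves on -->
</xbrli:xbrl>
```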

If the engine supports this, there is no theoretical limit to the size an instance can have, apart from the capacity of your disk and how much intermediate data the query needs to maintain in memory as it streams through (for example, a grouping query aggregating on a count will need to keep track of its keys and associated counts). 50 GB is relatively on the small side compared to what can be done. I would still expect processing to take at least a one- or two-digit number of minutes, depending on the exact use case.

I am not sure whether Arelle supports streaming at this point. Most XBRL processors today materialize the instance in memory, but I expect that there will be some XBRL processors out there that implement the Streaming Extensions.

Finally, I second pdw's point that reducing the size of the input, for example by using the CSV syntax, can help with both speed and memory footprint. It is likely that a 50 GB XBRL instance can be stored in much less than 50 GB of memory with the right format, and tables (CSV) are a pretty good way to do that. Having said that, one should also keep in mind that the syntax used on disk does not have to match the data structures in memory, which any engine is free to design the way it sees fit, as long as the outside behavior is unchanged.
