
Indexing a large XML file

Given a large (74 GB) XML file, I need to read specific XML nodes by a given alphanumeric ID. Reading the file from top to bottom looking for the ID takes too long.

Is there an analogue of an index for XML files, like there is for relational databases? I imagine a small index file where the alphanumeric ID is quick to find and points to the location in the larger file.

Do index files for XML exist? How can they be implemented in C#?

XML databases such as BaseX, eXistDB, or MarkLogic do what you are looking for: they load XML documents into a persistent form on disk and allow fast access to parts of the document by use of indexes.

Some XML databases are optimized for handling many small documents, others are able to handle a small number of large documents, so choose your product carefully (I can't advise you on this), and consider breaking the document up into smaller parts as it is loaded.

If you need to split the large document into lots of small documents, consider a streaming XSLT 3.0 processor such as Saxon-EE. I would expect processing 75 GB to take about an hour, depending, obviously, on the speed of your machine.

No, that is beyond the scope of what XML tries to achieve. If the XML does not change often and you read from it a lot, I would propose rewriting its content into a local SQLite DB once per change and then reading from the database instead. When doing the rewriting, remember that SAX-style XML reading is your friend in the case of huge files like this.
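Here is a minimal sketch of that rewrite in C#, using the streaming XmlReader (the .NET equivalent of SAX-style reading) and the Microsoft.Data.Sqlite package. The file name "huge.xml", the element name "record", and the attribute name "id" are placeholders for whatever your document actually uses:

```csharp
using System.Xml;
using Microsoft.Data.Sqlite;

// Stream the huge XML file once and copy each record into SQLite,
// keyed by its alphanumeric ID, so later lookups hit an indexed table.
using var connection = new SqliteConnection("Data Source=records.db");
connection.Open();

using (var create = connection.CreateCommand())
{
    create.CommandText =
        "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, xml TEXT)";
    create.ExecuteNonQuery();
}

var settings = new XmlReaderSettings { IgnoreWhitespace = true };
using var reader = XmlReader.Create("huge.xml", settings);

using var transaction = connection.BeginTransaction();
using var insert = connection.CreateCommand();
insert.Transaction = transaction;
insert.CommandText = "INSERT OR REPLACE INTO records (id, xml) VALUES ($id, $xml)";
var idParam  = insert.Parameters.Add("$id",  SqliteType.Text);
var xmlParam = insert.Parameters.Add("$xml", SqliteType.Text);

// Forward-only reading keeps memory flat regardless of file size.
while (!reader.EOF)
{
    if (reader.NodeType == XmlNodeType.Element && reader.Name == "record")
    {
        idParam.Value  = reader.GetAttribute("id") ?? "";
        xmlParam.Value = reader.ReadOuterXml();  // also advances past the element, so no extra Read()
        insert.ExecuteNonQuery();
    }
    else
    {
        reader.Read();
    }
}
transaction.Commit();
```

Wrapping all inserts in one transaction matters a lot for speed; afterwards a lookup is just SELECT xml FROM records WHERE id = $id against the primary key.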

Theoretically, you can create a sort-of index by remembering the locations of already discovered IDs and then parsing on your own, but that would be very brittle. XML is not simple enough for you to parse it on your own and hope to stay standard-compliant.
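For completeness, a rough sketch of that brittle byte-offset approach. All names ("huge.xml", "record", "id", "ABC123") are illustrative, and it assumes a UTF-8 file where each record literally starts with <record id="..." (no reordered attributes, entity references, CDATA, or ancestor namespace declarations the fragment would need), which is exactly why it is fragile:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Build the index once, then jump straight to a record by its byte offset.
var index = BuildIndex("huge.xml");
if (index.TryGetValue("ABC123", out long offset))
    Console.WriteLine(ReadRecord("huge.xml", offset));

// Scan the raw bytes for  <record id="  and remember where each match starts.
static Dictionary<string, long> BuildIndex(string path)
{
    var index = new Dictionary<string, long>();
    byte[] marker = Encoding.UTF8.GetBytes("<record id=\"");
    using var stream = new BufferedStream(File.OpenRead(path), 1 << 20);

    long position = 0;   // offset of the byte currently being examined
    int matched = 0;
    int b;
    while ((b = stream.ReadByte()) != -1)
    {
        if (b == marker[matched])
        {
            matched++;
            if (matched == marker.Length)
            {
                long start = position - marker.Length + 1;  // offset of '<'
                var id = new StringBuilder();
                int c;
                while ((c = stream.ReadByte()) != -1 && c != '"')
                {
                    id.Append((char)c);
                    position++;
                }
                position++;  // the closing quote
                index[id.ToString()] = start;
                matched = 0;
            }
        }
        else
        {
            matched = b == marker[0] ? 1 : 0;
        }
        position++;
    }
    return index;
}

// Seek to the recorded offset and parse just that one element as a fragment.
static string ReadRecord(string path, long offset)
{
    using var stream = File.OpenRead(path);
    stream.Seek(offset, SeekOrigin.Begin);
    var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
    using var reader = XmlReader.Create(stream, settings);
    reader.MoveToContent();
    using var subtree = reader.ReadSubtree();   // stops at the record's end tag
    subtree.MoveToContent();
    return subtree.ReadOuterXml();
}
```

The index could be persisted (e.g. as id/offset pairs) so it survives restarts, but it has to be rebuilt whenever the big file changes, and any formatting quirk the scanner does not expect silently breaks it; that fragility is the reason the database approaches above are preferable.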

Of course, I suppose here that you can't do anything with the larger design itself: as others noted, the size of that file suggests that there is an architectural problem.
