Parsing and querying a large XML file with Python

Question

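I am working on a project in which I have to parse and query a relatively large XML file in Python. I am using a dataset of scientific publications, which is available at https://dblp.uni-trier.de/xml/dblp.xml.gz. The dataset contains 7 types of entries: article, inproceedings, proceedings, book, incollection, phdthesis and masterthesis. Entries have the following fields: author, title, year, and either journal or booktitle.

I am looking for the best way to parse the file and then run queries against the dataset. Examples of queries I would like to run:

Retrieve the articles by a specific author
Retrieve the articles whose title contains a certain word
Retrieve the articles that author x and author y both contributed to.
...

Here is a snapshot of one entry in the XML file: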

<article mdate="2020-06-25" key="tr/meltdown/s18" publtype="informal">
<author>Paul Kocher</author>
<author>Daniel Genkin</author>
<author>Daniel Gruss</author>
<author>Werner Haas 0004</author>
<author>Mike Hamburg</author>
<author>Moritz Lipp</author>
<author>Stefan Mangard</author>
<author>Thomas Prescher 0002</author>
<author>Michael Schwarz 0001</author>
<author>Yuval Yarom</author>
<title>Spectre Attacks: Exploiting Speculative Execution.</title>
<journal>meltdownattack.com</journal>
<year>2018</year>
<ee type="oa">https://spectreattack.com/spectre.pdf</ee>
</article>

Does anyone know how to do this efficiently?

I have tried ElementTree. However, parsing the file fails with the following error:

xml.etree.ElementTree.ParseError: undefined entity &Ouml;: line 90, column 17

Also, I am not sure whether ElementTree is the most efficient way to query this XML file.

Answer 1

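If the file is large and you want to run multiple queries, you do not want to parse the file and build a tree in memory every time you run a query. You also do not want to write the queries in low-level Python; you need a proper query language.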
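To illustrate the parse-once point, and the ParseError from the question, here is a minimal sketch of a single streaming pass with the standard library. It rests on a few assumptions: that the file declares a DOCTYPE (the error message in the question suggests it does), that named entities such as `&Ouml;` follow the HTML entity names in `html.entities`, and that every entry is a direct child of the root element. `iter_records` and `RECORD_TAGS` are names made up for this example.

```python
import html.entities
import io
import xml.etree.ElementTree as ET

# Entry types as listed in the question. "mastersthesis" is included as well,
# on the assumption that the DTD may spell it that way; check the real file.
RECORD_TAGS = {
    "article", "inproceedings", "proceedings", "book",
    "incollection", "phdthesis", "masterthesis", "mastersthesis",
}


def iter_records(source):
    """Yield one dict per entry while streaming through the file once."""
    parser = ET.XMLParser()
    # Teach the parser the HTML named entities (Ouml, eacute, ...), so that a
    # reference like &Ouml; becomes a character instead of raising ParseError.
    parser.entity.update(
        (name, chr(codepoint))
        for name, codepoint in html.entities.name2codepoint.items()
    )
    context = ET.iterparse(source, events=("start", "end"), parser=parser)
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in RECORD_TAGS:
            title = elem.find("title")
            yield {
                "type": elem.tag,
                "key": elem.get("key"),
                "authors": ["".join(a.itertext()) for a in elem.findall("author")],
                # itertext() keeps text inside nested markup such as <i>...</i>
                "title": "".join(title.itertext()) if title is not None else None,
                "year": elem.findtext("year"),
                "venue": elem.findtext("journal") or elem.findtext("booktitle"),
            }
            root.clear()  # drop finished entries so memory use stays bounded


if __name__ == "__main__":
    # Tiny made-up document, only to show the call; not real DBLP data.
    sample = io.BytesIO(
        b'<?xml version="1.0"?>\n<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
        b'<dblp><article key="example/1"><author>J&Ouml;rg Example</author>'
        b"<title>An Example Title.</title><year>2020</year>"
        b"<journal>Example Journal</journal></article></dblp>"
    )
    for record in iter_records(sample):
        print(record)
```

For the real file, something like `iter_records(gzip.open("dblp.xml.gz", "rb"))` should work without unpacking it first, since `iterparse` accepts any binary file object. The generator only reads the data; where the records end up is a separate choice.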

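You should load the data into an XML database such as BaseX or eXist-db. You can then query it with XQuery. This takes more effort to set up, but it will make your life easier in the long run.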
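To show the "load once, then query with a query language" idea in runnable form, here is a rough sketch that uses SQLite from the Python standard library. To be clear, this is a stand-in: it is not BaseX or eXist-db and it uses SQL rather than XQuery. I picked it only because it runs without installing anything. The table names, column names and helper functions are all made up for this example, and the second sample record is invented for contrast.

```python
import sqlite3


def create_schema(conn):
    # One row per entry, plus one row per (entry, author) pair.
    conn.executescript("""
        CREATE TABLE publication (
            pub_key TEXT PRIMARY KEY,
            type    TEXT,
            title   TEXT,
            year    INTEGER,
            venue   TEXT
        );
        CREATE TABLE authorship (
            pub_key TEXT REFERENCES publication(pub_key),
            author  TEXT
        );
        CREATE INDEX idx_authorship_author ON authorship(author);
    """)


def load(conn, records):
    """Insert an iterable of record dicts in a single transaction."""
    with conn:
        for rec in records:
            conn.execute(
                "INSERT INTO publication VALUES (?, ?, ?, ?, ?)",
                (rec["key"], rec["type"], rec["title"], rec["year"], rec["venue"]),
            )
            conn.executemany(
                "INSERT INTO authorship VALUES (?, ?)",
                [(rec["key"], author) for author in rec["authors"]],
            )


def titles_by_author(conn, author):
    rows = conn.execute(
        "SELECT p.title FROM publication p "
        "JOIN authorship a ON a.pub_key = p.pub_key "
        "WHERE a.author = ? ORDER BY p.title",
        (author,),
    )
    return [row[0] for row in rows]


def titles_containing(conn, word):
    # LIKE matches substrings, not whole words, and ignores ASCII case.
    rows = conn.execute(
        "SELECT title FROM publication WHERE title LIKE ? ORDER BY title",
        (f"%{word}%",),
    )
    return [row[0] for row in rows]


def titles_by_both(conn, author_x, author_y):
    # Join the authorship table twice: once per required author.
    rows = conn.execute(
        "SELECT p.title FROM publication p "
        "JOIN authorship a1 ON a1.pub_key = p.pub_key AND a1.author = ? "
        "JOIN authorship a2 ON a2.pub_key = p.pub_key AND a2.author = ? "
        "ORDER BY p.title",
        (author_x, author_y),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    records = [
        {   # taken from the entry shown in the question (author list shortened)
            "key": "tr/meltdown/s18", "type": "article",
            "title": "Spectre Attacks: Exploiting Speculative Execution.",
            "year": 2018, "venue": "meltdownattack.com",
            "authors": ["Paul Kocher", "Daniel Genkin", "Daniel Gruss"],
        },
        {   # invented record, not real data
            "key": "example/1", "type": "article", "title": "An Example Title.",
            "year": 2020, "venue": "Example Journal",
            "authors": ["Alice Example", "Bob Example"],
        },
    ]
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    load(conn, records)
    print(titles_by_author(conn, "Paul Kocher"))
    print(titles_containing(conn, "spectre"))
    print(titles_by_both(conn, "Paul Kocher", "Daniel Gruss"))
```

The three helpers correspond to the three example queries in the question. With a real XML database you would skip the relational schema entirely and write the equivalent XQuery expressions against the original document.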

Solution 1
0 2022-11-17 11:38:12
