Parsing and querying a large XML file with Python

Question

I am working on a project in which I have to parse and query a relatively large XML file in Python. I am using a dataset of scientific articles, which can be found via this link ( https://dblp.uni-trier.de/xml/dblp.xml.gz ). There are 7 types of entries in the dataset: article, inproceedings, proceedings, book, incollection, phdthesis and mastersthesis. An entry has the following attributes: author, title, year and either journal or booktitle.

I am looking for the best way to parse this file and then run queries on the dataset. Examples of queries that I would like to perform are:

retrieve articles that have a certain author
retrieve articles whose title contains a certain word
retrieve articles to which author x and author y both contributed
...

Here is a snapshot of an entry in the XML file:

<article mdate="2020-06-25" key="tr/meltdown/s18" publtype="informal">
<author>Paul Kocher</author>
<author>Daniel Genkin</author>
<author>Daniel Gruss</author>
<author>Werner Haas 0004</author>
<author>Mike Hamburg</author>
<author>Moritz Lipp</author>
<author>Stefan Mangard</author>
<author>Thomas Prescher 0002</author>
<author>Michael Schwarz 0001</author>
<author>Yuval Yarom</author>
<title>Spectre Attacks: Exploiting Speculative Execution.</title>
<journal>meltdownattack.com</journal>
<year>2018</year>
<ee type="oa">https://spectreattack.com/spectre.pdf</ee>
</article>

Does anybody have an idea of how to do this efficiently?

I have experimented with ElementTree. However, when parsing the file I get the following error:

xml.etree.ElementTree.ParseError: undefined entity &Ouml;: line 90, column 17

Additionally, I am not sure whether ElementTree is the most efficient way to query this XML file.

Answer 1

If the file is large and you want to perform multiple queries, you don't want to parse the file and build a tree in memory every time you run a query. You also don't want to write the queries in low-level Python; you need a proper query language.
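Whatever ends up holding the data, the file still has to be read once. If that single pass is done in Python, one possible approach is to stream it through the standard library's expat-based XMLParser with a custom target, so that only one entry is held in memory at a time. The sketch below is an assumption-laden illustration, not a tested recipe for the full dataset: the EntryTarget class and parse_entries function are names made up here, the second sample entry is invented, and the entity fix relies on the parser's entity mapping, which only takes effect when the document carries a DOCTYPE declaration (the error message in the question suggests it does).

```python
import html.entities
import io
import xml.etree.ElementTree as ET


class EntryTarget:
    """Parser target (hypothetical name): one small dict per entry, no full tree."""

    def __init__(self, on_entry):
        self.on_entry = on_entry  # callback invoked once per finished entry
        self.depth = 0            # 1 = root, 2 = entry, 3 = field of an entry
        self.entry = None
        self.text = None

    def start(self, tag, attrib):
        self.depth += 1
        if self.depth == 2:
            self.entry = {"type": tag, "key": attrib.get("key"), "fields": {}}
        elif self.depth == 3:
            self.text = []
        # Deeper tags (markup nested inside a field) are ignored; their text is kept.

    def data(self, text):
        if self.text is not None:
            self.text.append(text)

    def end(self, tag):
        if self.depth == 3:
            # Fields may repeat (several authors), so every field is a list.
            self.entry["fields"].setdefault(tag, []).append("".join(self.text))
            self.text = None
        elif self.depth == 2:
            self.on_entry(self.entry)
            self.entry = None
        self.depth -= 1

    def close(self):
        return None


def parse_entries(stream, on_entry, chunk_size=1 << 16):
    """Feed a binary stream to the parser chunk by chunk."""
    parser = ET.XMLParser(target=EntryTarget(on_entry))
    # Teach the parser HTML-style named entities such as &Ouml;.
    parser.entity.update(
        (name, chr(cp)) for name, cp in html.entities.name2codepoint.items())
    while chunk := stream.read(chunk_size):
        parser.feed(chunk)
    parser.close()


# Tiny in-memory sample: the first entry is abridged from the question,
# the second is invented to exercise the entity and nested-markup handling.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2020-06-25" key="tr/meltdown/s18" publtype="informal">
<author>Paul Kocher</author>
<author>Daniel Genkin</author>
<title>Spectre Attacks: Exploiting Speculative Execution.</title>
<journal>meltdownattack.com</journal>
<year>2018</year>
</article>
<article key="example/1">
<author>&Ouml;zlem Example</author>
<title>A <i>b</i> c</title>
</article>
</dblp>"""

entries = []
parse_entries(io.BytesIO(SAMPLE), entries.append)
for entry in entries:
    print(entry["key"], entry["fields"].get("author"))
```

For the real file, the stream could come from gzip.open("dblp.xml.gz", "rb"), and the callback would write each entry to whatever store you choose instead of appending to a list.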

You should load the data into an XML database such as BaseX or eXist-db. You can then query it using XQuery. This takes a bit more effort to set up, but will make your life a lot easier in the long run.
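If setting up an XML database is not practical, the same load-once, query-many idea can be illustrated with Python's built-in sqlite3 module. To be clear, this swaps a relational database and SQL in for the XML database and XQuery recommended above, so it is a stand-in rather than the suggested setup. The table and column names, the helper functions, and the second sample entry are all invented for illustration; the first entry is abridged from the snapshot in the question.

```python
import sqlite3

# First entry abridged from the question; the second is a hypothetical entry.
ENTRIES = [
    {"key": "tr/meltdown/s18", "type": "article",
     "title": "Spectre Attacks: Exploiting Speculative Execution.",
     "year": 2018,
     "authors": ["Paul Kocher", "Daniel Genkin", "Daniel Gruss"]},
    {"key": "example/1", "type": "article",
     "title": "A Hypothetical Survey of Cache Attacks.",
     "year": 2019,
     "authors": ["Daniel Gruss", "Jane Doe"]},
]


def load(conn, entries):
    """Create the (assumed) schema and insert the entries once."""
    conn.executescript("""
        CREATE TABLE entry (pub_key TEXT PRIMARY KEY, type TEXT,
                            title TEXT, year INTEGER);
        CREATE TABLE author (pub_key TEXT REFERENCES entry(pub_key), name TEXT);
        CREATE INDEX author_name ON author(name);
    """)
    for e in entries:
        conn.execute("INSERT INTO entry VALUES (?, ?, ?, ?)",
                     (e["key"], e["type"], e["title"], e["year"]))
        conn.executemany("INSERT INTO author VALUES (?, ?)",
                         [(e["key"], name) for name in e["authors"]])
    conn.commit()


def by_author(conn, name):
    """Titles of entries that list the given author."""
    rows = conn.execute(
        "SELECT e.title FROM entry e JOIN author a ON a.pub_key = e.pub_key "
        "WHERE a.name = ? ORDER BY e.pub_key", (name,))
    return [title for (title,) in rows]


def title_contains(conn, word):
    """Titles containing the given text (substring match, not whole-word;
    '%' and '_' in the input act as LIKE wildcards)."""
    rows = conn.execute(
        "SELECT title FROM entry WHERE title LIKE ? ORDER BY pub_key",
        (f"%{word}%",))
    return [title for (title,) in rows]


def by_both_authors(conn, x, y):
    """Titles of entries to which both authors contributed."""
    rows = conn.execute(
        "SELECT e.title FROM entry e "
        "JOIN author a1 ON a1.pub_key = e.pub_key AND a1.name = ? "
        "JOIN author a2 ON a2.pub_key = e.pub_key AND a2.name = ? "
        "ORDER BY e.pub_key", (x, y))
    return [title for (title,) in rows]


conn = sqlite3.connect(":memory:")  # a file path would persist between runs
load(conn, ENTRIES)
print(by_author(conn, "Daniel Gruss"))
print(title_contains(conn, "Spectre"))
print(by_both_authors(conn, "Paul Kocher", "Daniel Gruss"))
```

The three helpers correspond to the three example queries in the question. With a file-backed database the load step runs once, and later sessions only issue queries.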

Solution 1
0 2022-11-17 11:38:12
