Parse and Query Large XML File Using Python

I am working on a project for which I have to parse and query a relatively large XML file in Python. I am using a dataset with data about scientific articles. The dataset can be found via this link ( https://dblp.uni-trier.de/xml/dblp.xml.gz ). There are 7 types of entries in the dataset: article, inproceedings, proceedings, book, incollection, phdthesis and masterthesis. An entry has the following attributes: author, title, year and either journal or booktitle.

I am looking for the best way to parse this and subsequently perform queries on the dataset. Examples of queries that I would like to perform are:

  • retrieve articles that have a certain author
  • retrieve articles if the title contains a certain word
  • retrieve articles to which author x and author y both contributed
  • ...

Here is a snapshot of an entry in the XML file:

<article mdate="2020-06-25" key="tr/meltdown/s18" publtype="informal">
<author>Paul Kocher</author>
<author>Daniel Genkin</author>
<author>Daniel Gruss</author>
<author>Werner Haas 0004</author>
<author>Mike Hamburg</author>
<author>Moritz Lipp</author>
<author>Stefan Mangard</author>
<author>Thomas Prescher 0002</author>
<author>Michael Schwarz 0001</author>
<author>Yuval Yarom</author>
<title>Spectre Attacks: Exploiting Speculative Execution.</title>
<journal>meltdownattack.com</journal>
<year>2018</year>
<ee type="oa">https://spectreattack.com/spectre.pdf</ee>
</article> 

Does anybody have an idea on how to do this efficiently?

I have experimented with using ElementTree. However, when parsing the file I get the following error:

xml.etree.ElementTree.ParseError: undefined entity &Ouml;: line 90, column 17

Additionally, I am not sure if using ElementTree will be the most efficient way of querying this XML file.
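The error message suggests that the parser hit an entity (`&Ouml;`) that is declared in the DTD the file refers to rather than in the file itself, and ElementTree does not load external DTDs. One possible workaround using only the standard library is to pre-fill the parser's `entity` dictionary. The sketch below assumes that the entity names used in the file match the HTML named entities in `html.entities`, and that the file starts with a `DOCTYPE` line pointing at `dblp.dtd`; the sample record and the helper name `make_parser` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Invented sample; assumed to mimic the DOCTYPE line of the real file.
SAMPLE = """<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article key="example/1">
<author>J&ouml;rg M&uuml;ller</author>
<title>An Invented Title.</title>
<year>2018</year>
</article>
</dblp>"""


def make_parser():
    """Return an XMLParser that knows the HTML named entities (hypothetical helper)."""
    parser = ET.XMLParser()
    for name, codepoint in name2codepoint.items():
        # Map e.g. "Ouml" to the character it stands for.
        parser.entity[name] = chr(codepoint)
    return parser


root = ET.fromstring(SAMPLE, parser=make_parser())
author = root.find("article/author").text
print(author)
```

For the real file the same parser object could be passed to `ET.parse(path, parser=make_parser())`, although that still builds the whole tree in memory.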

If the file is large and you want to perform multiple queries, then you don't want to be parsing the file and building a tree in memory every time you run a query. You also don't want to be writing the queries in low-level Python; you need a proper query language.

You should be loading the data into an XML database such as BaseX or eXist-db. You can then query it using XQuery. This will be a bit more effort to set up, but will make your life a lot easier in the long run.
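To illustrate the "load once, query many times" idea without installing anything, here is a rough sketch that swaps in a different technique: it streams the entries with `ET.iterparse` and stores them in SQLite from the standard library, so the queries are written in SQL rather than XQuery. It is not BaseX or eXist-db. The table layout, the function names and the second sample record are assumptions made for the example, and the real file would also need the undefined entities dealt with before it parses.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Entry types as listed in the question; adjust to the tag names in the DTD.
ENTRY_TAGS = {"article", "inproceedings", "proceedings", "book",
              "incollection", "phdthesis", "masterthesis"}

# Shortened copy of the entry from the question plus one invented record.
SAMPLE = b"""<dblp>
<article key="tr/meltdown/s18">
<author>Paul Kocher</author>
<author>Daniel Genkin</author>
<title>Spectre Attacks: Exploiting Speculative Execution.</title>
<journal>meltdownattack.com</journal>
<year>2018</year>
</article>
<book key="example/b1">
<author>Paul Kocher</author>
<title>A Made-Up Book Title.</title>
<year>2020</year>
</book>
</dblp>"""


def load(source, conn):
    """Stream entries from source into two tables (hypothetical schema)."""
    conn.execute("CREATE TABLE entry (entry_key TEXT, kind TEXT, title TEXT, year TEXT)")
    conn.execute("CREATE TABLE author (entry_key TEXT, name TEXT)")
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in ENTRY_TAGS:
            continue  # child elements and the root are handled via their entry
        key = elem.get("key")
        conn.execute("INSERT INTO entry VALUES (?, ?, ?, ?)",
                     (key, elem.tag, elem.findtext("title"), elem.findtext("year")))
        conn.executemany("INSERT INTO author VALUES (?, ?)",
                         [(key, a.text) for a in elem.findall("author")])
        elem.clear()  # release the entry's child nodes once they are stored
    conn.commit()


def by_author(conn, name):
    """Titles of entries that list the given author."""
    sql = ("SELECT e.title FROM entry e JOIN author a ON a.entry_key = e.entry_key "
           "WHERE a.name = ? ORDER BY e.entry_key")
    return [row[0] for row in conn.execute(sql, (name,))]


def title_contains(conn, word):
    """Titles containing the given word (LIKE is case-insensitive for ASCII)."""
    sql = "SELECT title FROM entry WHERE title LIKE ? ORDER BY entry_key"
    return [row[0] for row in conn.execute(sql, (f"%{word}%",))]


def by_both(conn, x, y):
    """Titles of entries to which both author x and author y contributed."""
    sql = ("SELECT title FROM entry "
           "WHERE entry_key IN (SELECT entry_key FROM author WHERE name = ?) "
           "AND entry_key IN (SELECT entry_key FROM author WHERE name = ?) "
           "ORDER BY entry_key")
    return [row[0] for row in conn.execute(sql, (x, y))]


conn = sqlite3.connect(":memory:")
load(io.BytesIO(SAMPLE), conn)
print(by_both(conn, "Paul Kocher", "Daniel Genkin"))
```

With a file path instead of `":memory:"`, the loading step would only need to run once and later queries would go straight to the database. Indexes on `author.name` and `author.entry_key` would probably be needed at DBLP scale.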
