Parsing and querying a large XML file with Python

Question

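I am working on a project in which I have to parse and query a relatively large XML file in Python. I am using a dataset that contains data about scientific articles. The dataset is available at this link ( https://dblp.uni-trier.de/xml/dblp.xml.gz ). It contains 7 types of entries: article, inproceedings, proceedings, book, incollection, phdthesis and mastersthesis. Entries have the following attributes: author, title, year, and either journal or booktitle.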

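I am looking for the best way to parse it and then run queries against the dataset. Examples of queries I would like to run:

Retrieve the articles written by a specific author
Retrieve the articles whose title contains a certain word
Retrieve the articles that author x and author y both contributed to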
...

Here is a snapshot of one entry in the XML file:

<article mdate="2020-06-25" key="tr/meltdown/s18" publtype="informal">
<author>Paul Kocher</author>
<author>Daniel Genkin</author>
<author>Daniel Gruss</author>
<author>Werner Haas 0004</author>
<author>Mike Hamburg</author>
<author>Moritz Lipp</author>
<author>Stefan Mangard</author>
<author>Thomas Prescher 0002</author>
<author>Michael Schwarz 0001</author>
<author>Yuval Yarom</author>
<title>Spectre Attacks: Exploiting Speculative Execution.</title>
<journal>meltdownattack.com</journal>
<year>2018</year>
<ee type="oa">https://spectreattack.com/spectre.pdf</ee>
</article>

Does anyone know how to do this efficiently?

I have tried using ElementTree. However, parsing the file fails with the following error:

xml.etree.ElementTree.ParseError: undefined entity &Ouml;: line 90, column 17

In addition, I am not sure whether ElementTree is the most efficient way to query this XML file.

Answer 1

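If the file is large and you want to run multiple queries, you don't want to parse the file and build a tree in memory every time you run a query. You also don't want to write the queries in low-level Python; you need a proper query language.

You should load the data into an XML database such as BaseX or eXist-db. You can then query it using XQuery. This takes a bit more effort to set up, but it will make your life easier in the long run.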
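
If setting up an XML database is not an option, the same "parse once, query many times" idea can be sketched with the Python standard library alone. Note that this swaps in SQLite and SQL for the XML database and XQuery recommended above; it is not the same approach, only a rough stand-in. The table layout and the names `iter_records`, `load`, `titles_by_author`, `titles_with_word` and `titles_by_coauthors` are made up for illustration. The sketch also assumes that the undefined entities are the usual HTML-style names (such as `&Ouml;`) and that the file declares an external DTD, which the error message in the question suggests. It relies on the `entity` mapping of `XMLParser` and on the `parser` argument of `iterparse`, both long-standing but sparsely documented.

```python
import html.entities
import sqlite3
import xml.etree.ElementTree as ET

# Record element names from the question (spelling as in the dblp DTD).
RECORD_TAGS = {"article", "inproceedings", "proceedings", "book",
               "incollection", "phdthesis", "mastersthesis"}


def _text(elem, tag):
    """Return the full text of the first <tag> child, or None if absent."""
    child = elem.find(tag)
    return "".join(child.itertext()) if child is not None else None


def iter_records(source):
    """Stream through `source` and yield one dict per publication record."""
    parser = ET.XMLParser()
    # Assumption: the unknown entities are HTML-style names (&Ouml; etc.).
    # Registering them avoids the ParseError without loading dblp.dtd.
    parser.entity.update(
        (name, chr(cp)) for name, cp in html.entities.name2codepoint.items())
    events = ET.iterparse(source, events=("start", "end"), parser=parser)
    _, root = next(events)  # the first event is the start of the root element
    for event, elem in events:
        if event == "end" and elem.tag in RECORD_TAGS:
            yield {
                "key": elem.get("key"),
                "kind": elem.tag,
                "title": _text(elem, "title"),
                "year": _text(elem, "year"),
                "venue": _text(elem, "journal") or _text(elem, "booktitle"),
                "authors": ["".join(a.itertext())
                            for a in elem.findall("author")],
            }
            root.clear()  # discard finished records so memory stays bounded


def load(source, db_path=":memory:"):
    """Parse `source` once and store the records in a SQLite database."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE pub (pub_key TEXT, kind TEXT, title TEXT,"
               " year INTEGER, venue TEXT)")
    db.execute("CREATE TABLE author (pub_key TEXT, name TEXT)")
    for rec in iter_records(source):
        db.execute("INSERT INTO pub VALUES (?, ?, ?, ?, ?)",
                   (rec["key"], rec["kind"], rec["title"],
                    rec["year"], rec["venue"]))
        db.executemany("INSERT INTO author VALUES (?, ?)",
                       [(rec["key"], name) for name in rec["authors"]])
    # Build the indexes after the bulk insert, then commit everything.
    db.execute("CREATE INDEX author_name ON author (name)")
    db.execute("CREATE INDEX author_pub ON author (pub_key)")
    db.commit()
    return db


def titles_by_author(db, name):
    """Titles of all records that list `name` as an author."""
    rows = db.execute(
        "SELECT DISTINCT p.title FROM pub p"
        " JOIN author a ON a.pub_key = p.pub_key"
        " WHERE a.name = ? ORDER BY p.title", (name,))
    return [row[0] for row in rows]


def titles_with_word(db, word):
    """Titles containing `word` as a substring (ASCII case-insensitive)."""
    rows = db.execute(
        "SELECT title FROM pub WHERE title LIKE ? ORDER BY title",
        ("%" + word + "%",))
    return [row[0] for row in rows]


def titles_by_coauthors(db, name_x, name_y):
    """Titles of records that list both `name_x` and `name_y` as authors."""
    rows = db.execute(
        "SELECT DISTINCT p.title FROM pub p"
        " JOIN author a1 ON a1.pub_key = p.pub_key"
        " JOIN author a2 ON a2.pub_key = p.pub_key"
        " WHERE a1.name = ? AND a2.name = ? ORDER BY p.title",
        (name_x, name_y))
    return [row[0] for row in rows]
```

With the compressed download, something like `db = load(gzip.open("dblp.xml.gz", "rb"), "dblp.sqlite")` should build the database once; later runs can open `dblp.sqlite` directly with `sqlite3.connect` and skip parsing. Note that `LIKE` does substring matching, so a real "contains the word" query may need SQLite's full-text search or extra filtering.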
