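How to iterate through an XML file efficiently in Python with spaCy PhraseMatcher
I am trying to iterate over a Japanese-to-English dictionary that is stored as an XML file. I don't need all of it; I only need to select a part of speech and pick out every entry that carries that part-of-speech tag: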
<pos>&n;</pos>
<pos>&vs;</pos>
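More details on the XML type declarations
Now I would like to know the best way to iterate over all entries with a given POS. This may change, but I am only interested in extracting certain parts, probably these: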
<k_ele>
  <keb>収集</keb>
  <ke_pri>ichi1</ke_pri>
  <ke_pri>news1</ke_pri>
  <ke_pri>nf05</ke_pri>
</k_ele>
<k_ele>
  <keb>蒐集</keb>
</k_ele>
<k_ele>
  <keb>拾集</keb>
</k_ele>
<k_ele>
  <keb>収輯</keb>
</k_ele>
Some pseudocode:
For all Ichidan verbs in the XML file:
    ruler.add_patterns([{"label": "ICHIDANVERB", "pattern": x.text} for x in word.kanji_forms])
    ruler.add_patterns([{"label": "ICHIDANVERB", "pattern": x.text} for x in word.kana_forms])
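Perhaps with an option to ignore okurigana.
What is the most efficient way to do this? There are tens of thousands of entries. Thank you very much.
Edit: the suggested solution: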
import xml.etree.ElementTree as ET

# `ruler` is assumed to be an existing spaCy EntityRuler,
# e.g. created with ruler = nlp.add_pipe("entity_ruler")
path = r"C:\Users\NameRedacted\Desktop\JMdict"
tree = ET.parse(path)

print("Search the entire tree for entries with '&n;' pos")
# "noun (common) (futsuumeishi)" must be used instead of the entity version "&n;" as defined in the DTD
for entry in tree.findall("./entry/sense/[pos='noun (common) (futsuumeishi)']/.."):
    for k_ele in entry.findall("./k_ele"):
        for keb in k_ele.findall("./keb"):
            # Use the text of every keb (kanji form), not the Element itself
            print(keb.text)
            ruler.add_patterns([{"label": "NOUNS", "pattern": keb.text}])
    for r_ele in entry.findall("./r_ele"):
        # Search r_ele here, not k_ele: reb elements only occur inside r_ele
        for reb in r_ele.findall("./reb"):
            ruler.add_patterns([{"label": "NOUNS", "pattern": reb.text}])
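The simplest approach is to parse the XML file into an in-memory tree and use XPath to find the elements you need. This requires enough memory to hold the whole tree, but once it is parsed you can query it as many times as you like.
Example: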
import xml.etree.ElementTree as ET

tree = ET.parse('JMdict_e')

print("Search the entire tree for entries with '&n;' pos")
# "noun (common) (futsuumeishi)" must be used instead of the entity version "&n;" as defined in the DTD
for entry in tree.findall("./entry/sense/[pos='noun (common) (futsuumeishi)']/.."):
    # Do something with every entry
    for k_ele in entry.findall("./k_ele"):
        # Do something with every k_ele of the entry
        for keb in k_ele.findall("./keb"):
            # Do something with every keb of the k_ele
            pass
        for ke_pri in k_ele.findall("./ke_pri"):
            # Do something with every ke_pri of the k_ele
            pass

# Delete the tree when no longer needed to release the memory
del tree
The documentation for xml.etree.ElementTree lists the supported XPath syntax.
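To try the query without downloading the full dictionary, here is a minimal sketch that runs the same XPath against a tiny inline document. The sample XML, the `collect_patterns` helper and the "Ichidan verb" entity text are illustrative assumptions, not part of JMdict or spaCy. It also shows why the expanded entity text has to be used in the predicate: the parser replaces `&n;` using the internal DTD before the tree is built.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for JMdict (assumed sample data). The internal DTD declares
# the POS entities, so the parser expands &n; and &v1; to their full text.
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY v1 "Ichidan verb">
]>
<JMdict>
<entry>
<k_ele><keb>収集</keb></k_ele>
<k_ele><keb>蒐集</keb></k_ele>
<r_ele><reb>しゅうしゅう</reb></r_ele>
<sense><pos>&n;</pos><gloss>collecting</gloss></sense>
</entry>
<entry>
<k_ele><keb>食べる</keb></k_ele>
<r_ele><reb>たべる</reb></r_ele>
<sense><pos>&v1;</pos><gloss>to eat</gloss></sense>
</entry>
</JMdict>
"""


def collect_patterns(root, pos_text, label):
    """Build EntityRuler-style pattern dicts for every entry with the given POS.

    Hypothetical helper: pos_text must be the expanded entity text.
    """
    patterns = []
    for entry in root.findall(f"./entry/sense/[pos='{pos_text}']/.."):
        # Kanji forms first, then kana readings
        for form in entry.findall("./k_ele/keb") + entry.findall("./r_ele/reb"):
            patterns.append({"label": label, "pattern": form.text})
    return patterns


root = ET.fromstring(SAMPLE)
noun_patterns = collect_patterns(root, "noun (common) (futsuumeishi)", "NOUNS")
print(noun_patterns)
```

Collecting the dicts into one list means they can be handed to `ruler.add_patterns(noun_patterns)` in a single call instead of one call per form, which is likely to be cheaper for tens of thousands of entries.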
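See the demo in this Colab notebook. In that test the XML file was 51 MB (English translations only), and memory usage grew by about 500 MB once the file had been parsed into an in-memory tree. Parsing the tree took about 4 seconds and querying it took about 3 seconds.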
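If the memory cost is a problem and a single pass over the file is enough, a streaming parse is one possible alternative. The sketch below is not the approach shown above: it uses the standard-library `ET.iterparse` and clears each entry after inspecting it. The `iter_forms` name and the inline sample are assumptions for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# Assumed sample data standing in for the real dictionary file
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY v1 "Ichidan verb">
]>
<JMdict>
<entry>
<k_ele><keb>収集</keb></k_ele>
<r_ele><reb>しゅうしゅう</reb></r_ele>
<sense><pos>&n;</pos></sense>
</entry>
<entry>
<k_ele><keb>食べる</keb></k_ele>
<r_ele><reb>たべる</reb></r_ele>
<sense><pos>&v1;</pos></sense>
</entry>
</JMdict>
"""


def iter_forms(source, pos_text):
    """Yield keb and reb strings of entries with the given POS, one entry at a time.

    Hypothetical helper: source is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entry":
            continue
        # A complete entry has just been parsed; check its POS tags
        if any(pos.text == pos_text for pos in elem.iterfind("./sense/pos")):
            for form in elem.iterfind("./k_ele/keb"):
                yield form.text
            for form in elem.iterfind("./r_ele/reb"):
                yield form.text
        # Drop the entry's children; the emptied element stays attached to the root
        elem.clear()


verbs = list(iter_forms(io.BytesIO(SAMPLE.encode("utf-8")), "Ichidan verb"))
print(verbs)
```

The trade-off is that each new query means reading the file again, whereas the in-memory tree can be queried repeatedly once it has been built.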