Read multiple XML files from a folder using ElementTree

I am very new to coding in Python, and there is an issue I have been trying to solve for some hours:

I have 1600+ XML files (0000.xml, 0001.xml, etc.) that need to be parsed for a text mining project.
But an error occurs when I run the following code:

from os import listdir, path 
import xml.etree.ElementTree as ET

mypath = '../project/content' 
files = [f for f in listdir(mypath) if f.endswith('.xml')]

for file in files:    
    tree = ET.parse("../project/content/"+file)
    root = tree.getroot()

The error message is the following:

Traceback (most recent call last):

  File "/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-13-cdc3ee6c3989>", line 6, in <module>
    tree = ET.parse("../project/content/"+file)

  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)

  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)

  File "<string>", line unknown
ParseError: no element found: line 1, column 0

Where did I make a mistake?

Also, I only want to extract the text from one element of each XML file. Is it sufficient to simply add the following line to the code? And how can I save each result to a txt file?

maintext = root.find("mainText").text

Thank you very much!

The right way to build file paths is with path.join, as in the code at the end of this answer.

Add print calls to the code before you create the tree, so you can see which file is being parsed when the error occurs.

Is the XML you are trying to parse valid?

Once you solve the parsing issue, you can use multiprocessing to parse many files at the same time.
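As a rough illustration of the multiprocessing idea, here is a minimal sketch using multiprocessing.Pool from the standard library. The names parse_one and parse_many are made up for this example, and returning only the root tag is an assumption; in practice the worker would return whatever you need from each file.

```python
from multiprocessing import Pool
import xml.etree.ElementTree as ET


def parse_one(file_path):
    """Return (file_path, root tag), or (file_path, None) if the file does not parse."""
    try:
        return file_path, ET.parse(file_path).getroot().tag
    except ET.ParseError:
        # Empty or malformed files are reported as None instead of raising.
        return file_path, None


def parse_many(file_paths, workers=4):
    """Parse the files in a pool of worker processes; results keep the input order."""
    with Pool(processes=workers) as pool:
        return pool.map(parse_one, file_paths)
```

Call parse_many(files) from under an `if __name__ == "__main__":` guard, because worker processes may re-import the main module on some platforms. For 1600 small files the speedup may be modest, so it is worth timing the plain loop first.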

from os import listdir, path
import xml.etree.ElementTree as ET

mypath = '../project/content'
files = [path.join(mypath, f) for f in listdir(mypath) if f.endswith('.xml')]

for file in files:
    print(file)
    tree = ET.parse(file)
    root = tree.getroot()
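The message "no element found: line 1, column 0" is what ElementTree typically reports for an empty file, so it may be that one of the 1600 files has no content. The sketch below assumes that is the cause and simply skips files that do not parse. It also covers the second part of the question: it assumes the element is a direct child of the root named mainText, as in the question, and writes its text to a .txt file with the same base name. The function name extract_texts and the output folder are made up for illustration.

```python
from os import listdir, makedirs, path
import xml.etree.ElementTree as ET


def extract_texts(src_dir, out_dir, tag="mainText"):
    """Write the text of `tag` from each .xml file in src_dir to a .txt file in out_dir.

    Returns the list of files that could not be parsed.
    """
    makedirs(out_dir, exist_ok=True)
    failed = []
    for name in sorted(listdir(src_dir)):
        if not name.endswith(".xml"):
            continue
        src = path.join(src_dir, name)
        try:
            root = ET.parse(src).getroot()
        except ET.ParseError as err:
            # Empty or malformed files end up here instead of stopping the loop.
            print(f"skipping {src}: {err}")
            failed.append(src)
            continue
        node = root.find(tag)  # find() only looks at direct children of the root
        text = node.text if node is not None and node.text else ""
        out_name = path.splitext(name)[0] + ".txt"
        with open(path.join(out_dir, out_name), "w", encoding="utf-8") as fh:
            fh.write(text)
    return failed
```

For example, `failed = extract_texts('../project/content', '../project/text')` would leave one .txt file per parsed XML file and return the paths worth inspecting by hand. If mainText is nested deeper in the tree, `root.find('.//mainText')` searches all descendants instead.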

I have a similar problem: I am trying to process multiple XML files in one go and need to store the processed output in a JSON file. I can process the files, but I can't store everything in the JSON file; it processes just one file and stores that to JSON. It looks like the ElementTree element is not iterable? Any assistance would be appreciated.
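Without the code it is hard to say why only one file ends up in the JSON. One common cause is opening the output file in write mode inside the loop, so each iteration overwrites the previous one. Note also that an ElementTree object is not iterable itself; iterate over tree.getroot() or tree.iter() instead. A minimal sketch, assuming each file can be reduced to a dictionary of the root's direct children (element_to_dict and xml_folder_to_json are hypothetical names, and the record layout is an assumption):

```python
import json
from os import listdir, path
import xml.etree.ElementTree as ET


def element_to_dict(root):
    """Map each direct child tag of root to its text (a repeated tag keeps the last value)."""
    return {child.tag: (child.text or "") for child in root}


def xml_folder_to_json(src_dir, json_path):
    """Collect one record per XML file, then write them all in a single dump."""
    records = []
    for name in sorted(listdir(src_dir)):
        if not name.endswith(".xml"):
            continue
        try:
            root = ET.parse(path.join(src_dir, name)).getroot()
        except ET.ParseError:
            continue  # skip empty or malformed files
        records.append({"file": name, "fields": element_to_dict(root)})
    # Written once, after the loop, so earlier records are not overwritten.
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return records
```

The key design choice is accumulating results in a list and calling json.dump a single time, which produces one valid JSON array rather than a file that only holds the last record.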
