
Read multiple XML files from a folder using ElementTree

I am very new to coding in Python, and I have been trying to solve an issue for several hours:

I have 1600+ XML files (0000.xml, 0001.xml, etc.) that need to be parsed for a text mining project.
However, the following code raises an error:

from os import listdir, path 
import xml.etree.ElementTree as ET

mypath = '../project/content' 
files = [f for f in listdir(mypath) if f.endswith('.xml')]

for file in files:    
    tree = ET.parse("../project/content/"+file)
    root = tree.getroot()

The error message is the following:

Traceback (most recent call last):

  File "/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-13-cdc3ee6c3989>", line 6, in <module>
    tree = ET.parse("../project/content/"+file)

  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)

  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)

  File "<string>", line unknown

ParseError: no element found: line 1, column 0

Where did I make a mistake?

Also, I only want to extract the text of one element from each XML file. Is it sufficient to add the line below to the loop? And how can I save each result to a .txt file?

maintext = root.find("mainText").text

Thank you very much!

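Is the XML you are trying to parse valid? "no element found: line 1, column 0" usually means the parser reached the end of the input without finding any element, which is what happens with an empty file, for example.

Add print messages to the code before you create the tree, so you can see which file fails.

Once you solve the parsing issue, you can use multiprocessing to parse many files at the same time.

The right way to build paths is with os.path.join: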

from os import listdir, path
import xml.etree.ElementTree as ET

mypath = '../project/content'
files = [path.join(mypath, f) for f in listdir(mypath) if f.endswith('.xml')]

for file in files:
    print(file)
    tree = ET.parse(file)
    root = tree.getroot()
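
To cover the second half of the question, here is a minimal sketch that skips files ElementTree cannot parse, pulls the text of the mainText element, and writes it to a .txt file with the same base name. The function name extract_main_text and the output folder are hypothetical, not part of the original code. Note that root.find("mainText") only looks at direct children of the root; if the element is nested deeper, a path such as ".//mainText" would be needed instead.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def extract_main_text(src_dir, out_dir, tag="mainText"):
    """Write the text of `tag` from each XML file in src_dir to out_dir/<name>.txt.

    Hypothetical helper. Returns (written, skipped) as lists of file names.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, skipped = [], []

    for xml_path in sorted(src_dir.glob("*.xml")):
        try:
            root = ET.parse(str(xml_path)).getroot()
        except ET.ParseError as err:
            # Empty or malformed file: report it and move on.
            print(f"skipping {xml_path.name}: {err}")
            skipped.append(xml_path.name)
            continue

        node = root.find(tag)  # direct children of the root only
        if node is None or node.text is None:
            # The element is missing or has no text of its own.
            skipped.append(xml_path.name)
            continue

        out_path = out_dir / (xml_path.stem + ".txt")
        out_path.write_text(node.text, encoding="utf-8")
        written.append(xml_path.name)

    return written, skipped
```

It could be called as written, skipped = extract_main_text("../project/content", "../project/text"), where the second path is an assumed output location. The skipped list then shows which files need a closer look.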

I have a similar problem: I am trying to process multiple XML files in one go and need to store the processed results in a JSON file. I can process the files, but I can't store everything in the JSON file. It processes only one file and stores that in the JSON. It looks like an ElementTree element is not iterable? Any assistance would be appreciated.
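
Without seeing the code this is only a guess, but a common cause of "only one file ends up in the JSON" is opening the output file in write mode inside the loop, so each iteration overwrites the previous one. An Element is iterable over its children, but it is not JSON-serializable, so it has to be converted to plain data first. The sketch below assumes the goal is a single JSON file with one record per XML file. The helpers element_to_dict and xml_folder_to_json are hypothetical, and the dictionary layout (tag, attrib, text, children) is an illustrative choice rather than any standard.

```python
import json
from pathlib import Path
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an Element into JSON-serializable data (illustrative layout)."""
    return {
        "tag": elem.tag,
        "attrib": dict(elem.attrib),
        "text": (elem.text or "").strip(),
        "children": [element_to_dict(child) for child in elem],
    }


def xml_folder_to_json(src_dir, json_path):
    """Parse every *.xml in src_dir and write all results to one JSON file."""
    records = []
    for xml_path in sorted(Path(src_dir).glob("*.xml")):
        try:
            root = ET.parse(str(xml_path)).getroot()
        except ET.ParseError as err:
            print(f"skipping {xml_path.name}: {err}")
            continue
        records.append({"file": xml_path.name, "root": element_to_dict(root)})

    # Dump once, after the loop, so earlier files are not overwritten.
    with open(str(json_path), "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return records
```

The key design point is that the list is built inside the loop and the file is written exactly once afterwards.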
