Parsing all XML files in a directory with Python

Question

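Hi, I'm trying to parse all the XML files in a given directory using Python. I can parse one file at a time, but with this many files that is "impossible" for me. In other words, it works when I predefine the tree and root for a single file, but I can't get the code to work when I try to run it over all the files.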

Here is what I have so far:

import xml.etree.ElementTree as ET
import os
directory = "C:/Users/danie/Desktop/NLP/blogs/"

def clean_dir(directory):
    path = os.listdir(directory)
    print(path) 
    for filename in path:
        tree = ET.parse(filename)
        root = tree.getroot()
        doc_parser(root)


post_list = []
def doc_parser(root):
    for child in root.findall('post'):
        post_list.append(child.text)

clean_dir(directory)
print(post_list[0])

The error I get is as follows:

  File "D:\Anaconda\envs\Deep Learning New\lib\site-packages\IPython\core\interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-91-fce6b0119ea7>", line 1, in <module>
    runfile('C:/Users/danie/Desktop/NLP/blogs/Parser_Tes.py', wdir='C:/Users/danie/Desktop/NLP/blogs')

  File "D:\Anaconda\envs\Deep Learning New\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "D:\Anaconda\envs\Deep Learning New\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/danie/Desktop/NLP/blogs/Parser_Tes.py", line 19, in <module>
    clean_dir(directory)

  File "C:/Users/danie/Desktop/NLP/blogs/Parser_Tes.py", line 9, in clean_dir
    tree = ET.parse(filename)

  File "D:\Anaconda\envs\Deep Learning New\lib\xml\etree\ElementTree.py", line 1196, in parse
    tree.parse(source, parser)

  File "D:\Anaconda\envs\Deep Learning New\lib\xml\etree\ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)

  File "<string>", line unknown
ParseError: not well-formed (invalid token): line 103, column 225

When I print path, all the correct file names are printed. Some of them are:

['1000331.female.37.indUnk.Leo.xml', '1000866.female.17.Student.Libra.xml', '1004904.male.23.Arts.Capricorn.xml', '1005076.female.25.Arts.Cancer.xml', '1005545.male.25.Engineering.Sagittarius.xml', '1007188.male.48.Religion.Libra.xml', '100812.female.26.Architecture.Aries.xml', '1008329.female.16.Student.Pisces.xml', '1009572.male.25.indUnk.Cancer.xml', '1011153.female.27.Technology.Virgo.xml', '1011289.female.25.indUnk.Libra.xml', '1011311.female.17.indUnk.Scorpio.xml', '1013637.male.17.RealEstate.Virgo.xml', '1015252.female.23.indUnk.Pisces.xml', '1015556.male.34.Technology.Virgo.xml', '1016560.male.41.Publishing.Sagittarius.xml', '1016738.male.26.Publishing.Libra.xml', '1016787.female.24.Communications-Media.Leo.xml', '1019224.female.27.RealEstate.Libra.xml', '1019622.female.24.indUnk.Aquarius.xml', '1019710.male.16.Student.Pisces.xml', '1021779.female.25.indUnk.Scorpio.xml', '1022037.male.23.indUnk.Cancer.xml', '1022086.female.17.Student.Cancer.xml', '1024234.female.17.indUnk.Libra.xml', '1025783.female.17.Student.Gemini.xml', '1026164.female.23.Education.Aries.xml', '1026443.female.15.Student.Scorpio.xml', '1028027.female.16.indUnk.Libra.xml', '1028257.male.26.Education.Aries.xml', '1029959.male.17.indUnk.Aries.xml', '1031806.male.17.Technology.Sagittarius.xml', '1032153.male.27.Technology.Pisces.xml', '1032591.female.24.Banking.Aquarius.xml', '1032824.female.15.Student.Libra.xml', '1034874.female.43.Publishing.Capricorn.xml', '1039136.male.24.Student.Capricorn.xml', '1039908.female.16.indUnk.Gemini.xml', '1040084.male.17.indUnk.Taurus.xml', '1042993.male.15.Student.Sagittarius.xml', '1043329.male.23.Government.Pisces.xml', '1043569.male.26.indUnk.Virgo.xml', '1043785.female.26.Biotech.Leo.xml', '1044338.female.23.Student.Leo.xml', '1045289.female.25.Arts.Aquarius.xml', '1045316.male.27.Non-Profit.Capricorn.xml', '1045831.male.23.Student.Libra.xml', '1046946.female.25.Arts.Virgo.xml', '1047241.male.16.indUnk.Aries.xml', '1050060.female.24.Student.Pisces.xml', 
'1051122.female.17.Student.Libra.xml', '1052611.male.23.Student.Aries.xml', '1054833.female.24.indUnk.Scorpio.xml', '1055228.female.16.Student.Cancer.xml', '1056232.female.17.indUnk.Aquarius.xml', '1056581.female.26.indUnk.Leo.xml', ....]

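So I took the advice of @wundermahn and @Kevin and used try...except. This is the output now, i.e. 482 out of 19320 items. The problem now is that when I try to print an element from the list post_list, I get the following error: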

IndexError: list index out of range

Files with errors:

ERROR ON FILE: 669116.female.26.indUnk.Gemini.xml
ERROR ON FILE: 669514.female.27.indUnk.Sagittarius.xml
ERROR ON FILE: 669656.female.23.Advertising.Aries.xml
ERROR ON FILE: 669719.male.26.Science.Taurus.xml
ERROR ON FILE: 669764.female.17.indUnk.Sagittarius.xml
ERROR ON FILE: 670277.female.27.Education.Sagittarius.xml
ERROR ON FILE: 670314.male.24.indUnk.Leo.xml
ERROR ON FILE: 670684.male.24.Student.Libra.xml
ERROR ON FILE: 671748.male.27.Communications-Media.Aries.xml
ERROR ON FILE: 673093.male.27.Construction.Scorpio.xml
ERROR ON FILE: 673235.male.37.Internet.Capricorn.xml
ERROR ON FILE: 67459.male.34.Arts.Capricorn.xml
ERROR ON FILE: 674684.female.23.Religion.Libra.xml

On further inspection, printing post_list shows that for some reason the data is not being appended and the list is empty.

Thanks again!

Answer 1

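@Kevin is correct in his comment: this error means ElementTree could not parse the document. Something in the file is not "real XML"; it may just be a stray non-Unicode character or something similar.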
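To make "not real XML" concrete, here is a minimal sketch of how a single stray character trips the parser. The sample strings and the try_parse helper are made up for illustration and are not taken from your blog files; an unescaped ampersand is just one common culprit, and the actual cause in your data may be something else.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples, not taken from the blog files.
good = "<Blog><post>Tom &amp; Jerry</post></Blog>"
bad = "<Blog><post>Tom & Jerry</post></Blog>"  # a bare '&' is not valid XML


def try_parse(text):
    """Return (root, None) on success or (None, error) on a parse failure."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        return None, err


root, err = try_parse(good)
print(root.find("post").text)

root, err = try_parse(bad)
# err.position is a (line, column) tuple pointing at where parsing stopped.
print("parse failed:", err, "at (line, column)", err.position)
```

The position reported by ParseError is the same line and column pair that appears at the end of your traceback.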

Here is something you can try to help debug:

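import xml.etree.ElementTree as ET
import os
directory = "C:/Users/danie/Desktop/NLP/blogs/"

def clean_dir(directory):
    path = os.listdir(directory)
    print(path)
    for filename in path:
        try:
            # Join with the directory so this works from any working directory
            tree = ET.parse(os.path.join(directory, filename))
            root = tree.getroot()
            doc_parser(root)
        except (ET.ParseError, OSError):
            # Report the file that could not be read or parsed, then carry on
            print("ERROR ON FILE: {}".format(filename))


post_list = []
def doc_parser(root):
    # Collect the text of every <post> element directly under the root
    for child in root.findall('post'):
        post_list.append(child.text)

clean_dir(directory)
print(post_list[0])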

With the try...except statement in place, every file is attempted, and if one fails, the name of the file that caused the error is printed.
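If you want more than the file name, the sketch below records where each file failed so you can look at the offending line. It is only an illustration under some assumptions: find_bad_files and read_line are hypothetical helper names, the .xml filter assumes every file of interest has that extension, and the lenient UTF-8 decoding is only for display because I don't know the real encoding of your files.

```python
import os
import xml.etree.ElementTree as ET


def find_bad_files(directory):
    """Return a list of (filename, line, column, message) for unparsable files."""
    failures = []
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith(".xml"):
            continue  # skip the script itself and any other non-XML entries
        try:
            ET.parse(os.path.join(directory, filename))
        except ET.ParseError as err:
            line, column = err.position  # where the parser gave up
            failures.append((filename, line, column, str(err)))
    return failures


def read_line(path, line):
    """Return the raw text of the given 1-based line, decoded leniently."""
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            if number == line:
                return raw.decode("utf-8", errors="replace").rstrip("\r\n")
    return None
```

Calling find_bad_files(directory) and then read_line on one of the reported files should show the text around the invalid token, which usually makes it obvious what kind of character is breaking the parse.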

I don't have any data to test this against, but it should keep a single malformed file from stopping the whole run.
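On the follow-up IndexError: print(post_list[0]) fails whenever nothing was appended. One possible cause, and this is an assumption because I can't see the files, is that the post elements are not direct children of the root. findall('post') only looks at direct children, whereas iter('post') walks the whole tree. A sketch with a made-up document:

```python
import xml.etree.ElementTree as ET

# Made-up document: each <post> sits one level below the root.
sample = (
    "<Blog>"
    "<entry><post> first </post></entry>"
    "<entry><post>second</post></entry>"
    "</Blog>"
)
root = ET.fromstring(sample)

# findall('post') matches direct children of root only.
direct = [p.text for p in root.findall("post")]
# iter('post') matches <post> elements at any depth.
anywhere = [p.text.strip() for p in root.iter("post") if p.text]

print(direct)    # []
print(anywhere)  # ['first', 'second']

# Guard the lookup so an empty list does not raise IndexError.
print(anywhere[0] if anywhere else "no posts collected")
```

If your files do have post directly under the root, this is not the cause, and it would be worth printing root.tag and the tags of its children for one file that parses cleanly.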

Solution 1
0 2019-12-16 17:02:56
