Why does my Python XML parser break after the first file?

Question

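I'm working on a Python 3 XML parser that should extract the text content of specific nodes from every XML file in a folder. The script should then write the collected data to tab-delimited text files. So far, all of the functions seem to work. The script returns all the information I want from the first file, but it always breaks, I believe, when it starts parsing the second file.

When it breaks, it returns "TypeError: 'str' object is not callable". I've checked the second file, and when I remove the first file from the folder, the functions work on it just as well as they did on the first. I'm new to Python and XML. Any suggestions, help, or useful links would be much appreciated. Thanks!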

import xml.etree.ElementTree as ET
import re
import glob
import csv
import sys

content_file = open('WWP Project/WWP_texts.txt','wt')
quotes_file = open('WWP Project/WWP_quotes.txt', 'wt')
list_of_files = glob.glob("../../../Documents/WWPtextbase/distribution/*.xml")

ns = {'wwp':'http://www.wwp.northeastern.edu/ns/textbase'}

def content(tree):
    lines = ''.join(ET.tostring(tree.getroot(),encoding='unicode',method='text')).replace('\n',' ').replace('\t',' ').strip()
    clean_lines = re.sub(' +',' ', lines)
    return clean_lines.lower()

def quotes(tree):
    quotes_list = []
    for node in tree.findall('.//wwp:quote', namespaces=ns):
        quote = ET.tostring(node,encoding='unicode',method='text')
        clean_quote = re.sub(' +',' ', quote)
        quotes_list.append(clean_quote)
    return ' '.join(str(v) for v in quotes_list).replace('\t','').replace('\n','').lower()

def pid(tree):
    for node in tree.findall('.//wwp:sourceDesc//wwp:author/wwp:persName[1]', namespaces=ns):
        pid = node.attrib.get('ref')
    return pid.replace('personography.xml#','') # will need to replace 'p:'


def trid(tree): # this function will eventually need to call OT (.//wwp:publicationStmt//wwp:idno)
    for node in tree.findall('.//wwp:sourceDesc',namespaces=ns):
        trid = node.attrib.get('n')
    return trid

content_file.write('pid' + '\t' + 'trid' + '\t' +'text' + '\n')
quotes_file.write('pid' + '\t' + 'trid' + '\t' + 'quotes' + '\n')

for file_name in list_of_files:
    file = open(file_name, 'rt')
    tree = ET.parse(file)
    file.close()
    pid = pid(tree)
    trid = trid(tree)
    content = content(tree)
    quotes = quotes(tree)
    content_file.write(pid + '\t' + trid + '\t' + content + '\n')
    quotes_file.write(pid + '\t' + trid + '\t' + quotes + '\n')

content_file.close()
quotes_file.close()

Answer 1

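You are overwriting your functions with the values they return. On the first pass through the loop, pid = pid(tree) rebinds the name pid from the function to the string it returned, and the same happens to trid, content, and quotes. On the second pass, pid(tree) tries to call that string, which raises the TypeError. Renaming the functions so they no longer share names with the result variables should fix it.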

import xml.etree.ElementTree as ET
import re
import glob
import csv
import sys

content_file = open('WWP Project/WWP_texts.txt','wt')
quotes_file = open('WWP Project/WWP_quotes.txt', 'wt')
list_of_files = glob.glob("../../../Documents/WWPtextbase/distribution/*.xml")

ns = {'wwp':'http://www.wwp.northeastern.edu/ns/textbase'}

def get_content(tree):
    lines = ''.join(ET.tostring(tree.getroot(),encoding='unicode',method='text')).replace('\n',' ').replace('\t',' ').strip()
    clean_lines = re.sub(' +',' ', lines)
    return clean_lines.lower()

def get_quotes(tree):
    quotes_list = []
    for node in tree.findall('.//wwp:quote', namespaces=ns):
        quote = ET.tostring(node,encoding='unicode',method='text')
        clean_quote = re.sub(' +',' ', quote)
        quotes_list.append(clean_quote)
    return ' '.join(str(v) for v in quotes_list).replace('\t','').replace('\n','').lower()

def get_pid(tree):
    for node in tree.findall('.//wwp:sourceDesc//wwp:author/wwp:persName[1]', namespaces=ns):
        pid = node.attrib.get('ref')
    return pid.replace('personography.xml#','') # will need to replace 'p:'


def get_trid(tree): # this function will eventually need to call OT (.//wwp:publicationStmt//wwp:idno)
    for node in tree.findall('.//wwp:sourceDesc',namespaces=ns):
        trid = node.attrib.get('n')
    return trid

content_file.write('pid' + '\t' + 'trid' + '\t' +'text' + '\n')
quotes_file.write('pid' + '\t' + 'trid' + '\t' + 'quotes' + '\n')

for file_name in list_of_files:
    file = open(file_name, 'rt')
    tree = ET.parse(file)
    file.close()
    pid = get_pid(tree)
    trid = get_trid(tree)
    content = get_content(tree)
    quotes = get_quotes(tree)
    content_file.write(pid + '\t' + trid + '\t' + content + '\n')
    quotes_file.write(pid + '\t' + trid + '\t' + quotes + '\n')

content_file.close()
quotes_file.close()
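
To see the mechanism in isolation, here is a minimal sketch that reproduces the same error without any XML. The names get_label, run_shadowed, and run_fixed are hypothetical and exist only for illustration; get_label stands in for functions such as pid() or trid() from the question.

```python
def get_label(value):
    """Hypothetical stand-in for pid(), trid(), etc. from the question."""
    return "item-" + str(value)


def run_shadowed(values):
    """Reuse one name for both the function and its result, as in the question."""
    label = get_label  # 'label' starts out bound to the function
    results = []
    try:
        for value in values:
            # First pass: calls the function, then rebinds 'label' to a str.
            # Second pass: 'label' is a str, so calling it raises TypeError.
            label = label(value)
            results.append(label)
    except TypeError as exc:
        return results, str(exc)
    return results, None


def run_fixed(values):
    """Keep the function name and the result name distinct."""
    results = []
    for value in values:
        label = get_label(value)  # 'get_label' is never rebound
        results.append(label)
    return results


shadowed_results, error = run_shadowed([1, 2])
fixed_results = run_fixed([1, 2])
print(shadowed_results, error)
print(fixed_results)
```

The shadowed version handles the first value and fails on the second, which matches the symptom of the script working for the first file only. Deleting the first file does not help in general: whichever file comes first is the one that succeeds.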

Solution 1
1 Accepted 2016-09-21 23:39:58
