Python XML parsing with minimal exception handling

Question

I am in the process of stripping sensitive data from a couple million XML documents. How can I add a try/except to get around this error, which seems to occur because a couple of the XMLs in the bunch are malformed?

xml.parsers.expat.ExpatError: mismatched tag: line 1, column 28691

#!/usr/bin/python
import sys
from xml.dom import minidom

def getCleanString(word):
        str = ""
        dummy = 0
        for character in word:
                try:
                        character = character.encode('utf-8')
                        str = str + character
                except:
                        dummy += 1
        return str

def parsedelete(content):

        dom = minidom.parseString(content)

        for element in dom.getElementsByTagName('RI_RI51_ChPtIncAcctNumber'):
                parentNode = element.parentNode
                parentNode.removeChild(element)

        return dom.toxml()


for line in sys.stdin:
        if line > 1:
                line = line.strip()
                line = line.split(',', 2)
                if len(line) > 2:
                        partition = line[0]
                        id = line[1]
                        xml = line[2]
                        xml = getCleanString(xml)
                        xml = parsedelete(xml)
                        strng = '%s\t%s\t%s' %(partition, id, xml)
                        sys.stdout.write(strng + '\n')

Answer 1

Catching exceptions is straightforward. Import ExpatError from xml.parsers.expat and wrap the problem code in a try/except handler. Importing the class by name avoids writing xml.parsers.expat.ExpatError inside the function, which would fail in the question's script because its main loop rebinds the name xml to a string.

from xml.parsers.expat import ExpatError

def parsedelete(content):
        try:
                dom = minidom.parseString(content)
        except ExpatError as e:
                # not sure how you want to handle the error... so just passing back as string
                return str(e)

        # remove every matching element from its parent
        for element in dom.getElementsByTagName('RI_RI51_ChPtIncAcctNumber'):
                parentNode = element.parentNode
                parentNode.removeChild(element)

        return dom.toxml()
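
If you would rather skip malformed records than pass the error text downstream, something along these lines might work. This is only a sketch under assumptions: the helper names strip_element and process are made up for illustration, the getCleanString step from the question is left out, and it is written so that it should also run on Python 3. Bad records are reported on stderr and counted instead of being written to the output.

```python
import sys
from xml.dom import minidom
from xml.parsers.expat import ExpatError

TAG = 'RI_RI51_ChPtIncAcctNumber'  # element name taken from the question


def strip_element(content, tag=TAG):
    """Return content with every <tag> element removed, or None if malformed.

    Hypothetical helper, not part of minidom.
    """
    try:
        dom = minidom.parseString(content)
    except ExpatError as e:
        # The message already includes the line and column of the failure.
        sys.stderr.write('skipping malformed XML: %s\n' % e)
        return None
    for element in dom.getElementsByTagName(tag):
        element.parentNode.removeChild(element)
    return dom.toxml()


def process(lines, out=sys.stdout):
    """Write cleaned tab-separated records to out; return (kept, skipped)."""
    kept = skipped = 0
    for line in lines:
        fields = line.strip().split(',', 2)
        if len(fields) < 3:
            continue  # not a partition,id,xml record
        partition, record_id, raw_xml = fields
        cleaned = strip_element(raw_xml)
        if cleaned is None:
            skipped += 1
            continue
        out.write('%s\t%s\t%s\n' % (partition, record_id, cleaned))
        kept += 1
    return kept, skipped
```

Calling process(sys.stdin) at the bottom of the script would then take the place of the original loop, and the returned counts tell you how many of the records had to be dropped.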

Answer 1 posted 2015-02-10 22:43:33
