
UTF-8 to XML Python cleanup function woes

Using code from different sources found here, I put together a cleanup function that removes illegal characters from strings before they are included in an XML file. Some of it is redundant and it is quite messy. None of the functions I found worked on their own. I still get problems when importing the resulting XML file, and punctuation has been removed. Everything is specified as UTF-8. I am not familiar with encodings, so a more elegant function that works would be greatly appreciated. Thanks.

import re

def doclean(s):
    ranges=[(0, 8),(0xb,0x1f),(0x7f,0x84),(0x86,0x9f),(0xd800,0xdfff),(0xfdd0,0xfddf),(0xfffe,0xffff)]
    bad=dict.fromkeys(r for start, end in ranges for r in range(start,end+1))
    s=s.translate(bad)
    s=re.sub(u"[^\x01-\x7f]+",'',s)
    s=re.sub(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F%s]','',s)
    s=s.replace('\x0b','')
    # Take anything out that isn't UTF-8:
    s=s.decode('utf-8','ignore').encode('utf-8','ignore')
    return s

I wouldn't perform any filtering. I'd encode (escape) the special characters instead. That's easier and less error-prone. Here is a fragment:

def encodeForXML(inputData):
    assert isinstance(inputData, str)
    outputData = ""
    for c in inputData:
        # '<' and '&' are the characters that must always be escaped in XML text
        if c == '<':
            outputData += "&lt;"
        elif c == '&':
            outputData += "&amp;"
        else:
            outputData += c
    return outputData
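
The same escaping is available in the standard library, so the loop does not have to be written by hand. The following is a minimal sketch, assuming Python 3; the wrapper name `encode_for_xml` is made up for illustration and is not part of the answer above. Note that `xml.sax.saxutils.escape` also escapes `>`, which the hand-written fragment leaves alone.

```python
from xml.sax.saxutils import escape


def encode_for_xml(text):
    """Escape '&', '<' and '>' so that text can be placed inside an XML element."""
    # Hypothetical helper name; escape() itself is part of the standard library.
    return escape(text)


if __name__ == "__main__":
    print(encode_for_xml("a < b & c > d"))
```

Non-ASCII punctuation such as curly quotes passes through unchanged, because only the three markup characters are replaced.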

Then you can wrap it into "XML":

return "<?xml version=\"1.0\" encoding=\"UTF-8\"?><myRootNode>" + encodeForXML("foo") + "</myRootNode>"
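
As an alternative to concatenating strings, the wrapping step can be left to `xml.etree.ElementTree`, which escapes element text when it serializes. This is only a sketch under the assumption of Python 3; the function name `wrap_in_root` is hypothetical, and `myRootNode` is taken from the line above.

```python
import xml.etree.ElementTree as ET


def wrap_in_root(text, tag="myRootNode"):
    """Build <tag>text</tag> and let ElementTree escape the text."""
    root = ET.Element(tag)
    root.text = text
    # encoding="unicode" returns a str (without an XML declaration)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(wrap_in_root("foo & <bar>"))
```

Parsing the result back with `ET.fromstring` returns the original text, which is a quick way to check that nothing was lost.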

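If I remember correctly, that should do it: the resulting text is well-formed XML, provided the input contains no characters that XML 1.0 forbids outright, such as NUL and most other control characters below 0x20.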

Additional remark: Of course, this example targets Python 3. Python 2 is quite outdated.
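
Escaping cannot rescue characters that XML 1.0 does not allow at all (most control characters, surrogates, U+FFFE and U+FFFF), which is likely what the cleanup function in the question was trying to handle. A minimal sketch of one way to combine both steps, assuming Python 3 and `str` input: remove everything outside the XML 1.0 `Char` range, then escape the markup characters. The names `_ILLEGAL_XML_CHARS` and `clean_for_xml` are made up for illustration.

```python
import re
from xml.sax.saxutils import escape

# Anything NOT matched by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_for_xml(text):
    """Drop characters that XML 1.0 forbids, then escape '&', '<' and '>'."""
    return escape(_ILLEGAL_XML_CHARS.sub("", text))


if __name__ == "__main__":
    print(clean_for_xml("a\x00b\x0bc <d> & \u201cquote\u201d"))
```

Unlike the pattern `[^\x01-\x7f]+` in the question, this keeps non-ASCII text such as curly quotes and accented letters, so punctuation is not stripped. Whether silently dropping the forbidden characters is acceptable, rather than replacing them or raising an error, depends on the data.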


 