
python - How to properly encode string in utf8 which is ISO8859-1 from xml

I'm using the following code in Python 2.7 to retrieve an XML file containing some German umlauts like ä, ü, ö, ß:

.....
def getXML(self,url):
    # urlopen() returns raw bytes; nothing is decoded at this point
    xmlFile=urllib2.urlopen(self.url)
    xmlResponse=xmlFile.read()
    xmlFile.close()
    return xmlResponse

def makeDict(self, xmlFile):
    data = xmltodict.parse(xmlFile)
    return data

def saveJSON(self, dictionary):
    currentDayData=dictionary['speiseplan']['tag'][1]
    # Write the current day's data as JSON; the with block closes the file
    with open('data.json','w') as jsonFile:
        jsonFile.write(json.dumps(currentDayData))
    return True

# Execute
url="path/to/xml/"
App=GetMensaJSON(url)
xml=App.getXML(url)

dictionary=App.makeDict(xml)
App.saveJSON(dictionary)

The problem is that the XML file claims in its <?xml ...?> declaration that it is UTF-8, but it isn't. By trial and error I found out that it is iso8859_1, so I wanted to reconvert from UTF-8 to ISO 8859-1 and back to UTF-8 to resolve the conflicts.

Is there an elegant way to restore the missing umlauts? In my output, for example, I get \Ã\Ÿ instead of ß and \Ã\¼ instead of ü.

I found this similar question, but I can't get it to work: How to parse an XML file with encoding declaration in Python?

Also I should add that I can't influence the way I get the xml.

The XML File Link can be found in the code.

The output from repr(xmlResponse) is:

 "<?xml version='1.0' encoding='utf-8'?>\n<speiseplan>\n<tag timestamp='1453676400'>\n<item language='de'>\n<category>Stammessen</category>\n<title>Gem\xc3\x83\xc2\xbcsebr\xc3\x83\xc2\xbche mit Backerbsen (25,28,31,33,34), paniertes H\xc3\x83\xc2\xa4hnchenbrustschnitzel (25) mit Paprikasauce (33,34), Pommes frites und Gem\xc3\x83\xc2\xbcse

You are trying to encode already encoded data. urllib2.urlopen() can only return a bytestring, not unicode, so encoding it makes little sense.

What happens instead is that Python tries to be helpful: if you insist on encoding bytes, it first decodes them to unicode, using the default codec (ASCII in Python 2).
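As a rough illustration of that implicit step: in Python 2, calling .encode() on a bytestring behaves much like the explicit decode-then-encode spelled out below. The helper name implicit_encode and the sample bytes are made up for this sketch; it is not code from the question.

```python
def implicit_encode(raw_bytes, codec="utf-8"):
    """Spell out what Python 2 effectively does for raw_bytes.encode(codec)."""
    # Step 1: the implicit decode with the default codec (ASCII in Python 2).
    text = raw_bytes.decode("ascii")
    # Step 2: the encode that was actually requested.
    return text.encode(codec)

# Pure ASCII bytes survive the round trip unchanged.
ascii_result = implicit_encode(b"<title>Pommes frites</title>")

# Bytes holding a UTF-8 encoded u-umlaut fail in step 1, before any encoding happens.
try:
    implicit_encode(b"Gem\xc3\xbcse")
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
```

This is why an attempt to "encode" the response tends to end in a UnicodeDecodeError, even though no decode was asked for.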

On top of that, XML documents are themselves responsible for declaring which codec should be used to decode them, and the default is UTF-8. Don't manually re-code the data yourself; leave that to an XML parser.
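A small sketch of what leaving the decoding to the parser looks like. It uses the standard library's xml.etree.ElementTree instead of xmltodict, and two made-up one-element documents: the same text serialised once as UTF-8 and once as ISO-8859-1, each with a truthful declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: identical text, different declared encodings.
utf8_doc = b"<?xml version='1.0' encoding='utf-8'?><title>Gem\xc3\xbcse</title>"
latin1_doc = b"<?xml version='1.0' encoding='iso-8859-1'?><title>Gem\xfcse</title>"

# Pass the raw bytes straight to the parser; it picks the codec from the declaration.
utf8_title = ET.fromstring(utf8_doc).text
latin1_title = ET.fromstring(latin1_doc).text
```

Both titles come out as the same unicode string. This only helps when the declaration is truthful; double-encoded content like the response in the question is decoded exactly as declared and still comes out as Mojibake.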

If you have Mojibake data in your XML document, the best way to fix that is to do so after parsing. I recommend the ftfy package to do this for you.
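To sketch what fixing after parsing could look like, the example below walks a parsed structure of nested dicts and lists and repairs every string value. It deliberately avoids third-party dependencies: a plain Latin-1 round trip stands in for ftfy, which is an assumption that only holds if the text was misread as Latin-1. The names repair_text and repair_tree and the sample dictionary are hypothetical.

```python
TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3

def repair_text(value):
    """Undo one round of UTF-8 bytes misread as Latin-1; leave other text alone."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way, so assume the text was fine to begin with.
        return value

def repair_tree(node):
    """Recursively apply repair_text to every string in nested dicts and lists."""
    if isinstance(node, dict):
        # type(node) keeps an OrderedDict an OrderedDict.
        return type(node)((key, repair_tree(child)) for key, child in node.items())
    if isinstance(node, list):
        return [repair_tree(child) for child in node]
    if isinstance(node, TEXT_TYPE):
        return repair_text(node)
    return node

# Made-up stand-in for part of the parsed document.
parsed = {u"category": u"Stammessen",
          u"title": u"Gem\u00c3\u00bcse, H\u00c3\u00a4hnchen"}
fixed = repair_tree(parsed)
```

Swapping repair_text for a call into ftfy would keep the same traversal while handling more kinds of damage.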

You could manually 'fix' the encoding by first decoding as UTF-8, then encoding to Latin-1 again:

xmlResponse = xmlFile.read().decode('utf-8').encode('latin-1')
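Applied to a fragment of the bytes shown in the question's repr() output, that round trip works out as follows. This is only a worked example on a short literal, not a replacement for the line above.

```python
# Bytes as they appear in the question's repr(): each umlaut is double-encoded.
garbled = b"Gem\xc3\x83\xc2\xbcsebr\xc3\x83\xc2\xbche"

# decode('utf-8') yields the Mojibake text; encode('latin-1') maps each
# character back to the single byte it was wrongly decoded from.
repaired_bytes = garbled.decode("utf-8").encode("latin-1")

# The result is ordinary UTF-8, which is what the XML declaration promises.
repaired_text = repaired_bytes.decode("utf-8")
```

The four bytes \xc3\x83\xc2\xbc collapse back to \xc3\xbc, the UTF-8 encoding of ü.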

However, this assumes that your data was badly decoded as Latin-1 to begin with, which is not always a safe assumption. If it was decoded as Windows CP 1252, for example, then the best way to recover your data is still to use ftfy.
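A short sketch of why the codec matters, using ß as the test character since its second UTF-8 byte (0x9f) is one of the bytes that Latin-1 and CP 1252 interpret differently. The variable names are illustrative only.

```python
original = u"\u00df"                      # sharp s
utf8_bytes = original.encode("utf-8")     # the two bytes 0xc3 0x9f

# Byte 0x9f differs between the two codecs: a control character in Latin-1,
# but U+0178 (Y with diaeresis) in Windows CP 1252.
as_latin1 = utf8_bytes.decode("latin-1")
as_cp1252 = utf8_bytes.decode("cp1252")

# The Latin-1 round trip recovers only the Latin-1 flavour of the damage.
latin1_ok = as_latin1.encode("latin-1").decode("utf-8") == original
try:
    as_cp1252.encode("latin-1")
    cp1252_via_latin1_ok = True
except UnicodeEncodeError:
    # U+0178 has no Latin-1 byte, so the manual fix cannot even start.
    cp1252_via_latin1_ok = False

# Reversing with the codec that caused the damage does work.
cp1252_ok = as_cp1252.encode("cp1252").decode("utf-8") == original
```

Characters such as ü and ä happen to round-trip under either codec, so a quick test on those alone would not reveal which codec did the damage.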

You could try using ftfy before parsing as XML, but this relies on the document not having used any non-ASCII elements outside of text and attribute content:

xmlResponse = ftfy.fix_text(
    xmlFile.read().decode('utf-8'),
    fix_entities=False, uncurl_quotes=False, fix_latin_ligatures=False).encode('utf-8')
