python - How to properly encode string in utf8 which is ISO8859-1 from xml

I'm using the following code in Python 2.7 to retrieve an XML file containing some German umlauts such as ä, ü, ö, ß:

.....
def getXML(self, url):
    # urlopen() hands back raw bytes, not unicode
    xmlFile = urllib2.urlopen(self.url)
    xmlResponse = xmlFile.read()
    xmlFile.close()
    return xmlResponse

def makeDict(self, xmlFile):
    data = xmltodict.parse(xmlFile)
    return data

def saveJSON(self, dictionary):
    currentDayData = dictionary['speiseplan']['tag'][1]
    # Write the current day as JSON; the with block closes the file
    with open('data.json', 'w') as f:
        f.write(json.dumps(currentDayData))
    return True

# Execute
url = "path/to/xml/"
app = GetMensaJSON(url)
xml = app.getXML(url)

dictionary = app.makeDict(xml)
app.saveJSON(dictionary)

The problem is that the XML file claims in its <?xml ?> declaration that it is utf-8. However, it isn't. By trial and error I found out that it is iso8859_1, so I wanted to convert from utf-8 to iso8859 and back to utf-8 to resolve the conflicts.

Is there an elegant way to fix the broken umlauts? For example, in my output I get \Ã\Ÿ instead of ß and \Ã\¼ instead of ü.

I found this similar question, but I can't get it to work: How to parse an XML file with encoding declaration in Python?

I should also add that I can't influence the way I get the XML.

The XML file link can be found in the code.

The output from repr(xmlResponse) is:

 "<?xml version='1.0' encoding='utf-8'?>\n<speiseplan>\n<tag timestamp='1453676400'>\n<item language='de'>\n<category>Stammessen</category>\n<title>Gem\xc3\x83\xc2\xbcsebr\xc3\x83\xc2\xbche mit Backerbsen (25,28,31,33,34), paniertes H\xc3\x83\xc2\xa4hnchenbrustschnitzel (25) mit Paprikasauce (33,34), Pommes frites und Gem\xc3\x83\xc2\xbcse

You are trying to encode already encoded data. urllib2.urlopen() can only return a bytestring, not unicode, so encoding makes little sense.

What happens instead is that Python is trying to be helpful here; if you insist on encoding bytes, then it'll decode those to unicode data first. And it'll use the default codec for that (ASCII in Python 2).
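A minimal sketch of that implicit step, written out explicitly so that it also runs on Python 3 (where byte strings have no .encode() method at all). It assumes the Python 2 default codec, ASCII, and the sample bytes are made up for illustration:

```python
# Illustrative bytes: "Gemüse" encoded as UTF-8 (not taken from the feed).
raw = b'Gem\xc3\xbcse'

# In Python 2, raw.encode('utf-8') behaves roughly like the two explicit
# steps below: an implicit decode with the default codec (ASCII), then
# the encode that was actually requested.
try:
    raw.decode('ascii').encode('utf-8')
    outcome = 'encoded'
except UnicodeDecodeError:
    # ASCII cannot represent the bytes 0xC3 and 0xBC, so the implicit
    # decode fails before any encoding takes place.
    outcome = 'UnicodeDecodeError'

print(outcome)
```

The error comes from the hidden decode, not from the encode that was asked for, which is why the traceback tends to look confusing.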

On top of that, XML documents are themselves responsible for declaring which codec should be used to decode them. The default codec is UTF-8. Don't manually re-code the data yourself; leave that to an XML parser.
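As a rough illustration of the parser doing this work, the sketch below feeds raw bytes to the standard library's xml.etree.ElementTree. That module stands in for xmltodict here purely to keep the example dependency-free, and both sample documents are invented:

```python
import xml.etree.ElementTree as ET

# Two illustrative documents holding the same word in different byte encodings.
latin1_doc = b"<?xml version='1.0' encoding='iso-8859-1'?><title>Gem\xfcse</title>"
utf8_doc = b"<?xml version='1.0' encoding='utf-8'?><title>Gem\xc3\xbcse</title>"

# The parser reads the declaration and picks the matching codec itself,
# so the caller never decodes or encodes anything by hand.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(latin1_text == utf8_text)
```

Both documents yield the same text as long as the declaration matches the bytes. The trouble in the question arises because the feed's content was already damaged before it was served.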

If you have Mojibake data in your XML document, the best way to fix that is to do so after parsing. I recommend the ftfy package to do this for you.
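One way to picture "fix after parsing" is to walk the parsed structure and repair each string value. The sketch below uses a hand-written Latin-1 round trip instead of ftfy so that it has no dependencies. The helpers repair and repair_tree and the sample dictionary are hypothetical names invented for this example, and ftfy handles far more cases than this does:

```python
def repair(text):
    """Undo one round of UTF-8-read-as-Latin-1 Mojibake, if possible."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way; assume the text was fine and keep it.
        return text


def repair_tree(node):
    """Recursively apply repair() to every string value in dicts and lists."""
    if isinstance(node, dict):
        return {key: repair_tree(value) for key, value in node.items()}
    if isinstance(node, list):
        return [repair_tree(item) for item in node]
    if isinstance(node, type(u'')):
        return repair(node)
    return node


# Hypothetical parsed structure, loosely modelled on the feed in the question.
parsed = {u'item': [{u'category': u'Stammessen', u'title': u'Gem\xc3\xbcse'}]}
fixed = repair_tree(parsed)
print(fixed[u'item'][0][u'title'] == u'Gem\xfcse')
```

Working on the parsed values keeps the repair away from tags and the XML declaration. Note that this sketch returns plain dicts, so any ordering guarantees of the parser's own mapping type are not preserved on older Python versions.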

You could manually 'fix' the encoding by first decoding as UTF-8, then encoding to Latin-1 again:

xmlResponse = xmlFile.read().decode('utf-8').encode('latin-1')
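To see what that round trip does, here it is applied to a fragment of the repr() output from the question. This is only a sketch on a short sample; the variable names are made up:

```python
# Fragment copied from the repr() output in the question.
double_encoded = b'Gem\xc3\x83\xc2\xbcse'

# Decode as UTF-8 to get the Mojibake text, then encode as Latin-1 to
# recover the bytes that were wrongly treated as Latin-1 in the first place.
repaired_bytes = double_encoded.decode('utf-8').encode('latin-1')

# The result is ordinary UTF-8 again.
repaired_text = repaired_bytes.decode('utf-8')
print(repr(repaired_bytes))
```

After the round trip the bytes match the utf-8 declaration in the document, so an XML parser can decode them normally.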

However, this makes the assumption that your data was wrongly decoded as Latin-1 to begin with; this is not always a safe assumption. If it was decoded as Windows CP 1252, for example, then the best way to recover your data is still to use ftfy.
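The difference between the two codecs shows up with ß, whose second UTF-8 byte (0x9F) maps to "Ÿ" in CP 1252 but to a control character in Latin-1. The "Ÿ" mentioned in the question may hint at CP 1252, though that is a guess. A small sketch with an invented sample:

```python
# What "ß" looks like after its UTF-8 bytes were misread as CP1252:
# 0xC3 becomes "Ã" and 0x9F becomes "Ÿ" (U+0178).
mojibake = u'\xc3\u0178'

try:
    mojibake.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    # U+0178 has no Latin-1 byte, so the Latin-1 round trip cannot work here.
    latin1_ok = False

# Reversing the CP1252 misreading recovers the original character.
recovered = mojibake.encode('cp1252').decode('utf-8')
print(latin1_ok)
print(recovered == u'\xdf')
```

Guessing the codec by hand like this gets fragile quickly, which is the argument for letting ftfy work out which misreading took place.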

You could try using ftfy before parsing as XML, but this relies on the document not having used any non-ASCII elements outside of text and attribute content:

xmlResponse = ftfy.fix_text(
    xmlFile.read().decode('utf-8'),
    fix_entities=False, uncurl_quotes=False, fix_latin_ligatures=False).encode('utf-8')
