
Unicode CSV Python


I can't get this right. I have a CSV file that contains already-encoded characters (I made a smaller CSV file to test with, but the original is much longer):

Isten H\\xe1ta M\\xf6g\\xf6tt
Sigur R\\xf3s
\\xd3lafur

I can't get these strings decoded. I tried simply reading the line and calling line.decode('latin1'), but that doesn't seem to work. When I looked at the raw string, I noticed that the characters are escaped with an extra backslash. So I tried applying unicode-escape to the raw string before decoding; that doesn't seem to work either. The string stays the way it is (although the extra backslash is removed in the raw string).

When I hard-code a list with the example items, the decoding works and I get the right characters back.

So it only fails when I read the data in from a CSV file. Does anybody have an idea where it goes wrong?

Characters have different representations in memory and in a file. A string can be encoded in several ways, including latin-1 and utf-8, but in this case, where we see a literal \\xf6, what we have is a string that has been escaped. We can fix that by decoding the escapes:
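To make the distinction concrete, here is a small sketch (written for Python 3, which is an assumption; the snippets in this answer are Python 2) comparing the letter ö encoded as latin-1, encoded as utf-8, and escaped as the four literal characters a file like this one appears to contain:

```python
text = "ö"

latin1 = text.encode("latin-1")   # one byte: 0xf6
utf8 = text.encode("utf-8")       # two bytes: 0xc3 0xb6
escaped = b"\\xf6"                # four ASCII bytes: backslash, 'x', 'f', '6'

print(len(latin1), len(utf8), len(escaped))
```

The escaped form is plain ASCII text that merely describes the byte, which is why decoding it as latin-1 alone leaves it unchanged.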

>>> print open('data.csv').readline().decode('string_escape')
Isten H�ta M�g�tt

But that only gets us halfway; the string is still encoded. Now a double decode:

>>> print open('data.csv').readline().decode('string_escape').decode('latin1')
Isten Háta Mögött

Got it! The problem is in whatever wrote the file.
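The string_escape codec only exists in Python 2. As a rough Python 3 sketch of the same idea, the following assumes the file is pure ASCII and that every field holds literal \xNN sequences (a backslash followed by x and two hex digits); the helper name unescape and the in-memory sample standing in for data.csv are illustrative, not part of the original answer:

```python
import csv
import io


def unescape(field):
    # unicode_escape maps each literal \xNN sequence to code point U+00NN,
    # which is the same character latin-1 assigns to byte 0xNN.
    # encode("ascii") raises if the field contains non-ASCII text,
    # which would break the assumption this sketch relies on.
    return field.encode("ascii").decode("unicode_escape")


# Stand-in for open('data.csv', newline=''); the rows are the question's sample.
sample = io.StringIO("Isten H\\xe1ta M\\xf6g\\xf6tt\nSigur R\\xf3s\n\\xd3lafur\n")

rows = [[unescape(field) for field in row] for row in csv.reader(sample)]
# rows[1][0] is now 'Sigur Rós'
```

Because unicode_escape already yields text rather than bytes, no second latin-1 decode is needed in this version.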

>>> mystring = 'Sigur R\xf3s'
>>> print mystring
Sigur R�s
>>> print mystring.decode('latin-1')
Sigur Rós

This seems to work fine on Python 2.7. Can you show some code and the error it generates?
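One likely reason the hard-coded string behaves differently from the file contents, sketched here with Python 3 bytes objects (an assumption; the session above is Python 2): in source code the parser converts \xf3 into a single byte, whereas text read from a file keeps the backslash as an ordinary character.

```python
# In source code, \xf3 is interpreted by the parser as one byte.
literal = b"Sigur R\xf3s"

# Data read from the file keeps the backslash, 'x', 'f', '3' as four characters.
from_file = b"Sigur R\\xf3s"

print(len(literal), len(from_file))

# latin-1 decoding turns the single byte into the accented letter...
print(literal.decode("latin-1") == "Sigur Rós")
# ...but leaves the escaped form exactly as it was.
print(from_file.decode("latin-1") == "Sigur R\\xf3s")
```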
