Unicode CSV Python
I'm not able to get this right. I have a CSV file that already has encoded characters in it (I made a smaller CSV file to test, but the original is much longer):
Isten H\\xe1ta M\\xf6g\\xf6tt
Sigur R\\xf3s
\\xd3lafur
I can't get these strings to be decoded. I tried decoding them by simply reading each line and calling line.decode('latin1'), but that doesn't seem to work. When I looked at the raw string, I noticed that the characters are escaped with an extra backslash. So I tried applying unicode-escape to the raw string before decoding; this doesn't seem to work either. The string stays the way it is (although the extra backslash is removed in the raw string).
When I hard-code a list with the example items, the decoding works and I get the right characters back. So it only fails when I read the data in from a CSV file. Does anybody have an idea where it goes wrong?
Characters have different representations in memory and in a file. A string can be encoded in several ways, including latin-1 or utf-8, but in this case, where we see a literal \\xf6, what we have is a string that has been escaped. We can fix that by decoding the escapes:
>>> print open('data.csv').readline().decode('string_escape')
Isten H�ta M�g�tt
But that only gets us halfway: the result is still an encoded byte string. Now a double decode:
>>> print open('data.csv').readline().decode('string_escape').decode('latin1')
Isten Háta Mögött
Got it! The problem is in whatever wrote the file.
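To apply the same fix to every row instead of a single line, something like the following may work. It is only a sketch for Python 3, where the string_escape codec used above is not available. It swaps in the unicode_escape codec, which resolves the \xNN escapes and reads any remaining bytes as Latin-1, so the two decode steps collapse into one. The helper name read_escaped_rows is hypothetical, and the code assumes each cell holds ASCII text plus a literal backslash followed by a hex escape.

```python
import csv

def read_escaped_rows(lines):
    """Parse CSV lines and resolve backslash escapes in every cell.

    Assumption: cells contain only characters representable in Latin-1,
    with non-ASCII characters written as literal escapes such as \\xf3.
    """
    rows = []
    for row in csv.reader(lines):
        # Turn the text back into bytes, then let unicode_escape resolve
        # the escapes (unescaped bytes are interpreted as Latin-1).
        rows.append([cell.encode('latin-1').decode('unicode_escape')
                     for cell in row])
    return rows

# Sample lines as they would come out of the file: the doubled backslash
# in these literals stands for ONE literal backslash in the data.
sample = ['Isten H\\xe1ta M\\xf6g\\xf6tt', 'Sigur R\\xf3s', '\\xd3lafur']
for row in read_escaped_rows(sample):
    print(row)
```

With a real file, the same function could be fed an open file object, for example open('data.csv', encoding='latin-1', newline=''), since csv.reader accepts any iterable of strings.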
>>> mystring = 'Sigur R\xf3s'
>>> print mystring
Sigur R�s
>>> print mystring.decode('latin-1')
Sigur Rós
Seems to work fine on Python 2.7. Can you show some code and the error it generates?
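One likely reason the hard-coded string behaves differently from the file: in source code the interpreter resolves \xf3 while parsing the literal, whereas a line read from the file still contains the backslash as an ordinary character. The sketch below illustrates this in Python 3. It assumes the file holds a literal backslash followed by xf3, and the helper name unescape is hypothetical.

```python
# The interpreter resolves \xf3 while parsing this literal, so the string
# already contains the single character ó (9 characters in total).
hard_coded = 'Sigur R\xf3s'

# A raw bytes literal mimics what is read from the file: the backslash,
# 'x', 'f' and '3' are four separate bytes (12 bytes in total).
from_file = rb'Sigur R\xf3s'

def unescape(raw: bytes) -> str:
    """Resolve backslash escapes; remaining bytes are read as Latin-1."""
    return raw.decode('unicode_escape')

print(len(hard_coded), len(from_file))      # 9 12
print(unescape(from_file) == hard_coded)    # True
```

So decoding the hard-coded string succeeds because the escape was already resolved before any decoding took place; the file contents need the extra unescaping step first.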