Unicode CSV Python

Question

I'm not able to get this right. I have a CSV file that already has encoded characters in it (I made a smaller CSV file to test with; the original is much longer):

Isten H\\xe1ta M\\xf6g\\xf6tt
Sigur R\\xf3s
\\xd3lafur

I can't get these strings to decode. I tried simply reading each line and calling line.decode('latin1'), but that doesn't seem to work. When I looked at the raw string, I noticed that the characters are escaped with an extra backslash. So I tried applying unicode-escape to the raw string before decoding; this doesn't seem to work either. The string stays the way it is (although the extra backslash is removed in the raw string).

When I hard-code a list with the example items, the decoding works and I get the right characters back.

So it only fails when I read the data in from a CSV file. Does anybody have an idea where it goes wrong?

Answer 1

Characters have different representations in memory and in a file. A string can be encoded in several ways, including latin-1 or utf-8, but in this case, where we see a literal \\xf6, what we have is a string that has been escaped. We can fix that by decoding the escapes:

>>> print open('data.csv').readline().decode('string_escape')
Isten H�ta M�g�tt

But that only gets us halfway: the result is still an encoded (latin-1) byte string. Now a double decode:

>>> print open('data.csv').readline().decode('string_escape').decode('latin1')
Isten Háta Mögött

Got it! The problem lies in whatever wrote the file.
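The snippets above are Python 2, where str has a decode method and the string_escape codec exists. Neither is available in Python 3, so here is a minimal sketch of a rough Python 3 equivalent. It assumes the file is plain ASCII in which each special character is written as a literal backslash followed by x and two hex digits. The function names unescape_line and read_unescaped are made up for illustration. The unicode_escape codec maps \xNN to the code point U+00NN, which coincides with latin-1, so one decode covers both steps here.

```python
def unescape_line(raw_line: bytes) -> str:
    """Turn literal \\xNN sequences into the characters they stand for."""
    # 'unicode_escape' reads the four characters \xf3 as the code point U+00F3,
    # which is the same character latin-1 assigns to the byte 0xF3.
    return raw_line.decode('unicode_escape')


def read_unescaped(path):
    """Read a file in binary mode and unescape every line (hypothetical helper)."""
    with open(path, 'rb') as handle:
        return [unescape_line(line.rstrip(b'\r\n')) for line in handle]


# The bytes as they would sit in the file: backslash, x, f, 3.
sample = b'Sigur R\\xf3s'
print(unescape_line(sample))  # Sigur Rós
```

If the file also contained real non-ASCII bytes (for example UTF-8 text mixed with escapes), this shortcut would likely misread them, so treat it as a starting point rather than a general solution.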

Answer 2

>>> mystring = 'Sigur R\xf3s'
>>> print mystring
Sigur R�s
>>> print mystring.decode('latin-1')
Sigur Rós

This seems to work fine on Python 2.7. Can you show some code and the error it generates?
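A likely reason the hard-coded string behaves differently from the file contents: inside a string literal, the Python parser itself converts \xf3 into a single byte, whereas text read from the file still holds the backslash and the following characters separately. The sketch below illustrates this in Python 3 syntax, with bytes literals standing in for Python 2 str. The variable names are illustrative, and the file content is an assumption based on the sample in the question.

```python
# In a literal, the parser turns \xf3 into one byte (0xF3).
hard_coded = b'Sigur R\xf3s'

# What the file presumably holds: a backslash followed by "xf3" (four bytes).
from_file = b'Sigur R\\xf3s'

print(len(hard_coded))  # 9
print(len(from_file))   # 12

# repr shows the backslash doubled, which matches the "extra backslash"
# the question describes seeing in the raw string.
print(repr(from_file))  # b'Sigur R\\xf3s'

# latin-1 decoding only helps when the byte 0xF3 is really there.
print(hard_coded.decode('latin-1'))  # Sigur Rós
print(from_file.decode('latin-1'))   # Sigur R\xf3s
```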

2 answers

Answer 1: 1 (accepted), 2016-11-14 17:43:40
Answer 2: -1, 2016-11-14 17:30:04
