Problem with UTF-8 encoding/decoding

Question

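I am reading a UTF-8 encoded .csv file. I want to create an index and rewrite the csv. The index is built from a running number and the first letter of a word. Python 2.7.10, Ubuntu Server.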

#!/usr/bin/env python
# -*- coding: utf-8 -*-
counter = 0
tempDict = {}
with open(modifiedFile, "wb") as newFile:
    with open(originalFile, "r") as file:
        for row in file:
            myList = row.split(",")
            toId = str(myList[0])

            if toId not in tempDict:
                tempDict[toId] = counter
                myId = str(toId[0]) + str(counter)
                myList.append(myId)
                counter += 1
            else:
                myId = str(toId[0]) + str(tempDict[toId])
                myList.append(myId)

            # and then I write everything into the csv
            for i, j in enumerate(myList):
                if i < 6:
                    newFile.write(str(j).strip())
                    newFile.write(",")

                else: 
                    newFile.write(str(j).strip())
                    newFile.write("\n")

The problem is the following. When a word starts with a fancy letter, such as

C
É
A
...

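the id I create starts with a ? instead of the word's first letter. The odd part is that in the csv I create, the words with fancy letters are written correctly. There are no ? or other symbols that would indicate an encoding error.

Why is that?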

Answer 1

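First of all, you should not be learning Python 2 unless you need a specific legacy C extension.

Python 3 makes major changes to unicode/bytes handling, removing (most) implicit behaviour and making errors visible. It is still good practice to use open('filename', encoding='utf-8'), since the default encoding depends on the environment and platform.

Indeed, running your program in Python 3 should fix it without any changes. But here is where your bug lies: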
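As a rough illustration of why spelling out the encoding helps, the Python 3 sketch below prints the locale's preferred encoding (which open() normally falls back to) and then round-trips a line with encoding='utf-8' given explicitly. The file name demo.csv and the temporary directory are made up for the example.

```python
import locale
import os
import tempfile

# The locale's preferred encoding, which open() normally falls back to
# when no encoding is given; it differs between platforms and environments.
default_encoding = locale.getpreferredencoding(False)
print("default:", default_encoding)

# Hypothetical file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Being explicit makes the result independent of the environment.
with open(path, "w", encoding="utf-8") as out:
    out.write("École,1\n")

with open(path, encoding="utf-8") as src:
    first_line = src.readline()

print(first_line[0])  # the first character of the row, not the first byte
```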

        toId = str(myList[0])

This is a no-op, since myList[0] is already a str.

            myId = str(toId[0]) + str(counter)

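This is a bug: toId is a str (a byte string) holding UTF-8 data, so toId[0] is the first byte, not the first character. A letter such as É takes two bytes in UTF-8, so the id ends up with only half of it. UTF-8 data should be decoded first and then processed one character at a time.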
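The difference between the first byte and the first character can be seen in a few lines. This is a minimal sketch in which a bytes object stands in for the Python 2 str that the file yields; the sample word "École" is an assumption chosen for illustration.

```python
# -*- coding: utf-8 -*-
# A bytes object stands in for the Python 2 str read from the file.
raw = u"École".encode("utf-8")

first_byte = raw[:1]                 # only the lead byte of the two-byte 'É'
first_char = raw.decode("utf-8")[0]  # the whole character, after decoding

print(repr(first_byte))
print(repr(first_char))

# A lone lead byte is not valid UTF-8, so it cannot be displayed as a letter.
print(repr(first_byte.decode("utf-8", "replace")))
```

The rest of the row survives because its bytes are written back unchanged; only the sliced id loses half a character.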

with open(originalFile, "r") as file:

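This is a style error, since it shadows the built-in function file.

Two changes are needed to make this work under Python 2.

Change open(filename, mode) to io.open(filename, mode, encoding='utf-8').
Stop calling str() on your strings, since once they are unicode it will actually try to encode them (as ASCII!).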
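Put together, the two changes might look like the sketch below. It is not the asker's exact program: the function name add_ids and its parameters are hypothetical, the id is appended to every row regardless of column count, and the naive split on commas is kept from the question. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io


def add_ids(original_path, modified_path):
    """Append an id (first character + running number) to every row."""
    seen = {}  # first column -> number assigned when it was first seen
    with io.open(original_path, "r", encoding="utf-8") as src, \
            io.open(modified_path, "w", encoding="utf-8") as dst:
        for row in src:
            # Rows are decoded text here, so no str() calls are needed.
            fields = [field.strip() for field in row.split(u",")]
            key = fields[0]
            if key not in seen:
                seen[key] = len(seen)
            # key[:1] is one whole character, even for letters like 'É'.
            fields.append(u"%s%d" % (key[:1], seen[key]))
            dst.write(u",".join(fields) + u"\n")
```

Calling add_ids("in.csv", "out.csv") (both paths are placeholders) would then write ids such as É0 instead of ?0.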

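But you really should switch to Python 3.

A few features were added in 2.6 and 2.7 to narrow the gap to 3. One of them is the io module, which behaves in all the nice new ways: Unicode files and universal newlines.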

~$ python2.7 -c 'import io,sys;print(list(io.open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
[u'\xc4\n', u'\xf9\n']
~$ python3 -c 'import sys;print(list(open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
['Ä\n', 'ù\n']

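This can be useful for writing software that runs on both 2 and 3. Again, the encoding argument is optional, but on all platforms the default encoding depends on the environment, so it is good to be specific.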

Answer 2

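In Python 2.x, strings are non-unicode by default, and str() returns a non-unicode string. Use unicode() instead.

Also, you must open the file with UTF-8 encoding via codecs.open() rather than the built-in open().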
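A minimal sketch of the codecs.open() approach, assuming a made-up sample file; it runs on Python 2.7 and Python 3 alike, although on Python 3 the built-in open() with an encoding argument does the same job.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical sample file, written as raw UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as out:
    out.write(u"École,1\n".encode("utf-8"))

# codecs.open decodes while reading, so each row is text rather than bytes.
with codecs.open(path, "r", encoding="utf-8") as src:
    row = src.readline()

first = row[0]  # one whole character, safe to use in an id
print(repr(first))
```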

Answer 1: 3, 2017-01-23 20:22:14
Answer 2: 0, accepted, 2017-01-23 18:42:22
