Trouble with utf-8 encoding/decoding

I am reading a .csv which is UTF-8 encoded. I want to create an index and rewrite the csv. The index is built from the first letter of a word followed by a running number. Python 2.7.10, Ubuntu Server.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
counter = 0
tempDict = {}
with open(modifiedFile, "wb") as newFile:
    with open(originalFile, "r") as file:
        for row in file:
            myList = row.split(",")
            toId = str(myList[0])

            if toId not in tempDict:
                tempDict[toId] = counter
                myId = str(toId[0]) + str(counter)
                myList.append(myId)
                counter += 1
            else:
                myId = str(toId[0]) + str(tempDict[toId])
                myList.append(myId)

            # and then I write everything into the csv
            for i, j in enumerate(myList):
                if i < 6:
                    newFile.write(str(j).strip())
                    newFile.write(",")

                else: 
                    newFile.write(str(j).strip())
                    newFile.write("\n")

The problem is the following. When a word starts with a fancy letter, such as

  • Č
  • É
  • Ā
  • ...

The id I create starts with a ? instead of the first letter of the word. The strange part is that within the csv I create, the words with the fancy letters are written correctly. There are no ? or other symbols which would indicate a wrong encoding.

Why is that?

By all means, you should not be learning Python 2 unless there is a specific legacy C extension that you need.

Python 3 makes major changes to the unicode/bytes handling that remove (most) implicit behavior and make errors visible. It's still good practice to use open('filename', encoding='utf-8') since the default encoding is environment- and platform-dependent.

Indeed, running your program in Python 3 should fix the id problem almost unchanged (the output file has to be opened in text mode, "w", because Python 3 refuses to write a str to a file opened with "wb"). But here's where your bug lies:

        toId = str(myList[0])

This is a no-op, since myList[0] is already a str.

            myId = str(toId[0]) + str(counter)

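This is a bug: toId is a str (byte string) containing UTF-8 data, so toId[0] is the first byte, not the first character. A letter like Č takes two bytes in UTF-8, and the first of them on its own is not a valid character, which is why it shows up as ?. You never, ever want to index or slice raw UTF-8 bytes; decode them first and then process the text one character at a time.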
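To see the difference between the first byte and the first character, here is a minimal sketch. The sample word `Čech` is an assumption made up for illustration and is not taken from the question's data; the byte string is produced with `.encode()` so the snippet should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative only: the sample word is hypothetical, not from the CSV.
word = u"Čech"
raw = word.encode("utf-8")            # the bytes as they would sit in the file

first_byte = raw[:1]                  # a slice stays a byte string on 2 and 3
first_char = raw.decode("utf-8")[:1]  # decode first, then take a character

# The first byte of a multi-byte character cannot be decoded on its own.
try:
    first_byte.decode("utf-8")
    half_is_valid = True
except UnicodeDecodeError:
    half_is_valid = False

print(len(raw), len(word))            # more bytes than characters
print(repr(first_byte), half_is_valid)
```

The file itself looks fine because the program copies the original bytes through untouched; only the id is built from a lone first byte.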

with open(originalFile, "r") as file:

This is a style error, since it shadows the Python 2 built-in file.

Two changes are needed to make this run correctly under Python 2.

  1. Change open(filename, mode) to io.open(filename, mode, encoding='utf-8'), using a text mode ("r" or "w", not "wb", which does not accept an encoding).
  2. Stop calling str() on strings, since on a unicode string that actually attempts to encode it (in ASCII!).
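Put together, the two changes might look like the sketch below. It is not the asker's exact program: the function name `add_ids` and the variable names are hypothetical, blank rows are skipped, and every row is written as its original fields plus the new id joined by commas, rather than relying on a fixed count of six columns. Like the original, it splits naively on commas and does not handle quoted fields.

```python
# -*- coding: utf-8 -*-
import io


def add_ids(original_path, modified_path):
    """Append an id (first letter + running number) to each CSV row.

    Hypothetical helper: works on decoded text, never on raw bytes.
    """
    counter = 0
    seen = {}
    # Text mode plus an explicit encoding: rows arrive as unicode strings.
    with io.open(original_path, "r", encoding="utf-8") as src, \
            io.open(modified_path, "w", encoding="utf-8") as dst:
        for row in src:
            if not row.strip():
                continue  # assumption: blank lines carry no data
            fields = [field.strip() for field in row.split(u",")]
            key = fields[0]
            if key not in seen:
                seen[key] = counter
                counter += 1
            # key[0] is a whole character here, so no str() call is needed.
            fields.append(u"%s%d" % (key[0], seen[key]))
            dst.write(u",".join(fields) + u"\n")
```

Calling `add_ids("in.csv", "out.csv")` (both file names are placeholders) on rows starting with `Čech`, `Émile`, `Čech` would be expected to append `Č0`, `É1`, `Č0`.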

But you really should switch to Python 3.

There are a few pieces new to 2.6 and 2.7 that are intended to bridge the gap to 3, and one of them is the io module, which behaves in all the nice new ways: Unicode files and universal newlines.

~$ python2.7 -c 'import io,sys;print(list(io.open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
[u'\xc4\n', u'\xf9\n']
~$ python3 -c 'import sys;print(list(open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
['Ä\n', 'ù\n']

This can be useful for writing software that runs on both 2 and 3. Again, the encoding argument is optional, but on all platforms the default encoding is environment-dependent, so it's good to be specific.
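The same check can be written as a small script instead of a shell one-liner. This is only a sketch: it writes the bytes from the example above to a throwaway temporary file (the temp-file handling is my own scaffolding) and reads them back with `io.open`, which is the same function as the built-in `open` on Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Same bytes as the shell example: two UTF-8 letters with CRLF line endings.
payload = b"\xc3\x84\r\n\xc3\xb9\r\n"

fd, path = tempfile.mkstemp()  # throwaway file, just for the demo
try:
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(payload)
    # Text mode: bytes are decoded and "\r\n" is translated to "\n".
    with io.open(path, encoding="utf-8") as f:
        lines = list(f)
finally:
    os.remove(path)

print(lines)
```

Under either interpreter the list holds two decoded strings, each ending in a plain `\n`.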

In Python 2.x, strings are non-unicode by default: str() returns a byte string. Use unicode() instead.

Besides, you need to open the file with an explicit utf-8 encoding, for example through codecs.open(), rather than the Python 2 built-in open().
