Unicode files with Python and fileinput

Question

I am becoming more and more convinced that the business of file encodings is made as confusing as possible on purpose. I have a problem with reading a UTF-8 encoded file that contains just one line:

“blabla this is some text”

(Note that the quotation marks are the curly "fancy" variants of the standard quotation marks.)

Now, I run this piece of Python code on it:

import fileinput
def charinput(paths):
    with open(paths) as fi:
        for line in fi:
            for char in line:
                yield char
i = charinput('path/to/file.txt')
for item in i:
    print(item)

If I run my Python code from the command prompt, the result is some strange characters, followed by an error message:

ď
»
ż
â
Traceback (most recent call last):
  File "krneki.py", line 11, in <module>
    print(item)
  File "C:\Python34\lib\encodings\cp852.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position
0: character maps to <undefined>

I get the idea that the problem comes from the fact that Python tries to read a "wrongly" encoded document, but is there a way to tell fileinput.input to read UTF-8?

EDIT: Some really weird stuff is happening and I have NO idea how any of it works. After saving the same file as before in Notepad++, the Python code now runs within IDLE and results in the following output (newlines removed):

ď»żâ€śblabla this is some textâ€ť

I can also keep the command prompt from crashing if I first enter chcp 65001. Running the file then results in

ÄÂ»Å¼Ã¢â‚¬Å›blabla this is some text Ã¢â‚¬Å¥

Any ideas? This is a horrible mess, if you ask me, but it is vital that I understand it...

Answer 1

Encoding

Every file is encoded. The byte 0x4C is interpreted as Latin capital letter L according to the ASCII encoding, but as the less-than sign ('<') according to the EBCDIC encoding. There Ain't No Such Thing As Plain Text.
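
To make this concrete, here is a small sketch that decodes the same byte under two encodings. It assumes cp037, one of the EBCDIC code pages that ship with Python, as a stand-in for "the EBCDIC encoding"; the variable names are only illustrative.

```python
# The same byte value decoded under two different single-byte encodings.
raw = b"\x4c"

as_ascii = raw.decode("ascii")   # ASCII maps 0x4C to 'L'
as_ebcdic = raw.decode("cp037")  # cp037 (assumed here) is an EBCDIC code page

print(as_ascii, as_ebcdic)
```

The bytes on disk never change; only the table used to interpret them does.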

There are single-byte character sets like ASCII that use a single byte to encode each symbol, there are double-byte character sets like KS X 1001 that use two bytes to encode each symbol, and there are encodings like the popular UTF-8 that use a variable number of bytes per symbol.

UTF-8 has become the most popular encoding for new applications, so I'll give some examples: The Latin Capital Letter A is stored as a single byte: 0x41. The Left Double Quotation Mark (“) is stored as three bytes: 0xE2 0x80 0x9C. The emoji Pile of Poo is stored as four bytes: 0xF0 0x9F 0x92 0xA9.
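
These byte sequences can be checked from Python itself. The following is a minimal sketch; the dictionary keys are just labels chosen for this illustration.

```python
# Encode three characters as UTF-8 and show how many bytes each one needs.
samples = {
    "LATIN CAPITAL LETTER A": "A",
    "LEFT DOUBLE QUOTATION MARK": "\u201c",
    "PILE OF POO": "\U0001F4A9",
}

encoded = {name: char.encode("utf-8") for name, char in samples.items()}

for name, data in encoded.items():
    # Print the byte count and the bytes in hexadecimal.
    print(name, len(data), " ".join("0x%02X" % b for b in data))
```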

Any program that reads a file and has to interpret the bytes as symbols has to know (or to guess) which encoding was used.
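
A wrong guess is most likely what produced the output in the question. The sketch below tries to reproduce it under two assumptions that the question does not state outright: that the file starts with a UTF-8 byte order mark (the ď»ż prefix suggests so), and that Python fell back to the Windows code page cp1250 when opening it. The cp852 console encoding is taken from the traceback.

```python
# Bytes of the file: UTF-8 BOM (assumed) followed by the UTF-8 encoded line.
text = "\u201cblabla this is some text\u201d"
raw = b"\xef\xbb\xbf" + text.encode("utf-8")

# Wrong guess: decode the UTF-8 bytes as cp1250 (assumed locale default).
garbled = raw.decode("cp1250")
print(ascii(garbled))

# Printing to a cp852 console then fails on the first unmappable character.
try:
    garbled.encode("cp852")
    console_error = None
except UnicodeEncodeError as exc:
    console_error = exc

# Right guess: "utf-8-sig" decodes UTF-8 and drops a leading BOM if present.
correct = raw.decode("utf-8-sig")
```

Under these assumptions the garbled string matches the IDLE output shown in the question, and the character that cannot be encoded for the console is the euro sign, as in the traceback.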

If you are not familiar with Unicode or UTF-8 you might want to read http://www.joelonsoftware.com/articles/unicode.html

Reading Files in Python 3

Python 3's built-in function open() has an optional keyword argument encoding to support different encodings. To open a UTF-8 encoded file you can write open(filename, encoding="utf-8") and Python will take care of the decoding.
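
Applied to the generator from the question, this could look like the sketch below. The default of "utf-8-sig" is my own choice rather than part of the original advice: it decodes UTF-8 and also skips a leading byte order mark, whereas plain "utf-8" would yield the BOM as the first character. The temporary file only stands in for the real path.

```python
import os
import tempfile


def charinput(path, encoding="utf-8-sig"):
    """Yield the characters of a text file one by one, decoded with `encoding`."""
    with open(path, encoding=encoding) as fi:
        for line in fi:
            for char in line:
                yield char


# Create a small demo file: UTF-8 BOM plus one line with curly quotes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbf" + "\u201cblabla this is some text\u201d\n".encode("utf-8"))
    demo_path = tmp.name

try:
    chars = list(charinput(demo_path))
finally:
    os.remove(demo_path)

# ascii() keeps the output printable even on a console with a limited code page.
print(ascii("".join(chars)))
```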

Also, the fileinput module supports encodings via the openhook keyword argument: fileinput.input(filename, openhook=fileinput.hook_encoded("utf-8")).
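
A minimal, self-contained sketch of the fileinput variant follows. The temporary file and the variable names are placeholders for illustration only.

```python
import fileinput
import os
import tempfile

# Create a small UTF-8 encoded demo file with curly quotes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("\u201cblabla this is some text\u201d\n".encode("utf-8"))
    demo_path = tmp.name

try:
    # hook_encoded() returns an open hook that decodes each file as UTF-8.
    hook = fileinput.hook_encoded("utf-8")
    with fileinput.input(files=[demo_path], openhook=hook) as lines:
        decoded_lines = [line for line in lines]
finally:
    os.remove(demo_path)

print(ascii(decoded_lines))
```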

If you are not familiar with Python and Unicode or UTF-8, you should read http://docs.python.org/3/howto/unicode.html. I also found some nice tricks at http://www.chirayuk.com/snippets/python/unicode

Reading Strings in Python 2

In Python 2, open() does not know about encodings. Instead you can use the codecs module to specify which encoding should be used: codecs.open(filename, encoding="utf-8")
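
As a rough sketch, reading the whole file through codecs.open could look like this. The temporary file is a placeholder, and the snippet is written so that it should also run unchanged on Python 3.

```python
import codecs
import os
import tempfile

# Create a small demo file containing UTF-8 encoded curly quotes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xe2\x80\x9cblabla this is some text\xe2\x80\x9d\n")
    demo_path = tmp.name

try:
    # codecs.open() returns a file object that decodes bytes to unicode text.
    with codecs.open(demo_path, encoding="utf-8") as fi:
        content = fi.read()
finally:
    os.remove(demo_path)

# The length counts characters, not bytes: each curly quote is one character.
print(len(content))
```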

The best source for Python 2 / Unicode enlightenment is http://docs.python.org/2/howto/unicode.html

Solution 1
9 2015-10-29 20:10:46
