How to write a unicode object into a file in Python?

I tried to write a "string" to a file and got the following error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 6: ordinal not in range(128)

I tried the following methods:

print >>f, txt
print >>f, txt.decode('utf-8')
print >>f, txt.encode('utf-8')

None of them work. I get the same error message each time.

What is the idea behind encoding and decoding? If I have a unicode object, can I write it to the file directly, or do I need to transform it into a string?

How can I find out what encoding is used? How can I know if it is utf-8 or ascii or something else?

ADDED

I think I have just managed to save a string into a file. print >>f, txt and print >>f, txt.decode('utf-8') did not work, but print >>f, txt.encode('utf-8') works. I get no error message and I see Chinese characters in my file.

I recently posted another answer that addresses this very issue. Key quote:

For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
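As a small illustration of that last sentence, here is a minimal sketch in Python 3 syntax (an assumption; the question itself uses Python 2). The sample text is made up for the example.

```python
# In Python 3, str holds characters and bytes holds encoded bytes.
text = "\xcdndice"                 # hypothetical sample; "\xcd" is the character Í

data = text.encode("utf-8")        # encoding: characters -> bytes
restored = data.decode("utf-8")    # decoding: bytes -> characters

print(type(text).__name__, len(text))   # 6 characters
print(type(data).__name__, len(data))   # 7 bytes, because Í takes two bytes in UTF-8
print(restored == text)                 # the round trip preserves the text
```

The character count and the byte count differ because one character can map to more than one byte.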

In Python 2, unicode objects are character strings. Regular str objects can be either character strings or byte strings. (Pro tip: use Python 3, it makes keeping track a lot easier.)

You should be passing character strings (not byte strings) to print, but you will need to be sure that those character strings can be encoded by the codec (such as ASCII or UTF-8) associated with the destination file object f. As part of the output process, Python encodes the string for you. If the string contains characters that cannot be encoded by the file object's codec, you will get errors like the one you're seeing.
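To make the role of the file object's codec concrete, here is a minimal sketch in Python 3 syntax (an assumption; in Python 2 the same idea applies through codecs.open or io.open). The sample text and the temporary path are hypothetical.

```python
import os
import tempfile

text = "abcdef\xcd"   # hypothetical sample with a non-ASCII character at position 6
path = os.path.join(tempfile.mkdtemp(), "out.txt")   # hypothetical output path

# A file object whose codec is ASCII cannot encode "\xcd".
ascii_error = None
try:
    with open(path, "w", encoding="ascii") as f:
        print(text, file=f)
except UnicodeEncodeError as exc:
    ascii_error = exc

# The same character string is accepted when the file's codec is UTF-8.
with open(path, "w", encoding="utf-8") as f:
    print(text, file=f)

# Reading the file back in binary mode shows the bytes that were written.
with open(path, "rb") as f:
    raw = f.read()

print(ascii_error)
print(raw)
```

The string passed to print is identical in both cases; only the codec attached to the file object changes.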

Without knowing what is in your txt object I can't be more specific.

I think you need to use the codecs library:

import codecs

file = codecs.open("test.txt", "w", "utf-8")
file.write(u'\xcd')
file.close()

Works fine.
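A possible alternative, sketched here as an assumption rather than as part of the original answer, is io.open, which takes an encoding argument on Python 2.6+ and is the same function as the built-in open on Python 3. The temporary path is hypothetical.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")   # hypothetical location

# The with block closes the file even if the write raises an exception.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\xcd")

# Read the text back with the same codec to confirm the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    content = f.read()

print(content == u"\xcd")
```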

The Story of Encoding/Decoding:

In the past, computers only had to represent a small set of characters (upper-case and lower-case letters, digits, and some special characters), so a single byte was enough to assign a unique number to each one. Assigning numbers to characters so they can be stored in memory is called encoding. The one-byte encoding that Python 2 uses by default is named ASCII; it defines 128 characters.

As computers spread around the world, more letters and symbols had to be represented, so 1 byte was no longer enough and different encoding schemes appeared. Unicode is one of the best known: it is a character set, and encodings such as UTF-8 turn its characters into bytes. The character that you are trying to store in your file (u'\xcd') is outside the ASCII range and needs 2 bytes in UTF-8, so you must explicitly tell Python that you don't want to use the default encoding, i.e. ASCII.
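The byte count depends on the encoding chosen, which a short check can show. This is a minimal sketch in Python 3 syntax (an assumption; the question uses Python 2), and the choice of Latin-1 as a comparison is only for illustration.

```python
ch = "\xcd"   # the character from the error message (Í)

utf8_bytes = ch.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = ch.encode("latin-1")  # one byte in Latin-1

# ASCII only covers code points 0..127, so this character cannot be encoded.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(utf8_bytes), len(latin1_bytes), ascii_ok)
```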
