
Why can't I save my file as UTF-8?

I want to save a string to a new txt file.

The encoding of the string is 'utf-8' (I think) and it contains some Chinese characters.

But the saved file's encoding is GB2312.

Here is my code (some parts omitted):

# -*- coding:utf-8 -*-
# Python 3.4, Windows 7

def getUrl(self, url, coding='utf-8'):
    self.__reCompile = {}
    req = request.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 UBrowser/5.5.9703.2 Safari/537.36')
    with request.urlopen(req) as response:
        return response.read().decode(coding)

def saveText(self,filename,content,mode='w'):
    self._checkPath(filename)
    with open(filename,mode) as f:
        f.write(content)

joke= self.getUrl(pageUrl)
#some re transform such as re.sub('<br>','\r\n',joke)
self.saveText(filepath+'.txt',joke,'a')

Sometimes a UnicodeEncodeError is raised.

Your exception is thrown in 'saveText', but I can't see how you implemented it, so I'll try to reproduce the error and then suggest a fix.

In 'getUrl' you return a decoded string ( .decode('utf-8') ), and my guess is that in 'saveText' you forget to encode it before writing to the file.

Reproducing the error

Trying to reproduce the error, I did this (Python 2 syntax; in Python 3, str has no decode method):

# String with unicode chars, decoded like in your example
s = 'æøå'.decode('utf-8')

# How saveText could be:
# Encode before write
f = open('test', mode='w')
f.write(s)
f.close()

This gives a similar exception:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-36-1309da3ad975> in <module>()
      5 # Encode before write
      6 f = open('test', mode='w')
----> 7 f.write(s)
      8 f.close()

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
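
The reproduction above relies on Python 2's implicit 'ascii' default. As a rough sketch of the same failure in Python 3 (which the question uses), the snippet below forces a narrow encoding on the file so that the write fails regardless of the machine's locale. The temporary path and the choice of encoding="ascii" are assumptions made only to keep the example deterministic.

```python
import os
import tempfile

s = "æøå"  # in Python 3 a str is already Unicode text, no decode step needed
path = os.path.join(tempfile.mkdtemp(), "test")

error = None
try:
    # Forcing 'ascii' stands in for a default codec that cannot
    # represent every character in the string.
    with open(path, mode="w", encoding="ascii") as f:
        f.write(s)
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

On the asker's machine the same thing happens with the locale's default codec taking the place of 'ascii'.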

Two ways of fixing it

You can do either of the following:

# String with unicode chars, decoded like in your example
s = 'æøå'.decode('utf-8')

# How saveText could be:
# Encode before write
f = open('test', mode='w')
f.write(s.encode('utf-8'))
f.close()

Or you can try writing the file using the 'codecs' module:

import codecs

# String with unicode chars, decoded like in your example
s = 'æøå'.decode('utf-8')

# How saveText could be:
f = codecs.open('test', encoding='utf-8', mode='w')
f.write(s)  
f.close()

Hope this helps.

"The encoding of the string is 'utf-8' (I think so) and it contains some Chinese character"

You've decoded the response from the remote server using UTF-8. Once it's decoded to a Python string, it's no longer encoded; it is effectively stored as Unicode code points in memory.

The error you're getting is because Python is trying to use your code page to convert the string to bytes. Due to your Windows region settings, it has chosen GBK, which doesn't support all Unicode characters.
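
To see this for yourself, here is a small sketch. The sample string and the can_encode helper are made up for illustration; locale.getpreferredencoding(False) reports the encoding that open() falls back to when no encoding argument is given.

```python
import locale

# Encoding that open() uses in text mode when no encoding is passed
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Chinese text, three Latin letters, and an emoji (illustrative sample)
text = "笑话 \u00e6\u00f8\u00e5 \U0001F600"

def can_encode(s, encoding):
    """Return True if every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode(text, "gbk"))    # the emoji has no GBK representation
print(can_encode(text, "utf-8"))  # UTF-8 covers all of Unicode
```

Plain Chinese text usually fits in GBK, which would explain why the error only shows up sometimes: it takes a character outside GBK's repertoire to trigger it.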

To save, you simply need to open the output file with a specified encoding, using the encoding argument to open() (in Python 2, use io.open() instead). In your case, "UTF-8" may be the appropriate encoding to use.

Your saveText() method needs to be updated to:

def saveText(self, filename, content, mode='w', encoding="utf-8"):
    self._checkPath(filename)
    # Pass encoding by keyword: the third positional argument of open() is buffering
    with open(filename, mode, encoding=encoding) as f:
        f.write(content)
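
As a quick sanity check, here is a standalone sketch of the same idea. The function name save_text, the temporary path, and the sample strings are assumptions for illustration, and the class plumbing (_checkPath) is left out. It writes and appends text, then reads the raw bytes back to confirm that they decode as UTF-8.

```python
import os
import tempfile

def save_text(filename, content, mode="w", encoding="utf-8"):
    # Same idea as saveText above, without the class context
    with open(filename, mode, encoding=encoding) as f:
        f.write(content)

path = os.path.join(tempfile.mkdtemp(), "joke.txt")
save_text(path, "笑话\n")            # create the file
save_text(path, "æøå\n", mode="a")  # append, as in the question

# Read the raw bytes back and decode them explicitly
with open(path, "rb") as f:
    raw = f.read()

lines = raw.decode("utf-8").split()
print(len(lines))
```

If the bytes decode cleanly as UTF-8 and the text round-trips, the file was saved in the intended encoding rather than the locale default.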

You may run into an issue with your HTTP data. You're assuming the remote content is UTF-8 when you decode the response. This won't always be the case. You could analyse the HTTP response headers to get the right encoding, or use the Requests library, which does this for you. Your URL getter would look like:

import requests

def getUrl(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 UBrowser/5.5.9703.2 Safari/537.36'}
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Raise an exception on HTTP error statuses
    return response.text
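
If you would rather stay with urllib, as in the question, the header analysis mentioned above could look roughly like this. It is a sketch, not a drop-in replacement: pick_charset and get_url are hypothetical names, the shortened User-Agent is a placeholder, and falling back to UTF-8 with errors="replace" is an assumption about what you want when the server declares no charset. The header object returned by urlopen() is an email.message.Message subclass, so its get_content_charset() method is available.

```python
from email.message import Message
from urllib import request

def pick_charset(headers, default="utf-8"):
    """Return the charset declared in Content-Type, or default if none is given."""
    return headers.get_content_charset() or default

def get_url(url, default="utf-8"):
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with request.urlopen(req) as response:
        charset = pick_charset(response.headers, default)
        # errors="replace" avoids a crash if the declared charset is wrong
        return response.read().decode(charset, errors="replace")

# Offline check of the header parsing, no network needed
sample = Message()
sample["Content-Type"] = "text/html; charset=GB2312"
print(pick_charset(sample))
```

Note that this only reads the HTTP header; pages that declare their encoding solely in an HTML meta tag would still need separate handling.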

I think the encoding your terminal is using doesn't support that character. Python is handling it just fine; I think it's your output encoding that cannot handle it.
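
One way to check that theory is to look at sys.stdout.encoding and to replace unsupported characters before printing. This is only a sketch: printable is a hypothetical helper, and substituting '?' is one assumption about acceptable output.

```python
import sys

def printable(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)

print(sys.stdout.encoding)               # what the terminal claims to support
print(printable("joke: 笑话", "ascii"))  # simulate an ASCII-only terminal
```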

See also this question
