
Python3 UnicodeDecodeError with readlines() method

I'm trying to create a Twitter bot that reads lines from a file and posts them. I'm using Python 3 and tweepy, via a virtualenv on my shared server space. This is the part of the code that seems to have trouble:

#!/foo/env/bin/python3

import re
import tweepy, time, sys

argfile = str(sys.argv[1])

filename=open(argfile, 'r')
f=filename.readlines()
filename.close()

This is the error I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)

The traceback specifically points to f=filename.readlines() as the source of the error. Any idea what might be wrong? Thanks.

I think the best answer (in Python 3) is to use the errors= parameter:

with open('evil_unicode.txt', 'r', errors='replace') as f:
    lines = f.readlines()

Proof:

>>> s = b'\xe5abc\nline2\nline3'
>>> with open('evil_unicode.txt','wb') as f:
...     f.write(s)
...
16
>>> with open('evil_unicode.txt', 'r') as f:
...     lines = f.readlines()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte
>>> with open('evil_unicode.txt', 'r', errors='replace') as f:
...     lines = f.readlines()
...
>>> lines
['�abc\n', 'line2\n', 'line3']
>>>

Note that errors= can be replace or ignore. Here's what ignore looks like:

>>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
...     lines = f.readlines()
...
>>> lines
['abc\n', 'line2\n', 'line3']
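As a self-contained variation on the sessions above, the following sketch runs the same bytes through several error handlers in one script. The temporary file and the helper name read_lines are illustrative assumptions, not part of the original answer; backslashreplace is a third standard handler that keeps the offending byte visible as an escape (supported for decoding from Python 3.5 onward).

```python
import os
import tempfile

# Same bytes as in the demonstration above: 0xe5 alone is not valid UTF-8.
raw = b'\xe5abc\nline2\nline3'


def read_lines(path, errors):
    """Read all lines as UTF-8 text using the given error handler (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8', errors=errors) as f:
        return f.readlines()


# Write the sample bytes to a temporary file that is removed afterwards.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

try:
    results = {
        handler: read_lines(path, handler)
        for handler in ('replace', 'ignore', 'backslashreplace')
    }
finally:
    os.remove(path)

for handler, lines in results.items():
    print(handler, lines)
```

Both replace and ignore lose information about the original byte, so they are probably best treated as a fallback for when the real encoding cannot be determined.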

Your default encoding appears to be ASCII, whereas the input is more than likely UTF-8. When Python hits non-ASCII bytes in the input, it throws the exception. It's not so much that readlines itself is responsible for the problem; rather, it causes the read+decode to occur, and the decode is failing.
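To check which encoding a given environment would use, a minimal sketch like the one below may help. It relies on the documented behaviour that open() in text mode falls back to locale.getpreferredencoding(False) when encoding= is omitted; on a shared server with a C or POSIX locale this can report an ASCII encoding (for example ANSI_X3.4-1968).

```python
import locale

# Encoding that open() uses in text mode when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)
print('open() default encoding:', default_encoding)
```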

It's an easy fix though; the default open in Python 3 allows you to provide the known encoding of an input, replacing the default (ASCII in your case) with any other recognized encoding. Providing it allows you to keep reading as str (rather than the significantly different raw binary data bytes objects), while letting Python do the work of converting from raw disk bytes to true text data:

# Using with statement closes the file for us without needing to remember to close
# explicitly, and closes even when exceptions occur
with open(argfile, encoding='utf-8') as inf:
    f = inf.readlines()
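One caveat, offered as an observation rather than a diagnosis: the byte reported in the question, 0xfe at position 0, is never valid in UTF-8, but 0xfe 0xff is the UTF-16 big-endian byte-order mark. If encoding='utf-8' still fails, a rough check of the leading bytes along these lines might point to the right codec. The function name guess_encoding_from_bom and the temporary demo file are hypothetical, and the check is only a heuristic.

```python
import codecs
import os
import tempfile


def guess_encoding_from_bom(path, default='utf-8'):
    """Return a codec name based on a leading byte-order mark, else the default."""
    with open(path, 'rb') as f:
        head = f.read(4)
    # Check UTF-32 first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf-32'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    return default


# Demo: write a big-endian UTF-16 file, which starts with bytes 0xfe 0xff.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF16_BE + 'first\nsecond\n'.encode('utf-16-be'))

try:
    encoding = guess_encoding_from_bom(path)
    with open(path, encoding=encoding) as inf:
        lines = inf.readlines()
finally:
    os.remove(path)

print(encoding, lines)
```

The utf-16, utf-32 and utf-8-sig codecs consume the byte-order mark themselves, so the decoded lines do not start with a stray U+FEFF character.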

Ended up finding a working answer for myself:

filename=open(argfile, 'rb')

This post helped me out a lot. 这篇文章帮了我很多忙。
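Worth keeping in mind with the 'rb' approach: binary mode skips decoding entirely, so readlines() returns bytes objects rather than str. A sketch of what that looks like, and of decoding each line explicitly before using it as text, is below; the sample bytes, the temporary file, and the choice of UTF-8 with errors='replace' are assumptions for illustration.

```python
import os
import tempfile

raw = b'\xe5abc\nline2\nline3'

# Write sample bytes to a temporary file that is removed afterwards.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

try:
    # Binary mode: no decoding happens, so no UnicodeDecodeError can be raised here.
    with open(path, 'rb') as f:
        byte_lines = f.readlines()
finally:
    os.remove(path)

# Each element is a bytes object; decode explicitly to get str for posting as text.
text_lines = [line.decode('utf-8', errors='replace') for line in byte_lines]

print(byte_lines)
print(text_lines)
```

Opening in binary mode avoids the exception but only postpones the encoding question to the point where the bytes are turned into text.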

