
Python script receiving a UnicodeEncodeError: 'ascii' codec can't encode character

I have a simple Python script that pulls posts from reddit and posts them on Twitter. Unfortunately, tonight it began having issues that I assume are caused by a formatting problem in someone's reddit title. The error I'm receiving is:

  File "redditbot.py", line 82, in <module>
    main()
  File "redditbot.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "redditbot.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

And here is my script:

# encoding=utf8
import praw
import json
import requests
import tweepy
import time
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')

access_token = 'hidden'
access_token_secret = 'hidden'
consumer_key = 'hidden'
consumer_secret = 'hidden'


def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]

        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' %(subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit



def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"

if __name__ == '__main__':
    main()

Any help would be very much appreciated. Thanks in advance!

Consider this simple program:

print(u'\u201c' + "python")

If you try printing to a terminal (with an appropriate character encoding), you get

“python

However, if you try redirecting output to a file, you get a UnicodeEncodeError.

script.py > /tmp/out
Traceback (most recent call last):
  File "/home/unutbu/pybin/script.py", line 4, in <module>
    print(u'\u201c' + "python")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

When you print to a terminal, Python uses the terminal's character encoding to encode unicode. (Terminals can only print bytes, so unicode must be encoded in order to be printed.)

When you redirect output to a file, Python cannot determine the character encoding, since files have no declared encoding. So by default Python 2 implicitly encodes all unicode using the ascii encoding before writing to the file. Since u'\u201c' cannot be ascii-encoded, a UnicodeEncodeError is raised. (Only the first 128 Unicode code points, U+0000 through U+007F, can be encoded with ascii.)

This issue is explained in detail on the "Why Print Fails" page of the Python wiki.
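The difference between the two cases can be reproduced without a terminal. The following is a minimal sketch, not the mechanism Python 2 uses internally: it wraps an in-memory byte buffer in a text stream with a chosen encoding, which stands in for stdout attached to a UTF-8 terminal versus stdout falling back to ascii. The helper name write_text is hypothetical.

```python
import io


def write_text(text, encoding):
    """Write text through a stream that uses the given encoding and
    return the bytes that reached the underlying buffer."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


title = u'\u201cpython'

# A stream that knows it is UTF-8 encodes the curly quote as three bytes.
utf8_bytes = write_text(title, 'utf-8')

# A stream that falls back to ascii cannot encode it at all.
try:
    write_text(title, 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Here utf8_bytes holds the encoded title, while the ascii attempt raises the same UnicodeEncodeError as in the traceback above.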


To fix the problem, first, avoid concatenating unicode and byte strings. This causes an implicit conversion using the ascii codec in Python 2, and an exception in Python 3. To future-proof your code, it is better to be explicit. For example, encode the title and the link explicitly before formatting and printing the bytes:

title = post.encode('utf-8')  # keep post itself unchanged so it still works as the dict key
link = post_dict[post].encode('utf-8')
print('{} {} #python'.format(title, link))
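One way to keep the conversion explicit is to build the whole status as text and encode it once, at the point where bytes are actually needed. This is only a sketch under assumptions: build_status and to_utf8 are hypothetical helper names, and the URL is a placeholder rather than anything from the original script.

```python
def build_status(title, url, hashtag=u'#python'):
    """Join the pieces as text; nothing is encoded here."""
    return u'{0} {1} {2}'.format(title, url, hashtag)


def to_utf8(text):
    """Encode once, at the output boundary."""
    return text.encode('utf-8')


# Placeholder title and URL, used only for illustration.
status = build_status(u'\u201cQuoted\u201d title', u'http://example.com/post')
encoded = to_utf8(status)
```

Keeping everything as text until the last step means no byte string is ever mixed with a unicode string, so no implicit ascii conversion can occur.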

You are trying to print a unicode string to your terminal (or possibly to a file by IO redirection), but the encoding used by your terminal (or file system) is ASCII. Because of this, Python attempts to convert the string from its unicode representation to ASCII, but fails because the code point u'\u201c' (“) cannot be represented in ASCII. Effectively your code is doing this:

>>> print u'\u201c'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

You could try converting to UTF-8:

print (post + " " + post_dict[post] + " #python").encode('utf8')

or convert to ASCII like this:

print (post + " " + post_dict[post] + " #python").encode('ascii', 'replace')

which will replace every character that cannot be represented in ASCII with ?.
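'replace' is one of several standard error handlers that encode() accepts. As a rough comparison, assuming a short sample string chosen here for illustration, the common handlers behave like this:

```python
text = u'\u201cpython\u201d'

# Substitute a question mark for each unencodable character.
replaced = text.encode('ascii', 'replace')

# Drop unencodable characters silently.
ignored = text.encode('ascii', 'ignore')

# Keep the information as a Python-style escape sequence.
escaped = text.encode('ascii', 'backslashreplace')

# Keep the information as an XML/HTML numeric character reference.
xml_refs = text.encode('ascii', 'xmlcharrefreplace')
```

'replace' and 'ignore' lose information, so 'backslashreplace' may be the better choice for log output where the original characters still matter.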

Another way, which is useful if you are printing for debugging purposes, is to print the repr of the string:

print repr(post + " " + post_dict[post] + " #python")

which would output something like this:

>>> s = u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
>>> print repr(s)
u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'

The problem likely arises from mixing byte strings and unicode strings on concatenation. As an alternative to prefixing all string literals with u, maybe

from __future__ import unicode_literals

fixes things for you. See here for a deeper explanation and to decide whether it is an option for you.
