Python script raising UnicodeEncodeError: 'ascii' codec can't encode character

Question

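I have a simple Python script that pulls posts from reddit and posts them to Twitter. Unfortunately, tonight it started having problems, which I assume are caused by a formatting issue in someone's title on reddit. The error I am getting is: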

  File "redditbot.py", line 82, in <module>
    main()
  File "redditbot.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "redditbot.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

Here is my script:

# encoding=utf8
import praw
import json
import requests
import tweepy
import time
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')

access_token = 'hidden'
access_token_secret = 'hidden'
consumer_key = 'hidden'
consumer_secret = 'hidden'


def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]

        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' % (subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit



def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"

if __name__ == '__main__':
    main()

Any help would be greatly appreciated. Thanks in advance!

Answer 1

Consider this simple program:

print(u'\u201c' + "python")

If you print to a terminal (one that uses a suitable character encoding), you get

“python

However, if you redirect the output to a file, you get a UnicodeEncodeError:

script.py > /tmp/out
Traceback (most recent call last):
  File "/home/unutbu/pybin/script.py", line 4, in <module>
    print(u'\u201c' + "python")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

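When you print to a terminal, Python uses the terminal's character encoding to encode the unicode. (A terminal can only print bytes, so unicode must be encoded before it can be printed.)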

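When you redirect the output to a file, Python cannot determine a character encoding, because a file has no declared encoding. So, by default, Python 2 implicitly encodes all unicode with the ascii codec before writing it to the file. Since u'\u201c' cannot be encoded as ascii, you get a UnicodeEncodeError. (Only the first 128 unicode code points, 0 through 127, can be encoded as ascii.)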

This problem is explained in detail in the "Why Print Fails" wiki.
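To see the cause in isolation, here is a minimal sketch that checks which encoding stdout reports and whether a title survives a given codec. The helper names stdout_encoding and can_encode are made up for this illustration; only sys.stdout.encoding and the string encode method come from Python itself.

```python
import sys

def stdout_encoding():
    """Return the codec reported for stdout, falling back to ascii."""
    # sys.stdout.encoding is None in Python 2 when output is redirected;
    # the 'ascii' fallback mirrors the behaviour described above.
    return getattr(sys.stdout, "encoding", None) or "ascii"

def can_encode(text, encoding):
    """Report whether text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

title = u"\u201cpython"
ascii_ok = can_encode(title, "ascii")  # U+201C is outside the ASCII range
utf8_ok = can_encode(title, "utf-8")   # UTF-8 can represent U+201C
```

If stdout_encoding() reports ascii when the script runs under redirection, that matches the failure above. Setting the PYTHONIOENCODING environment variable (for example to utf-8) before starting Python is one way to override it without touching the code.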

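To fix this, first avoid adding unicode and byte strings together. In Python 2 this triggers an implicit conversion using the ascii codec, and in Python 3 it raises an exception. To future-proof your code, it is best to be explicit. For example, build the line as unicode and explicitly encode it just before printing the bytes:

# look up the URL while post is still the unicode dictionary key
line = u'{} {} #python'.format(post, post_dict[post])
print(line.encode('utf-8'))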
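If you would rather not call encode at every print, one option is to wrap a byte stream once so that everything written to it is encoded as UTF-8. The sketch below uses codecs.getwriter from the standard library; utf8_writer and format_tweet are hypothetical helper names, the URL is a placeholder, and an in-memory io.BytesIO stands in for a redirected stdout.

```python
import codecs
import io

def utf8_writer(byte_stream):
    """Wrap a byte stream so unicode written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)

def format_tweet(title, url):
    """Build the tweet text as unicode; encoding happens only on output."""
    return u"{} {} #python".format(title, url)

# BytesIO plays the role of a file that stdout was redirected to.
buf = io.BytesIO()
out = utf8_writer(buf)
out.write(format_tweet(u"\u201cHello\u201d", u"http://example.com/1") + u"\n")
written = buf.getvalue()
```

In a Python 2 script the same wrapper could be applied as sys.stdout = codecs.getwriter('utf-8')(sys.stdout); in Python 3 the byte stream would be sys.stdout.buffer instead.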

Answer 2

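You are trying to print a unicode string to the terminal (or possibly to a file, via IO redirection), but the encoding used by the terminal (or file system) is ASCII. Python therefore tries to convert the string from its unicode representation to ASCII and fails, because the code point u'\u201c' (“) cannot be represented in ASCII. In effect, your code is doing this: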

>>> print u'\u201c'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

You can try converting to UTF-8:

print (post + " " + post_dict[post] + " #python").encode('utf8')

or convert to ASCII like this:

print (post + " " + post_dict[post] + " #python").encode('ascii', 'replace')

which replaces every character that cannot be represented in ASCII with ?.
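For comparison, 'replace' is only one of several standard error handlers that encode accepts. This small sketch (the sample text is made up) shows what a few of them produce for a title wrapped in curly quotes:

```python
text = u"\u201cpython\u201d"

results = {
    # substitute a question mark for each unencodable character
    "replace": text.encode("ascii", "replace"),
    # silently drop unencodable characters
    "ignore": text.encode("ascii", "ignore"),
    # keep the code point visible as a \uXXXX escape
    "backslashreplace": text.encode("ascii", "backslashreplace"),
    # emit an XML/HTML numeric character reference
    "xmlcharrefreplace": text.encode("ascii", "xmlcharrefreplace"),
}
```

All of these lose or transform information, so they suit log output better than text you intend to post; for the tweet itself, UTF-8 keeps the original characters.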

If you are printing for debugging purposes, another useful approach is to print the repr of the string:

print repr(post + " " + post_dict[post] + " #python")

which outputs something like this:

>>> s = u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
>>> print repr(s)
u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'

Answer 3

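The problem is probably caused by mixing byte strings and unicode strings during concatenation. As an alternative to prefixing every string literal with u, perhaps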

from __future__ import unicode_literals

will solve the problem for you. See here for a more in-depth explanation and to decide whether it is right for you.
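Whichever way the literals are written, the underlying idea is to keep everything as unicode until the moment of output. A rough sketch of that idea, with to_text and join_text as hypothetical helper names and UTF-8 assumed as the encoding of any incoming bytes:

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to unicode; pass unicode through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def join_text(*parts):
    """Join the parts with spaces after normalising each one to unicode."""
    return u" ".join(to_text(part) for part in parts)

# A unicode title, a byte-string URL and a plain literal, combined safely.
tweet = join_text(u"\u201cHi\u201d", b"http://example.com/1", "#python")
```

Because every part is decoded explicitly, no implicit ascii conversion happens during concatenation, and the result can be encoded once when it is printed or sent.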

Answer 1
3 2016-01-17 11:14:08

Answer 2
2 2016-01-17 11:00:48

Answer 3
1 2016-01-17 10:58:43
