ValueError ("No JSON object could be decoded") with Python 2.6 and utf-8

Question

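I am trying to write a mapper/reducer pair for Hadoop that counts the words in tweets, but I have run into a problem. My input is a JSON file of collected tweet information. I started by setting the default encoding to utf-8, but when I run the code I get the following error:

Traceback (most recent call last):
  File "./mapperworks2.py", line 211, in <module>
    my_json_dict = json.loads(line)
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

The code for the program is: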

#!/usr/bin/python

import sys
import json
import string

reload(sys)
sys.setdefaultencoding('utf8')

stop_words = ['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 "can't",
 'cannot',
 'could',
 "couldn't",
 'did',
 "didn't",
 'do',
 'does',
 "doesn't",
 'yourselves']

numbers = ["0","1","2","3","4","5","6","7","8","9"]

def clean_word(word):
    for c in string.punctuation:
        word = word.replace(c,"")
    for c in numbers:
        word = word.replace(c,"")
    return word

def dont_stop(word):
    if word in stop_words or word == "":
        return False
    else:
        return True



# input comes from STDIN (standard input)
for line in sys.stdin:
############
############
############
############
    my_json_dict = json.loads(line)
    line = my_json_dict['text'].lower()
############
############
############
############
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        ##################
        ##################
        word = clean_word(word)
        ##################
        ##################
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        ##################
        ##################
        if dont_stop(word):
            print '%s\t%s' % (word, 1)

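When I do not switch the encoding (that is, with reload(sys) and sys.setdefaultencoding() commented out), I get the following error instead:

Traceback (most recent call last):
  File "./mapperworks2.py", line 236, in <module>
    print '%s\t%s' % (word, 1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 3: ordinal not in range(128)

I'm not sure how to fix this; any help is appreciated.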

Answer 1

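See the discussion here: Setting the correct encoding when piping stdout in Python

The error comes from printing a Unicode string to the output. When stdout is a pipe, as it is under Hadoop streaming, Python 2 falls back to the ascii codec, which cannot encode characters such as u'\u2026'.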
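As a rough illustration of that point, the sketch below encodes each output record to UTF-8 explicitly before writing it, instead of changing the interpreter-wide default encoding. The helper names (`encode_pair`, `words_from_line`, `records`, `main`) are hypothetical and not from the original script, and the stop-word and punctuation filtering is left out for brevity. Skipping lines that fail to parse is an assumption about what may trigger the "No JSON object could be decoded" error (for example blank lines in the input); the question does not confirm that cause.

```python
import json
import sys


def encode_pair(word, count):
    """Return one tab-delimited output record as UTF-8 bytes."""
    # Build the record as Unicode text, then encode it explicitly so that
    # no implicit ASCII conversion happens when the record is written.
    return (u'%s\t%d\n' % (word, count)).encode('utf-8')


def words_from_line(line):
    """Hypothetical helper: lowercased words from one JSON tweet line."""
    try:
        tweet = json.loads(line)
    except ValueError:
        # Assumption: blank or non-JSON lines are skipped, not fatal.
        return []
    if not isinstance(tweet, dict):
        return []
    return tweet.get('text', u'').lower().split()


def records(lines):
    """Yield encoded (word, 1) records for every word in every line."""
    for line in lines:
        for word in words_from_line(line):
            yield encode_pair(word, 1)


def main():
    # Python 3 exposes the byte stream as sys.stdout.buffer;
    # on Python 2, sys.stdout accepts byte strings directly.
    out = getattr(sys.stdout, 'buffer', sys.stdout)
    for record in records(sys.stdin):
        out.write(record)
```

Calling `main()` at the bottom of the mapper script would take the place of the original `for line in sys.stdin:` loop. Because every record is already a byte string, the output no longer depends on the encoding Python guesses for a piped stdout.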
