ValueError("No JSON object could be decoded") using Python 2.6 and utf-8

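I'm trying to write a set of mapper/reducer code for Hadoop to count the words in tweets, but I've run into a problem. The input is a JSON file of collected tweet information. I started by setting the default encoding to utf-8, but when I run the code I get the following error: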

Traceback (most recent call last):
  File "./mapperworks2.py", line 211, in <module>
    my_json_dict = json.loads(line)
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

The code for the program is:

#!/usr/bin/python


import sys

import json

import string

reload(sys)
sys.setdefaultencoding('utf8')

stop_words = ['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 "can't",
 'cannot',
 'could',
 "couldn't",
 'did',
 "didn't",
 'do',
 'does',
 "doesn't",
 'yourselves']

numbers = ["0","1","2","3","4","5","6","7","8","9"]

def clean_word(word):
    for c in string.punctuation:
        word = word.replace(c,"")
    for c in numbers:
        word = word.replace(c,"")
    return word

def dont_stop(word):
    if word in stop_words or word == "":
        return False
    else:
        return True



# input comes from STDIN (standard input)
for line in sys.stdin:
############
############
############
############
    my_json_dict = json.loads(line)
    line = my_json_dict['text'].lower()
############
############
############
############
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        ##################
        ##################
        word = clean_word(word)
        ##################
        ##################
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        ##################
        ##################
        if dont_stop(word):
            print '%s\t%s' % (word, 1)

When I don't switch the encoding (that is, with reload(sys) and sys.setdefaultencoding() commented out), I get the following error instead:

Traceback (most recent call last):
  File "./mapperworks2.py", line 236, in <module>
    print '%s\t%s' % (word, 1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 3: ordinal not in range(128)

I'm not sure how to fix this. Any help is appreciated.

See the discussion here: Setting the correct encoding when piping stdout in Python

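Your error comes from trying to print a Unicode string to the output. When stdout is piped, Python 2 does not know the target encoding and falls back to ASCII, which cannot represent characters such as u'\u2026'.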
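
A minimal sketch of that idea, offered as one possible approach rather than a confirmed fix: encode each output record to UTF-8 explicitly instead of relying on the default encoding. The function names (tweet_text, format_pair, map_lines) are hypothetical and not part of the original script. The sketch also skips blank or non-JSON input lines, on the unconfirmed assumption that such a line is what triggers the ValueError from json.loads. The clean_word and dont_stop filters from the question are left out for brevity.

```python
import json


def tweet_text(line):
    """Return the lowercased 'text' field of one JSON line, or None.

    Assumption: blank or non-JSON lines should be skipped, not fatal.
    """
    line = line.strip()
    if not line:
        return None
    try:
        record = json.loads(line)
    except ValueError:  # json.loads raises this when nothing can be decoded
        return None
    if not isinstance(record, dict) or 'text' not in record:
        return None
    return record['text'].lower()


def format_pair(word, count=1):
    """Build one tab-delimited output record as UTF-8 encoded bytes."""
    return (u'%s\t%d\n' % (word, count)).encode('utf-8')


def map_lines(lines):
    """Yield one encoded (word, 1) record for each word in the input."""
    for line in lines:
        text = tweet_text(line)
        if text is None:
            continue
        for word in text.split():
            yield format_pair(word)
```

In the mapper, the records from map_lines(sys.stdin) would then be written with sys.stdout.write(record) on Python 2. On Python 3 the byte stream is sys.stdout.buffer. Because the bytes are already encoded, the 'ascii' codec is never consulted and sys.setdefaultencoding() is not needed.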
