rdd.first() does not give an error but rdd.collect() does
I am working in PySpark and have the following code, in which I process tweets and build an RDD of (user_id, text). The code is below:
"""
# Construct an RDD of (user_id, text) here.
"""
import json

def safe_parse(raw_json):
    try:
        json_object = json.loads(raw_json)
        if 'created_at' in json_object:
            return json_object
        else:
            return
    except ValueError as error:
        return

def get_usr_txt(line):
    tmp = safe_parse(line)
    return (tmp.get('user').get('id_str'), tmp.get('text'))

usr_txt = text_file.map(lambda line: get_usr_txt(line))
print(usr_txt.take(5))
and the output looks fine (shown below):
[('470520068', "I'm voting 4 #BernieSanders bc he doesn't ride a CAPITALIST PIG adorned w/ #GoldmanSachs $. SYSTEM RIGGED CLASS WAR "), ('2176120173', "RT @TrumpNewMedia: .@realDonaldTrump #America get out & #VoteTrump if you don't #VoteTrump NOTHING will change it's that simple!\n#Trump htt…"), ('145087572', 'RT @Libertea2012: RT TODAY: #Colorado’s leading progressive voices to endorse @BernieSanders! #Denver 11AM - 1PM in MST CO State Capitol…'), ('23047147', '[VID] Liberal Tears Pour After Bernie Supporter Had To Deal With Trump Fans '), ('526506000', 'RT @justinamash: .@tedcruz is the only remaining candidate I trust to take on what he correctly calls the Washington Cartel. ')]
However, as soon as I run
print (usr_txt.count())
I get the following error:
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-60-9dacaf2d41b5> in <module>()
      8 usr_txt = text_file.map(lambda line: get_usr_txt(line))
      9 #print (usr_txt.take(5))
---> 10 print (usr_txt.count())
     11

/usr/local/spark/python/pyspark/rdd.py in count(self)
   1054         3
   1055         """
-> 1056         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1057
   1058     def stats(self):
What am I missing? Is the RDD not created correctly, or is it something else? How can I fix it?
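Your safe_parse method returns None when the parsed JSON line has no created_at element or when parsing fails. Pulling elements out of the parsed JSON with (tmp.get('user').get('id_str'), tmp.get('text')) then fails, because None has no get method. That is what causes the error.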
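The failure can be reproduced without Spark. The following is a minimal sketch that reuses safe_parse from the question (lightly restructured); the two sample lines are invented for illustration and are not taken from the original data set.

```python
import json

def safe_parse(raw_json):
    """Return the parsed tweet, or None if the line is unusable."""
    try:
        json_object = json.loads(raw_json)
    except ValueError:
        return None
    if 'created_at' in json_object:
        return json_object
    return None

# Hypothetical input lines (assumptions, not real data).
good_line = '{"created_at": "x", "user": {"id_str": "1"}, "text": "hi"}'
bad_line = '{"delete": {"status": {"id": 1}}}'  # no created_at key

print(safe_parse(good_line) is None)  # False
print(safe_parse(bad_line) is None)   # True

try:
    safe_parse(bad_line).get('user')
except AttributeError as error:
    # None has no .get method, so the lookup itself fails.
    print(type(error).__name__)  # AttributeError
```

Inside a Spark job the same Python exception is raised on a worker and reaches the driver wrapped in a Py4JJavaError.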
The solution is to check for None in the get_usr_txt method:
def get_usr_txt(line):
    tmp = safe_parse(line)
    # safe_parse returns None for lines that cannot be used; skip those.
    if tmp is not None:
        return (tmp.get('user').get('id_str'), tmp.get('text'))
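Note that with this check, get_usr_txt still returns None for unusable lines, so the mapped RDD contains None elements and count() includes them. If those records should be dropped instead, one option is to return a list of zero or one tuples and use flatMap. The sketch below assumes that goal; get_usr_txt_list is a hypothetical helper name, the sample lines are invented, and a list comprehension stands in for the Spark call so the example runs on its own.

```python
import json

def safe_parse(raw_json):
    """Return the parsed tweet, or None if the line is unusable."""
    try:
        json_object = json.loads(raw_json)
    except ValueError:
        return None
    if 'created_at' in json_object:
        return json_object
    return None

def get_usr_txt_list(line):
    """Return [(user_id, text)] for a usable tweet, else an empty list."""
    tmp = safe_parse(line)
    if tmp is None:
        return []
    # Assumption: a tweet without a 'user' object yields a None user id.
    user = tmp.get('user') or {}
    return [(user.get('id_str'), tmp.get('text'))]

# Hypothetical input lines (assumptions, not real data).
lines = [
    '{"created_at": "x", "user": {"id_str": "1"}, "text": "hi"}',
    '{"delete": {"status": {"id": 1}}}',
    'not json at all',
]

# Plain-Python stand-in for text_file.flatMap(get_usr_txt_list).collect()
pairs = [pair for line in lines for pair in get_usr_txt_list(line)]
print(pairs)  # [('1', 'hi')]
```

In PySpark this would be used as usr_txt = text_file.flatMap(get_usr_txt_list), after which usr_txt.count() counts only the tweets that parsed.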
Now the question is why print(usr_txt.take(5)) shows results while print(usr_txt.count()) raises an error.
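That is because usr_txt.take(5) only needs the first five elements of the RDD. Spark evaluates transformations lazily, so take(5) stops once it has five results and does not have to run get_usr_txt on the remaining lines, which here means it never meets a None. usr_txt.count() has to process every line, so it reaches the lines for which safe_parse returns None and fails there.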
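The same effect can be seen with ordinary Python iterators. This is only an analogy, not Spark's actual implementation: map() is lazy, islice plays the role of take(5), and summing over the whole iterator plays the role of count(). The data and the name get_first_field are made up for illustration.

```python
from itertools import islice

def get_first_field(record):
    # Fails on None, like tmp.get(...) in the question.
    return record.get('text')

# Hypothetical data: five good records followed by a bad one.
records = [{'text': 't%d' % i} for i in range(5)] + [None]

# map() is lazy, so islice only pulls the first five records
# and the function is never called on the sixth.
first_five = list(islice(map(get_first_field, records), 5))
print(first_five)  # ['t0', 't1', 't2', 't3', 't4']

# Counting has to consume every record, so it reaches the bad one.
try:
    total = sum(1 for _ in map(get_first_field, records))
except AttributeError:
    total = None
print(total)  # None
```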