Spark with Python: saving RDD output to a text file

Question

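I am trying to solve the word count problem in Spark using Python, but I run into a problem when I try to save the output RDD to a text file with .saveAsTextFile. Here is my code. Please help me; I am stuck. Thanks for your time.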

import re

from pyspark import SparkConf , SparkContext

def normalizewords(text):
    return re.compile(r'\W+',re.UNICODE).split(text.lower())

conf=SparkConf().setMaster("local[2]").setAppName("sorted result")
sc=SparkContext(conf=conf)

input=sc.textFile("file:///home/cloudera/PythonTask/sample.txt")

words=input.flatMap(normalizewords)

wordsCount=words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)

sortedwordsCount=wordsCount.map(lambda (x,y):(y,x)).sortByKey()

results=sortedwordsCount.collect()

for result in results:
    count=str(result[0])
    word=result[1].encode('ascii','ignore')

    if(word):
        print word +"\t\t"+ count

results.saveAsTextFile("/var/www/myoutput")

Answer 1

Because you called results=sortedwordsCount.collect(), results is no longer an RDD. It is an ordinary Python list (a list of tuples, in this case).

As you know, list is a Python object/data structure, and append is its method for adding an element.

>>> x = []
>>> x.append(5)
>>> x
[5]

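Similarly, an RDD is a Spark object/data structure, and saveAsTextFile is its method for writing to a file. The important difference is that an RDD is a distributed data structure.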

So we cannot use append on an RDD, or saveAsTextFile on a list. collect is an RDD method that brings the contents of the RDD into the driver's memory.
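The failure can be reproduced without Spark at all. The sketch below uses a hand-written list as a hypothetical stand-in for what sortedwordsCount.collect() returns; the sample values and the output path are assumptions made only for illustration.

```python
# Hypothetical stand-in for: results = sortedwordsCount.collect()
# collect() hands back a plain Python list of (count, word) tuples.
results = [(1, u"spark"), (2, u"python")]

message = None
try:
    # A list has no saveAsTextFile method, so this raises AttributeError.
    results.saveAsTextFile("/tmp/myoutput")  # hypothetical path
except AttributeError as err:
    message = str(err)
    print(message)
```

The printed message names saveAsTextFile as a missing attribute of list, which is the same kind of error the question's last line runs into.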

As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and write results to it.
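The second option (writing the collected list from the driver) might look roughly like the sketch below. The sample results list and the output_path are hypothetical placeholders, not values from the question; in the real script, results would come from sortedwordsCount.collect().

```python
import io
import os
import tempfile

# Hypothetical stand-in for: results = sortedwordsCount.collect()
results = [(1, u"spark"), (2, u"python"), (3, u"")]

# Hypothetical output location; the question used /var/www/myoutput.
output_path = os.path.join(tempfile.gettempdir(), "myoutput.txt")

# io.open behaves the same on Python 2 and 3 and handles the encoding.
with io.open(output_path, "w", encoding="utf-8") as out:
    for count, word in results:
        if word:  # skip the empty token that the regex split can produce
            out.write(u"{0}\t{1}\n".format(word, count))
```

This only makes sense when the collected result fits comfortably in the driver's memory, since collect pulls every element to the driver first.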

Answer 2

Change results=sortedwordsCount.collect() to results=sortedwordsCount, because .collect() returns a list rather than an RDD.
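Keeping the data as an RDD and calling saveAsTextFile on it could be sketched as follows. The function name save_sorted_counts and its parameters are assumptions introduced for illustration; sc is expected to be an already created SparkContext, and both paths are placeholders.

```python
import re


def normalize_words(text):
    # Same tokenization as in the question: lowercase, split on non-word characters.
    return re.compile(r'\W+', re.UNICODE).split(text.lower())


def save_sorted_counts(sc, input_path, output_path):
    """Count words and save them sorted by count, without calling collect().

    sc is an existing SparkContext; input_path and output_path are
    hypothetical placeholders supplied by the caller.
    """
    words = sc.textFile(input_path).flatMap(normalize_words).filter(lambda w: w)
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
    # Swap to (count, word) so that sortByKey orders by count.
    sorted_counts = counts.map(lambda kv: (kv[1], kv[0])).sortByKey()
    # Format each pair as a line of text before saving.
    lines = sorted_counts.map(lambda kv: u"{0}\t{1}".format(kv[1], kv[0]))
    lines.saveAsTextFile(output_path)  # still an RDD, so the method exists


# Usage (requires pyspark, not executed here):
#   from pyspark import SparkConf, SparkContext
#   sc = SparkContext(conf=SparkConf().setMaster("local[2]").setAppName("sorted result"))
#   save_sorted_counts(sc, "file:///home/cloudera/PythonTask/sample.txt", "file:///var/www/myoutput")
```

Note that saveAsTextFile writes a directory of part files rather than a single file, and it fails if the output directory already exists.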

Answer 1: 8, accepted, 2015-12-04 11:26:33

Answer 2: 1, 2020-03-31 01:50:46
