How do I use reduceByKey with 3 values?

Question

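I am trying to iterate over an RDD built from a text file, count each unique word in the file, and then accumulate all the words that follow each unique word, along with their counts. This is what I have so far: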

import org.apache.spark.{SparkConf, SparkContext}

// Connect to the Spark driver
val conf = new SparkConf().setAppName("WordStats").setMaster("local")
val sparkContext = new SparkContext(conf) // Creates a new SparkContext object

// Load the specified file into an RDD
val lines = sparkContext.textFile(System.getProperty("user.dir") + "/" + "basketball_words_only.txt")

// Split each line into words and emit (word, nextWord, 1) for each adjacent pair
val words = lines.flatMap(line => {
  val wordList = line.split(" ")
  for {i <- 0 until wordList.length - 1}
    yield (wordList(i), wordList(i + 1), 1)
})

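In case it isn't clear so far, what I want is to accumulate, for each word in the file, the set of words that follow it, together with the number of times each of those words follows it, in the form: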

(PrecedingWord, (FollowingWord, numberOfTimesWordFollows))

The data type is (String, (String, Int)).

Answer 1

You probably want something along these lines:

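// Key each adjacent word pair on (word1, word2) so the counts can be summed per pair
(for {
  line <- lines
  Array(word1, word2) <- line.split("\\s+").sliding(2)
} yield ((word1, word2), 1))
  .reduceByKey(_ + _)
  // Re-nest into (PrecedingWord, (FollowingWord, count))
  .map { case ((word1, word2), count) => (word1, (word2, count)) }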
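To see what the key, sum, re-nest steps produce, here is a rough sketch of the same idea on a small in-memory Seq, with no Spark involved. The names PairCountSketch and followerCounts are made up for illustration, and groupBy plus sum is only a local stand-in for reduceByKey, not how Spark implements it.

```scala
object PairCountSketch {
  // Hypothetical helper: counts adjacent word pairs within each line and
  // returns them as (precedingWord, (followingWord, count)).
  def followerCounts(lines: Seq[String]): Seq[(String, (String, Int))] = {
    // Emit ((word1, word2), 1) for every adjacent pair; windows shorter
    // than two words (e.g. one-word or empty lines) are skipped by collect.
    val pairs: Seq[((String, String), Int)] = lines.flatMap { line =>
      line.trim.split("\\s+").sliding(2).collect {
        case Array(word1, word2) => ((word1, word2), 1)
      }
    }

    // Local stand-in for reduceByKey(_ + _): group on the composite key and sum.
    val counts: Map[(String, String), Int] =
      pairs.groupBy(_._1).map { case (key, hits) => key -> hits.map(_._2).sum }

    // Re-nest the composite key into the requested shape.
    counts.toSeq.map { case ((word1, word2), n) => (word1, (word2, n)) }
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq("the ball the net", "the ball")
    followerCounts(sample).foreach(println)
  }
}
```

The point of the composite key is that reduceByKey only works on two-element (key, value) tuples, so the three-element tuple from the question has to be regrouped as ((word1, word2), 1) before the counts can be summed.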

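By the way, you may want to make sure that each "line" in the lines RDD corresponds to one sentence, so that you don't count word pairs that cross sentence boundaries. Also, if you haven't considered it already, you could look at a natural language processing library such as OpenNLP or CoreNLP for sentence boundary detection, tokenization, and so on.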
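As a very rough illustration of the sentence-boundary point, the sketch below splits text on '.', '!' or '?' followed by whitespace and then tokenizes each sentence. This regex rule is a naive assumption of mine and will misfire on abbreviations and the like; it is not what OpenNLP or CoreNLP do. The names SentenceSplitSketch, toSentences and toWords are hypothetical, and lowercasing the words is an extra choice that the question does not ask for.

```scala
object SentenceSplitSketch {
  // Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
  def toSentences(text: String): Seq[String] =
    text.split("(?<=[.!?])\\s+").map(_.trim).filter(_.nonEmpty).toSeq

  // Splits on whitespace, strips leading/trailing punctuation, and lowercases
  // (lowercasing is an illustrative choice, not part of the original question).
  def toWords(sentence: String): Seq[String] =
    sentence
      .split("\\s+")
      .map(_.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "").toLowerCase)
      .filter(_.nonEmpty)
      .toSeq

  def main(args: Array[String]): Unit = {
    val text = "He shoots. He scores!"
    // Each sentence is tokenized separately, so no pair spans two sentences.
    toSentences(text).map(toWords).foreach(println)
  }
}
```

With Spark, something like lines.flatMap(toSentences) could feed sentences rather than raw lines into the pair counting, though sentences that wrap across physical lines of the file would still need separate handling.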
