Convert a Scala string to RDD[Seq[String]]
// 4 workers
val sc = new SparkContext("local[4]", "naivebayes")
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(_.split(" ").toSeq)
documents.zipWithIndex.foreach {
  case (e, i) =>
    val collectedResult = Tokenizer.tokenize(e.mkString)
}
val hashingTF = new HashingTF()
//pass collectedResult instead of document
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
In the above code snippet, I want to extract collectedResult so it can be reused for hashingTF.transform. How can this be achieved, given that the signature of the tokenize function is:
def tokenize(content: String): Seq[String] = {
...
}
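The body of tokenize is elided above. Purely for illustration, a hypothetical stand-in with that signature (the actual Tokenizer used in the question may do something entirely different) could be:

object Tokenizer {
  // Hypothetical implementation: lowercase the text and split on non-word characters.
  def tokenize(content: String): Seq[String] =
    content.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
}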
Looks like you want map rather than foreach. I don't understand what you're using zipWithIndex for, nor why you're calling split on your lines only to join them straight back up again with mkString.
val lines: RDD[String] = sc.textFile("/tmp/test.txt")
val tokenizedLines: RDD[Seq[String]] = lines.map(tokenize)

val hashingTF = new HashingTF()
// transform accepts an RDD of token sequences and returns an RDD[Vector]
val hashes: RDD[Vector] = hashingTF.transform(tokenizedLines)
hashes.cache()
...
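Putting the question and the answer together, here is a minimal end-to-end sketch of the TF-IDF pipeline with tokenization done once and the result reused, assuming Spark's RDD-based MLlib API and the illustrative Tokenizer.tokenize shown earlier:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val sc = new SparkContext("local[4]", "naivebayes")

// Tokenize each line exactly once; this RDD plays the role of the reusable collectedResult.
val tokenizedLines: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(Tokenizer.tokenize)

// Hash the token sequences into term-frequency vectors.
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(tokenizedLines)
tf.cache()

// Fit IDF on the cached TF vectors and produce TF-IDF vectors, as in the question.
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)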