Convert a Scala string to RDD[Seq[String]]
// 4 workers
val sc = new SparkContext("local[4]", "naivebayes")
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(_.split(" ").toSeq)
documents.zipWithIndex.foreach {
  case (e, i) =>
    val collectedResult = Tokenizer.tokenize(e.mkString)
}
val hashingTF = new HashingTF()
//pass collectedResult instead of document
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
In the above code snippet, I want to extract collectedResult so it can be reused for hashingTF.transform. How can this be achieved, given that the signature of the tokenize function is:
def tokenize(content: String): Seq[String] = {
...
}
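The body of tokenize is elided above. Purely for illustration, a hypothetical stand-in with that signature (the actual Tokenizer used in the question may do something entirely different) could be:

object Tokenizer {
  // Hypothetical implementation: lowercase the text and split on non-word characters.
  def tokenize(content: String): Seq[String] =
    content.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
}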
Looks like you want map rather than foreach. I don't understand what you're using zipWithIndex for, nor why you're calling split on your lines only to join them straight back up again with mkString.
val lines: RDD[String] = sc.textFile("/tmp/test.txt")
val tokenizedLines: RDD[Seq[String]] = lines.map(tokenize)

val hashingTF = new HashingTF()
// transform accepts an RDD of token sequences and returns an RDD[Vector]
val hashes: RDD[Vector] = hashingTF.transform(tokenizedLines)
hashes.cache()
...
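Putting the question and the answer together, here is a minimal end-to-end sketch of the TF-IDF pipeline with tokenization done once and the result reused, assuming Spark's RDD-based MLlib API and the illustrative Tokenizer.tokenize shown earlier:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val sc = new SparkContext("local[4]", "naivebayes")

// Tokenize each line exactly once; this RDD plays the role of the reusable collectedResult.
val tokenizedLines: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(Tokenizer.tokenize)

// Hash the token sequences into term-frequency vectors.
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(tokenizedLines)
tf.cache()

// Fit IDF on the cached TF vectors and produce TF-IDF vectors, as in the question.
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)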