I'm new to Scala and have tried several things to convert an RDD[Array[(String,String)]] to an RDD[(String,String)].
What I want to achieve is to select two elements (text and category) from a JSON record. For every word in the text, I want to create a key/value pair of the form (word1, category), (word2, category), ....
My example looks like this:
import org.json4s._
import org.json4s.jackson.JsonMethods._
// Example JSON line: {"reviewText": "This was a gift!", "category": "Apps"}
val rdd = sc.textFile(PathToJSONFile)
rdd.map { row =>
  val json_row = parse(row)
  val myCategory = compact(json_row \ "category").toString
  val myText = compact(json_row \ "reviewText").toString.toLowerCase.split("[#&$!]").map(_.trim).filter(_.length > 1)
  myText.map { word => (word, myCategory) }
}
The output is of type org.apache.spark.rdd.RDD[Array[(String, String)]] and looks like this:
Array(Array((this,"Apps"), (was,"Apps"), (a,"Apps"), (gift,"Apps")))
But what I want to achieve is key/value pairs in the form of RDD[(String,String)] (where the key is a word and the value is the same category for every word in that line).
How can I achieve this? Many thanks!
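The suggestion from Psidom solved the problem: changing rdd.map to rdd.flatMap was the solution. With map, each input line yields one Array of pairs, so the result is an RDD of arrays; flatMap concatenates those per-line arrays into a single RDD[(String,String)].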
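To illustrate the difference between map and flatMap, here is a minimal sketch that uses plain Scala collections instead of Spark, so it runs without a cluster. The object name FlatMapSketch, the sample rows, and the word-splitting regex are assumptions made for illustration and are not taken from the original code; RDD.flatMap flattens per-row results in the same way.

```scala
object FlatMapSketch {
  // Hypothetical stand-in for already-parsed JSON rows: (reviewText, category).
  val rows: Seq[(String, String)] = Seq(
    ("This was a gift!", "Apps"),
    ("Works fine", "Tools")
  )

  // Split a text into lowercase words. Treating whitespace and a few symbols
  // as separators is an assumption chosen for this sketch.
  def words(text: String): Array[String] =
    text.toLowerCase.split("[\\s#&$!]+").map(_.trim).filter(_.nonEmpty)

  // map keeps one result per row, so the result is a collection of arrays.
  val nested: Seq[Array[(String, String)]] =
    rows.map { case (text, category) => words(text).map(w => (w, category)) }

  // flatMap concatenates the per-row arrays into one flat collection of pairs.
  val pairs: Seq[(String, String)] =
    rows.flatMap { case (text, category) => words(text).map(w => (w, category)) }
}
```

In the Spark version, the same change amounts to replacing rdd.map with rdd.flatMap and leaving the body of the function unchanged.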