
Splitting an RDD[String] type text to RDD[String] type words (Scala, Apache Spark)

I'm working with Apache Spark and Scala and have an RDD[String] containing the lines of a text. I'd like to split it into words (i.e. split it at every space) and get out another RDD[String] consisting of the separate words.

I've tried splitting the text at every space, but I don't know how to convert the resulting Array[String] to an RDD[String].

val lines = sc.textFile(filename)

val words = lines.map(line => line.split(' '))

I've also tried

val words = lines.flatMap(line => line.split(' ')).collect()

but I still get an Array[String] out.

As a different approach, I've tried getting the indices of the spaces and then splitting the lines at those indices, but I hit a wall every time: each line has a different number of spaces in different places, and I can't get the Array[Int] out of the RDD[Array[Int]].

val spaces = lines.map(line => line.zipWithIndex.filter(_._1 == ' ').map(_._2))

Can anyone help?
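Taking the index-based attempt first: it can be made to work, but only if each line's space indices stay local to a single transformation, so the RDD[Array[Int]] never needs to be unpacked on the driver. A minimal sketch (assuming words are separated by single spaces, and reusing the lines RDD from the question):

val wordsViaIndices = lines.flatMap { line =>
  // this line's space positions, computed locally inside the transformation
  val spaces = line.zipWithIndex.filter(_._1 == ' ').map(_._2)
  // add virtual delimiters just before the first word and just after the last
  val bounds = -1 +: spaces :+ line.length
  // each adjacent pair of delimiters encloses exactly one word
  bounds.sliding(2).map { case Seq(start, end) => line.substring(start + 1, end) }
}

That said, the simple and idiomatic route is flatMap, as follows.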

Use flatMap if your map operation returns a collection but you want to flatten the result into an RDD of all the individual elements.

val words = lines.flatMap(line => line.split(' '))

will turn lines into an RDD[String] where each string in the RDD is an individual word. split returns an array of all the words, but because it's in a flatMap the results are "flattened" out into the individual elements.

You already had this, but you added a collect() at the end. collect() takes all the data from an RDD and loads it into an Array on the driver. In other words, it turns an RDD into an Array. If you want things to stay in the RDD, all you need to do is not call collect().

val lines = sc.parallelize(List("there are", "some words"), 2)

val words1 = lines.map(l => l.split(" ")) // words1: RDD[Array[String]]; words1.collect => Array(Array(there, are), Array(some, words))

val words2 = lines.flatMap(_.split(" ")) // words2: RDD[String]; words2.collect => Array(there, are, some, words)

There are two types of Spark operations: transformations and actions. Transformations are lazily evaluated, whereas actions return a final result to the driver program or write it out to the file system. You should keep this in mind when working with a large dataset.
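A minimal sketch of that distinction (the file name here is just a placeholder):

val lines = sc.textFile("input.txt")     // transformation: lazy, nothing is read yet
val words = lines.flatMap(_.split(' '))  // transformation: still lazy, nothing runs
val count = words.count()                // action: only now is the file read and split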

When we read a file with sparkContext.textFile we already have an RDD[String].
In your case, with

val lines = sc.textFile(filename)

you already have an RDD[String].
And the map function

val words = lines.map(line => line.split(' '))

splits each String of the RDD[String] into an Array, thus turning it into an RDD[Array[String]].
You still have an RDD.
Now, in case you were hoping for an RDD[RDD[String]], you might be tempted to try

val words = lines.map(line => sparkContext.parallelize(line.split(' ')))

but that does not actually work: SparkContext lives only on the driver and cannot be used inside a transformation, so nested RDDs are not supported.

And flatMap outputs every split word as a separate element, so

val words = lines.flatMap(line => line.split(' '))

is of type RDD[String].
And

collect() turns the RDD[String] into an Array[String].
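Putting the whole thread together (a minimal sketch; filename is the same placeholder as in the question):

import org.apache.spark.rdd.RDD

val lines: RDD[String] = sc.textFile(filename)        // one element per line
val words: RDD[String] = lines.flatMap(_.split(' '))  // one element per word
// call collect() only if the result fits in the driver's memory:
val wordArray: Array[String] = words.collect()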
