
How to reproduce flatMap using foreach and concat in Scala?

I'm trying to recreate the flatMap function using foreach and List.concat but the resulting list seems unchanged.

Here's the reference:

val rdd: List[String] = List("Hello sentence one",
                             "This is the next sentence",
                             "The last sentence")
val fm: List[String] = rdd.flatMap(s => s.split("\\W"))
println(fm)

which gives:

List(Hello, sentence, one, This, is, the, next, sentence, The, last, sentence)

And here is my approach to recreate the same:

val nonRdd: List[String] = List("Hello sentence one",
                                "This is the next sentence",
                                "The last sentence")
var nonfm: List[String] = List()
nonRdd.foreach(line => List.concat(nonfm, line.split("\\W")))
println("nonfm: " + nonfm)

So each line is split into words, and the resulting words are supposed to be concatenated onto the previously initialized list nonfm.

However, nonfm is empty:

nonfm: List()

As I mentioned in the comments section, List in Scala defaults to scala.collection.immutable.List.

As the documentation states, concat returns a new list rather than mutating the original one (it couldn't anyway, since the list is immutable):

Returns a new sequence containing the elements from the left hand operand followed by the elements from the right hand operand.
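To make that concrete, here is a minimal sketch (the names original and combined are only illustrative) showing that List.concat leaves its arguments untouched and hands back a new list, which is lost unless it is assigned somewhere:

```scala
// List.concat builds a new list; its arguments are not modified.
val original: List[String] = List("Hello")
val combined: List[String] = List.concat(original, List("sentence", "one"))

println(original) // List(Hello)
println(combined) // List(Hello, sentence, one)
```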

So you need to update the variable on every iteration with a simple assignment:

val nonRdd: List[String] = List("Hello sentence one",
                                "This is the next sentence",
                                "The last sentence")
var nonfm: List[String] = List()
nonRdd.foreach(line => nonfm = List.concat(nonfm, line.split("\\W")))
println("nonfm: " + nonfm)

Based on the use of the word RDD, I am guessing you will eventually be using Spark. I hope you are simply experimenting to understand how things work, but please do not use mutable variables (var) in Spark, and preferably not in Scala in general. See @Avishek's answer for why they will break your program in Spark.
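If the goal is only to see how flatMap could be built from simpler pieces, one way to do it without a var is a fold. The following is a sketch, not the library's actual implementation; the names lines and words are made up for illustration:

```scala
// Rebuild flatMap with foldLeft: the accumulator replaces the mutable variable.
val lines: List[String] = List("Hello sentence one",
                               "This is the next sentence",
                               "The last sentence")

val words: List[String] =
  lines.foldLeft(List.empty[String]) { (acc, line) =>
    // Each step returns a new list: everything so far plus this line's words.
    acc ++ line.split("\\W").toList
  }

println(words)
// List(Hello, sentence, one, This, is, the, next, sentence, The, last, sentence)
```

Appending with ++ copies the accumulator on every step, so this is fine for experimenting but not something to prefer over flatMap itself.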

This is the expected behaviour in Spark. The variable var nonfm: List[String] = List() is defined on the driver. When you run nonRdd.foreach(line => List.concat(nonfm, line.split("\\W"))) on an RDD, each partition of nonRdd gets its own copy of nonfm.

When the foreach runs, the driver serializes the closure (that is, the code and the variables it references) and sends it to each RDD partition. These partitions might be processed on completely different machines.

Finally, when you do println("nonfm: " + nonfm), it prints the nonfm declared on the driver. This copy of the variable hasn't been mutated at all, so the result is empty.
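The copy semantics can be imitated without Spark. The sketch below is only a simulation under stated assumptions: it uses plain Java serialization and a mutable ListBuffer in place of the var, and shipCopy, onDriver and onExecutor are hypothetical names, not Spark APIs. It shows that changes made to a deserialized copy never reach the original held on the driver (master) side:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ListBuffer

// Hypothetical helper: serialize a value and read it back,
// roughly imitating what happens when state is shipped to another JVM.
def shipCopy[T <: java.io.Serializable](value: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(value)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  try in.readObject().asInstanceOf[T] finally in.close()
}

val onDriver: ListBuffer[String] = ListBuffer.empty[String]
val onExecutor: ListBuffer[String] = shipCopy(onDriver)

// The "executor" mutates its own copy only.
onExecutor ++= "Hello sentence one".split("\\W")

println(onDriver)   // ListBuffer()
println(onExecutor) // ListBuffer(Hello, sentence, one)
```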

