Is there any (hopefully out-of-the-box) way to traverse a Scala Stream in parallel?
For instance, see this Java 8 code:
String[] s = {"a","b","c","d","e"};
List<String> list = Arrays.asList(s);
list.parallelStream().forEach(System.out::println);
This will print all of the stream's contents in parallel. However, to my understanding, Streams in Scala are sequential.
Are there any workarounds for this?
EDIT: Please note that Streams allow us to process data as it arrives, and then discard it from memory once it is no longer needed. For instance:
"abcd".toStream.filter { x =>
  println(s"1 filter $x")
  x.toInt % 2 == 0
} // end of first block
.foreach { x =>
  println(s"2 filter->$x")
} // end of second block
will output something like this:
1 filter a
1 filter b
2 filter->b
1 filter c
1 filter d
2 filter->d
On the other hand, the code below will process the data in blocks, keeping intermediate results in memory at each transformation:
"abcd".toVector.par.filter { x =>
  println(s"1 filter $x")
  x.toInt % 2 == 0
} // end of first block
.foreach { x =>
  println(s"2 filter->$x")
} // end of second block
Output:
1 filter c
1 filter a
1 filter b
1 filter d
2 filter->b
2 filter->d
Many (most?) Scala collections have a par method that "returns a parallel implementation of this collection."
From the ScalaDocs:
"For most collection types, this method creates a new parallel collection by copying all the elements. For these collections, par takes linear time."
A Scala Stream has no direct parallel implementation, so you get a ParSeq instead, and since ParSeq is a trait, the REPL will instantiate it as a ParVector:
scala> Stream("a","b","c","d","e").par
res0: scala.collection.parallel.immutable.ParSeq[String] = ParVector(a, b, c, d, e)
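One consequence worth spelling out (a minimal sketch, assuming Scala 2.12, where .par is built in and Stream is the classic lazy stream): because par copies all the elements, calling it on a Stream forces the whole stream, so the element-by-element laziness shown in the question is lost.

```scala
// Assumption: Scala 2.12 (built-in .par, classic lazy Stream).
var evaluated = 0
val s = (1 to 5).toStream.map { x => evaluated += 1; x }
val p = s.par          // par copies every element, forcing the whole stream
assert(evaluated == 5) // all five elements have now been evaluated
println(p)             // a ParSeq containing 1..5
```

So converting a Stream with par buys you parallelism only at the cost of materializing the data first, exactly like the toVector.par example in the question.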
Also worth noting is the information elsewhere in the ScalaDocs:
The higher-order functions passed to certain operations may contain side-effects. Since implementations of bulk operations may not be sequential, this means that side-effects may not be predictable and may produce data-races, deadlocks or invalidation of state if care is not taken. It is up to the programmer to either avoid using side-effects or to use some form of synchronization when accessing mutable data.
So your foreach(println) code might have unpredictable or undesirable results.
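One way around this (a sketch, not the only option): keep the parallel step side-effect-free and do the printing sequentially afterwards. Parallel collections preserve the order of results in transformations like map; only the execution order of the side effects is nondeterministic.

```scala
val letters = Vector("a", "b", "c", "d", "e")
// map has no side effects here; the result keeps the original element order
val upper = letters.par.map(_.toUpperCase).seq  // .seq returns a sequential view
assert(upper == Vector("A", "B", "C", "D", "E"))
upper.foreach(println) // printing happens sequentially, in order
```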
You can use parallel collections:
import scala.collection.parallel.immutable.ParVector

val pv: ParVector[Int] = Vector(1, 2, 3, 4, 5, 6, 7, 8, 9).par
pv.foreach(println)
At the current time I'm aware of two possibilities that might usefully be pursued.
You should be able to use the Java 8 Stream API directly, assuming, of course, that you're running Scala on a JVM.
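For example (a sketch assuming Scala 2.12+, where Scala lambdas convert to Java SAM types such as java.util.function.Consumer):

```scala
import java.util.Arrays

// A Scala Array[String] erases to a Java String[], so Arrays.stream accepts it.
val s = Array("a", "b", "c", "d", "e")
Arrays.stream(s)
  .parallel()
  .forEach(x => println(x)) // runs in parallel; print order is not guaranteed
```

Note that java.util.stream.Stream is also lazy: intermediate operations like filter and map are not executed until a terminal operation such as forEach is invoked.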
Alternatively, you might investigate Apache Spark. I've only started tinkering with it, but as I understand it, while a large part of its power comes from sharding work across multiple machines, it also provides a parallel execution mode on a single machine. Design-wise it looks like a "Streams on steroids" thing, and it evaluates lazily when the data source allows it. I'll be pursuing this further myself, so any updates will be of interest to me too!