
Parallel Stream in Scala

Is there any (hopefully out-of-the-box) way to traverse a Scala stream in parallel?

For instance, see this Java 8 code:

String[] s = {"a","b","c","d","e"};
List<String> list = Arrays.asList(s);
list.parallelStream().forEach(System.out::println);

This will print all the list's contents in parallel. However, to my understanding, streams in Scala are sequential.

Are there any workarounds for this?

EDIT: Please note that streams allow us to process data as it arrives and then, once an element is no longer needed, drop it from memory. For instance:

"abcd".toStream.filter { x => 
  println(s"1 filter $x")  
   if(x.toInt%2==0) true;else false;
  } //end of first block
  .foreach { x => 
  println(s"2 filter->$x")  
  } //end of second block

will output something like this:

1 filter a
1 filter b
2 filter->b
1 filter c
1 filter d
2 filter->d

On the other hand, the code below processes data in blocks, keeping intermediate results in memory at each transformation:

  "abcd".toVector.par.filter { x => 
  println(s"1 filter $x")  
   if(x.toInt%2==0) true;else false;
  } //end of first block
  .foreach { x => 
  println(s"2 filter->$x")  
  } //end of second block

Output:

1 filter c
1 filter a
1 filter b
1 filter d
2 filter->b
2 filter->d

Many (most?) Scala collections have a par method that "returns a parallel implementation of this collection."

From the ScalaDocs:

For most collection types, this method creates a new parallel collection by copying all the elements. For these collections, par takes linear time.

A Scala Stream has no direct parallel implementation, so calling par gives you a ParSeq instead, and since ParSeq is a trait, the concrete instance you see in the REPL is a ParVector:

scala> Stream("a","b","c","d","e").par
res0: scala.collection.parallel.immutable.ParSeq[String] = ParVector(a, b, c, d, e)
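
A side consequence worth spelling out (this is my own small illustrative sketch, assuming Scala 2.12 where Stream and the parallel collections are in the standard library): because par copies the elements, the Stream's laziness is lost at that point; every element is evaluated when the copy is made.

// Sketch: .par copies the stream into a parallel collection eagerly.
val lazyStream = Stream.from(1).take(5)  // elements beyond the head are not evaluated yet
val parallel   = lazyStream.par          // copying forces all five elements
println(parallel)                        // ParVector(1, 2, 3, 4, 5)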

Also worth noting is the information elsewhere in the ScalaDocs:

The higher-order functions passed to certain operations may contain side-effects. Since implementations of bulk operations may not be sequential, this means that side-effects may not be predictable and may produce data-races, deadlocks or invalidation of state if care is not taken. It is up to the programmer to either avoid using side-effects or to use some form of synchronization when accessing mutable data.

So your foreach(println) code might have unpredictable/undesirable results.
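
For illustration only (not part of the ScalaDocs quote above), here is a minimal sketch, assuming a Scala version where parallel collections ship with the standard library (pre-2.13) or the scala-parallel-collections module is on the classpath, contrasting a side-effecting foreach with a side-effect-free alternative:

import scala.collection.parallel.immutable.ParVector

val pv = ParVector("a", "b", "c", "d", "e")

// Side-effecting traversal: the print order may differ from run to run.
pv.foreach(println)

// Side-effect-free alternative: transform in parallel, then switch back
// to a sequential view with .seq so the printing order is deterministic.
pv.map(_.toUpperCase).seq.foreach(println)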

You can use parallel collections

import scala.collection.parallel.immutable.ParVector

// Construct a parallel vector directly...
val empty = new ParVector[Int]

// ...or convert an existing collection with .par
val pv = Vector(1, 2, 3, 4, 5, 6, 7, 8, 9).par

pv.foreach(x => println(x))

At the current time I'm aware of two possibilities that might usefully be pursued.

You should be able to use the Java 8 Stream API directly, assuming, of course, that you're running Scala on a JVM.
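
For example, something along these lines should work (a minimal sketch, assuming Scala 2.12 or later so that a Scala lambda converts to the java.util.function.Consumer that forEach expects):

import java.util.Arrays

// Build a Java list and use its parallel stream directly from Scala.
val list = Arrays.asList("a", "b", "c", "d", "e")

// forEach takes a java.util.function.Consumer; on Scala 2.12+ the lambda
// below is converted automatically.
list.parallelStream().forEach(x => println(x))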

Alternatively, I think you might investigate Apache Spark. I've only started tinkering with it, but as I understand it, while a large part of its power comes from sharding work across multiple machines, it also provides a parallel execution mode on a single machine. Design-wise it looks like "Streams on steroids", and it appears to evaluate lazily if your data source allows it. I'll be pursuing this further myself, so any updates will be of interest to me too!
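
To make the single-machine case concrete, here is a minimal sketch of my own (assuming the spark-sql dependency is on the classpath); master("local[*]") runs Spark on all cores of one machine:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-demo")
  .master("local[*]")    // single machine, all available cores
  .getOrCreate()

// parallelize splits the data into partitions processed in parallel.
val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c", "d", "e"))
rdd.foreach(println)     // output order is not deterministic

spark.stop()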
