
Can we define a set of DSL operations in Scala that run in parallel with each other, like pipeline processing in Linux?

Please forgive my poor English; I will do my best to express my question.

Suppose I want to process a large text: filter its content by a keyword, convert the matches to lowercase, and then print them to standard output. As we all know, we can do this with a pipeline in a Linux Bash script:

cat article.txt | grep "I" | tr "I" "i" > /dev/stdout

where cat article.txt, grep "I", and tr "I" "i" > /dev/stdout run in parallel.

In Scala, we would probably do it like this:

//or read from a text file , e.g. article.txt 
val strList = List("I", "am", "a" , "student", ".", "I", "come", "from", "China", ".","I","love","peace")  
strList.filter( _ == "I").map(_.toLowerCase).foreach(println)

My question is: how can we make filter, map and foreach run in parallel?

thx

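Parallel collections were added in Scala 2.9. To parallelize the chain, all you have to do is convert the collection by calling its par method. (Since Scala 2.13, parallel collections live in the separate scala-parallel-collections module, and par additionally needs import scala.collection.parallel.CollectionConverters._.)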

Your code would look like this:

val strList = List("I", "am", "a" , "student", ".", "I", "come", "from", "China", ".","I","love","peace")  // or read from a text file , e.g. article.txt 
strList.par.filter(_ == "I").map(_.toLowerCase).foreach(println)

If you change your List to an Iterator, you'll see that the filter/map/foreach calls are no longer grouped by operation.

Try this:

val strList = Iterator("I", "am", "a" , "student", ".", "I", "come", "from", "China", ".","I","love","peace")  
strList.filter { s => print("f"); s == "I" }.map { s => print("m"); s.toLowerCase }.foreach { s => print("p") }

You'll see :
fmpfffffmpfffffmpff

Instead of: fffffffffffffmmmppp

This is because applying a transformation to a List immediately returns a new List, whereas a transformation applied to an Iterator only runs when you traverse it (with foreach in this case).

tstenner's solution is probably the most efficient one in your situation, since it can achieve a high degree of parallelism (in theory, every single item could be processed in parallel). However, your bash example only uses pipeline parallelism, and this kind of parallelism is unfortunately not directly supported by Scala's parallel collections.

To achieve pipeline parallelism, your operators (filter, map, foreach) have to be executed by different threads, e.g. by using Actors.

In general, I think a simple API for that would be a nice feature for Scala. For your example, though, I doubt that pipeline parallelism would reduce your execution time by much. If you only use very simple filter and map operations, I assume that the communication overhead (for FIFOs / Actor mailboxes) would consume the whole speedup of the parallel execution.
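As a rough illustration of what running each operator on its own thread could look like, here is a minimal sketch that uses plain threads and bounded java.util.concurrent queues in place of Actors. The names PipelineSketch, Channel, stage and drain are made up for this example, and None is used as an assumed end-of-stream marker, similar to EOF on a Unix pipe.

```scala
import java.util.concurrent.{ArrayBlockingQueue, BlockingQueue}
import scala.collection.mutable.ListBuffer

object PipelineSketch {
  // A bounded FIFO between two stages; None marks end of stream (assumption of this sketch).
  type Channel = BlockingQueue[Option[String]]

  private def newChannel(): Channel = new ArrayBlockingQueue[Option[String]](16)

  // Runs `body` on its own named thread and returns the started thread.
  private def stage(name: String)(body: => Unit): Thread = {
    val t = new Thread(new Runnable { def run(): Unit = body }, name)
    t.start()
    t
  }

  // Takes items from `in` until the end-of-stream marker, calling `handle` on each.
  private def drain(in: Channel)(handle: String => Unit): Unit = {
    var next = in.take()
    while (next.isDefined) {
      handle(next.get)
      next = in.take()
    }
  }

  // source | filter | map | sink, with every stage on a separate thread.
  def run(words: Seq[String]): List[String] = {
    val raw      = newChannel()
    val filtered = newChannel()
    val lowered  = newChannel()
    val out      = ListBuffer.empty[String] // only touched by the sink thread

    val threads = List(
      stage("source") { words.foreach(w => raw.put(Some(w))); raw.put(None) },
      stage("filter") { drain(raw)(w => if (w == "I") filtered.put(Some(w))); filtered.put(None) },
      stage("map")    { drain(filtered)(w => lowered.put(Some(w.toLowerCase))); lowered.put(None) },
      stage("sink")   { drain(lowered)(w => out += w) }
    )
    threads.foreach(_.join()) // wait for the whole pipeline to finish
    out.toList
  }
}
```

Calling PipelineSketch.run(strList).foreach(println) would print the lowercased matches in their original order, since each stage is a single thread reading from a FIFO. Every element pays for several queue operations, which is the communication overhead mentioned above.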

Use a view:

val strList = List("I", "am", "a" , "student", ".", "I", "come", "from", "China", ".","I","love","peace")  // or read from a text file , e.g. article.txt 
strList.view.filter( _ == "I").map(_.toLowerCase).foreach(println)

Views store the operations on collections (filter and map in this case) and execute them only when you request elements from them (foreach in this case). So the first element, "I", goes through filter, map and println before "am" is even looked at, and so on.
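To see the difference in evaluation order, one could record which operation runs when. The following is a small sketch; ViewTrace and trace are hypothetical names, and the letters f, m and p stand for filter, map and foreach.

```scala
object ViewTrace {
  // Returns the order in which filter ('f'), map ('m') and foreach ('p') were invoked.
  def trace(words: List[String], lazily: Boolean): String = {
    val log = new StringBuilder
    val keep: String => Boolean   = s => { log.append('f'); s == "I" }
    val lower: String => String   = s => { log.append('m'); s.toLowerCase }
    val consume: String => Unit   = _ => { log.append('p'); () }

    if (lazily) words.view.filter(keep).map(lower).foreach(consume) // element by element
    else words.filter(keep).map(lower).foreach(consume)             // one full pass per operation

    log.toString
  }
}
```

For List("I", "am", "I"), the strict version yields "fffmmpp" and the view version yields "fmpffmp". Note that a view changes only the order of evaluation; everything still runs on a single thread.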

Create a single-argument function from your chain of functions, then apply it to a parallel collection. Note that println will not be called in the order of the original collection.

def fmp(xs: Seq[String]): Unit = {
  // Each element runs through the whole chain, possibly on a different thread.
  xs.par.foreach { x =>
    for {
      kw <- Option(x).filter(_ == "I") // keep only the keyword
      lc = kw.toLowerCase              // lowercase the whole string
    } println(lc)
  }
}
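If the original order matters, the same idea of fusing the chain into one per-item function can be sketched with Futures from the standard library instead of parallel collections. This is only an illustration under assumptions: OrderedParallel, step and process are hypothetical names, the global execution context is used, and the 10 second timeout is arbitrary.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object OrderedParallel {
  // The fused per-item chain: keep the keyword, then lowercase it.
  def step(x: String): Option[String] =
    Option(x).filter(_ == "I").map(_.toLowerCase)

  // Runs `step` on each item as a separate Future; traverse keeps the input order.
  def process(xs: List[String]): List[String] = {
    val all: Future[List[Option[String]]] = Future.traverse(xs)(x => Future(step(x)))
    Await.result(all, 10.seconds).flatten // drop the items that were filtered out
  }
}
```

Printing then happens after all items are done, for example OrderedParallel.process(strList).foreach(println), so the output order follows the input list.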

