
Spark Scala: the combinations method

I have a problem with the combinations method.

My code:

    scala> val myRDD = sc.parallelize(Seq("aaa bbb bbb"))
    myRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27

    scala> myRDD.foreach{println}
    aaa bbb bbb

    scala> myRDD.map(_.split(" ")).flatMap(_.combinations(2)).
         | map(p=>(p.mkString(","),1)).
         | reduceByKey(_+_).
         | foreach{println}
    (aaa,bbb,1)
    (bbb,bbb,1)

I don't understand why the output is not:

    (aaa,bbb,2)
    (bbb,aaa,2)
    (bbb,bbb,1)

In the combinations function, a combination of length n is a subsequence of the original sequence, with the elements taken in order. So in your case, for (aaa,bbb,bbb) the possible subsequences are (aaa,bbb) and (bbb,bbb), but not (bbb,aaa).

Please refer to the Scala documentation.
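
The behaviour can be checked without Spark, since `combinations` is an ordinary Scala collection method. The following is a minimal sketch; the object name `CombinationsDemo` and its fields are made up for illustration, and the input is the same string as in the question.

```scala
// Hypothetical demo object; only split, combinations and mkString come from the question.
object CombinationsDemo {
  // Same input as the question, split into words.
  val words: Array[String] = "aaa bbb bbb".split(" ")

  // Each length-2 combination, joined with a comma as in the question.
  val pairs: List[String] = words.combinations(2).map(_.mkString(",")).toList

  // The example from the Scala documentation: length-2 combinations of "xyy".
  val docExample: List[String] = "xyy".combinations(2).toList

  def main(args: Array[String]): Unit = {
    pairs.foreach(println)      // two lines: aaa,bbb and bbb,bbb
    docExample.foreach(println) // two lines: xy and yy
  }
}
```

Neither list contains a reversed pair such as `bbb,aaa`, and each distinct pair appears once, so `reduceByKey` has nothing to add up.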

The Scala documentation covers this pretty well, I think:

Iterates over combinations. A combination of length n is a subsequence of the original sequence, with the elements taken in order. Thus, "xy" and "yy" are both length-2 combinations of "xyy", but "yx" is not. If there is more than one way to generate the same subsequence, only one will be returned.

For example, "xyyy" has three different ways to generate "xy" depending on whether the first, second, or third "y" is selected. However, since all are identical, only one will be chosen. Which of the three will be taken is an implementation detail that is not defined.

In your specific case this breaks down to something like:

(aaa, bbb)
(aaa, bbb) //Thrown out since it duplicates the first
(bbb, bbb)
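
If the goal is to count every positional pair, including the duplicate that `combinations` throws out, one option is to pair up indices rather than values. This is only a sketch of that idea, not something the question or the documentation prescribes; `PositionalPairs` and its fields are hypothetical names.

```scala
// Hypothetical helper: pairs elements by position so equal values are not merged.
object PositionalPairs {
  val words: Vector[String] = Vector("aaa", "bbb", "bbb")

  // Every pair (i, j) with i < j, keeping the original order of the elements.
  val pairs: Seq[(String, String)] =
    for {
      i <- words.indices
      j <- (i + 1) until words.length
    } yield (words(i), words(j))

  // Count how many positions produce each pair.
  val counts: Map[(String, String), Int] =
    pairs.groupBy(identity).map { case (pair, hits) => pair -> hits.size }

  def main(args: Array[String]): Unit =
    counts.foreach(println)
}
```

With this input, (aaa,bbb) is counted twice and (bbb,bbb) once. A reversed pair such as (bbb,aaa) still never appears, because elements are always taken in order.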
