Spark: Sum of values over all keys?

Suppose I have (key, value) pairs consisting of a userId and a list of 0/1 integers indicating whether the user has each attribute:

userId     hasAttrA  hasAttrB  hasAttrC
joe               1         0         1
jack              1         1         0
jane              0         0         1
jeri              1         0         0

In Scala code, the data structure looks like:

var data = Array(("joe",  List(1, 0, 1)),
                 ("jack", List(1, 1, 0)),
                 ("jane", List(0, 0, 1)),
                 ("jeri", List(1, 0, 0)))

I would like to compute the fraction of all users that have each attribute. However, this computation requires summing over all the keys, which I don't know how to do. So I would like to calculate:

  1. How many users are there?

data.size // 4

  2. What fraction of users has attribute A?

Should be: sum(hasAttrA) / data.size = 3/4 = 0.75

  3. What fraction of users has attribute B?

Should be: sum(hasAttrB) / data.size = 1/4 = 0.25

etc.

How can I compute the sums across all the keys, and how can I compute the final percentages?

EDIT 2/24/2016:

I can manually find the sums of individual columns like so:

var sumAttributeA = data.map{ case(id, attributeList) => attributeList(0)}.sum
var sumAttributeB = data.map{ case(id, attributeList) => attributeList(1)}.sum
var sumAttributeC = data.map{ case(id, attributeList) => attributeList(2)}.sum

var fractionAttributeA = sumAttributeA.toDouble/data.size
//fractionAttributeA: Double = 0.75
var fractionAttributeB = sumAttributeB.toDouble/data.size
//fractionAttributeB: Double = 0.25
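
These per-column passes can be collapsed into one by summing the attribute lists element-wise. A minimal sketch in plain Scala (no Spark needed for a local Array):

var columnSums = data.map{ case(id, attributeList) => attributeList }
                     .reduce((a, b) => a.zip(b).map{ case (x, y) => x + y })
//columnSums: List[Int] = List(3, 1, 2)

var fractions = columnSums.map(_.toDouble/data.size)
//fractions: List[Double] = List(0.75, 0.25, 0.5)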

One possible solution:

import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.mllib.linalg.Vectors

val stats = sc.parallelize(data)
  .values                                               // drop the userId keys
  .map(xs => Vectors.dense(xs.toArray.map(_.toDouble))) // List[Int] -> mllib Vector
  .aggregate(new MultivariateOnlineSummarizer)(_ add _, _ merge _)

(stats.count, stats.mean)
// (Long, org.apache.spark.mllib.linalg.Vector) = (4,[0.75,0.25,0.5])
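
Since the attribute values are all 0 or 1, the mean of each column is exactly the fraction of users with that attribute, so stats.mean already answers the question directly:

stats.mean(0) // fraction with attribute A: 0.75
stats.mean(1) // fraction with attribute B: 0.25
stats.mean(2) // fraction with attribute C: 0.5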

You can also apply a similar operation manually:

val (total, sums) = sc.parallelize(data).values
  .map(vs => (1L, vs.map(_.toLong)))  // pair each record with a count of 1
  .reduce { case ((cnt1, vs1), (cnt2, vs2)) =>
    (cnt1 + cnt2, vs1.zip(vs2).map{ case (x, y) => x + y })
  }

sums.map(_.toDouble / total)
// List(0.75, 0.25, 0.5)

but it will have much worse numerical properties (MultivariateOnlineSummarizer is implemented with a numerically stable online algorithm).
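
If you prefer to keep exact integer sums but avoid allocating a (1L, ...) tuple per record, here is a sketch using RDD.aggregate instead of map + reduce (it assumes exactly three attributes, for the zero value):

val (n, colSums) = sc.parallelize(data).values
  .aggregate((0L, List(0L, 0L, 0L)))(
    // fold one record into the running (count, sums) accumulator
    { case ((cnt, acc), vs) => (cnt + 1L, acc.zip(vs).map{ case (a, v) => a + v }) },
    // merge accumulators from different partitions
    { case ((c1, a1), (c2, a2)) => (c1 + c2, a1.zip(a2).map{ case (x, y) => x + y }) }
  )

colSums.map(_.toDouble / n)
// List(0.75, 0.25, 0.5)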
