Suppose I have key-value pairs, each consisting of a userId and a list of 0/1 integer flags indicating whether the user has each attribute:
userId  hasAttrA  hasAttrB  hasAttrC
joe     1         0         1
jack    1         1         0
jane    0         0         1
jeri    1         0         0
In Scala code, the data structure looks like:
var data = Array(("joe", List(1, 0, 1)),
                 ("jack", List(1, 1, 0)),
                 ("jane", List(0, 0, 1)),
                 ("jeri", List(1, 0, 0)))
I would like to compute, for each attribute, the fraction of all users that have it. This requires summing each attribute across all the keys, which I don't know how to do. So I would like to calculate:
data.size // 4
Should be: sum(hasAttrA) / data.size = 3/4 = 0.75
Should be: sum(hasAttrB) / data.size = 1/4 = 0.25
etc.
How can I compute the sums across all the keys, and how can I compute the final percentages?
EDIT 2/24/2016:
I can manually find the sums of individual columns like so:
var sumAttributeA = data.map{ case(id, attributeList) => attributeList(0)}.sum
var sumAttributeB = data.map{ case(id, attributeList) => attributeList(1)}.sum
var sumAttributeC = data.map{ case(id, attributeList) => attributeList(2)}.sum
var fractionAttributeA = sumAttributeA.toDouble/data.size
//fractionAttributeA: Double = 0.75
var fractionAttributeB = sumAttributeB.toDouble/data.size
//fractionAttributeB: Double = 0.25
One possible solution:
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.mllib.linalg.Vectors
val stats = sc.parallelize(data)
  .values
  .map(xs => Vectors.dense(xs.toArray.map(_.toDouble)))
  .aggregate(new MultivariateOnlineSummarizer)(_ add _, _ merge _)
(stats.count, stats.mean)
// (Long, org.apache.spark.mllib.linalg.Vector) = (4,[0.75,0.25,0.5])
You can also apply a similar operation manually:
val (total, sums) = sc.parallelize(data).values
  .map(vs => (1L, vs.map(_.toLong)))
  .reduce {
    case ((cnt1, vs1), (cnt2, vs2)) =>
      (cnt1 + cnt2, vs1.zip(vs2).map { case (x, y) => x + y })
  }
sums.map(_.toDouble / total)
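but in general it will have much worse numerical properties: summing everything first and dividing at the end can overflow or lose precision for large or floating-point inputs. For 0/1 flags summed as Long, as here, the sums themselves are exact.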
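For data small enough to fit in memory on one machine, the same per-attribute fractions can be sketched with plain Scala collections and no Spark. This is only an illustration, not part of the Spark solutions above. It assumes every user has the same number of flags, and the names `columns` and `fractions` are made up for this example.

```scala
// Same sample data as in the question.
val data = Array(
  ("joe",  List(1, 0, 1)),
  ("jack", List(1, 1, 0)),
  ("jane", List(0, 0, 1)),
  ("jeri", List(1, 0, 0)))

// Turn the per-user rows into per-attribute columns.
// Assumption: all flag lists have equal length; transpose fails otherwise.
val columns: List[List[Int]] = data.map(_._2).toList.transpose

// Sum each column and divide by the number of users.
val fractions: List[Double] = columns.map(_.sum.toDouble / data.length)
// fractions: List(0.75, 0.25, 0.5)
```

The result lines up with the order of the flags in each list, so the first entry is the fraction for hasAttrA, the second for hasAttrB, and so on.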