
Apache Spark: applying an operation to a joined (.join) dataset

I am working with PySpark on MLlib clustering. In Python there are only two APIs: predict, which gives the cluster index for a point, and clusterCenters, which gives the cluster centers.
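For reference, a minimal sketch of the MLlib API being described, assuming k-means clustering (sc is the SparkContext provided by the PySpark shell; the data here is made up):

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)

model.predict([0.5, 0.5])   # cluster index for a single point
model.clusterCenters        # list of cluster center coordinates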

I have been asked to find the most densely populated cluster using the formula (number of points in the cluster) / (radius of the cluster)^2.

I have figured out how to find both values (the number of points in the cluster and the radius of the cluster). So now I have two datasets in (K, V) format: one carries (clusterValue, radius of the cluster) and the other has (clusterValue, number of points in the cluster).

I am stuck on how to compute the density from these two datasets. Is there a way to compute it using them?

I used the .join RDD transformation, which gave me the combined dataset (k, (v, w)), i.e. (clusterValue, (radius, number)), but I cannot figure out how to apply a function to a dataset of this shape. Please help me out if any of you have faced this issue before.
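For illustration, a small sketch of what that join produces (the RDD names and values here are made up):

radiusByCluster = sc.parallelize([(0, 1.5), (1, 2.0)])   # (clusterValue, radius)
countByCluster = sc.parallelize([(0, 100), (1, 40)])      # (clusterValue, number of points)
joined = radiusByCluster.join(countByCluster)             # (clusterValue, (radius, number))
joined.collect()   # [(0, (1.5, 100)), (1, (2.0, 40))]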

I am using Spark 1.1.1.

You can apply any function to your joined RDD using the .map transformation, for instance to divide the number by the radius:

kvw=[("X",(2.0,1.0)),("Y",(3.0,2.0))]
kvwRDD = sc.parallelize(kvw)
kvwRDD.map(lambda (k,(v,w)): (k, w/v))

This is covered in the Spark programming guide: http://spark.apache.org/docs/latest/programming-guide.html#basics. Applied to your joined RDD, the density (number / radius^2) for each cluster is:

densities = joined.map(
    lambda (cluster, (radius, number)): (cluster, number / radius / radius))
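From there, the most densely populated cluster is one more step, for example by reducing on the density value (a sketch, using the densities RDD above; RDD.reduce is available in Spark 1.1.1):

densities.reduce(lambda a, b: a if a[1] > b[1] else b)   # (cluster, density) pair with the highest density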
