
Apache Spark: applying an operation to a joined (.join) dataset

I am working with PySpark on MLlib clustering. In Python there are only two APIs: predict, which gives the cluster index for a point, and clusterCenters, which gives the cluster centers.
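For reference, a minimal sketch of the MLlib API being described, assuming k-means clustering (sc is the SparkContext provided by the PySpark shell; the data here is made up):

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)

model.predict([0.5, 0.5])   # cluster index for a single point
model.clusterCenters        # list of cluster center coordinates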

I have been asked to find the most densely populated cluster using the formula (number of points in the cluster) / (radius of the cluster)^2.

I have figured out how to find both values (the number of points in the cluster and the radius of the cluster). So now I have two datasets in (K, V) format: one carries (clusterValue, radius of the cluster) and the other has (clusterValue, number of points in the cluster).

I am stuck on how to compute the density from these two datasets. Is there a way to compute it using them?

I used the .join RDD transformation, which gave me the combined dataset (k, (v, w)), i.e. (clusterValue, (radius, number)), but I cannot figure out how to apply a function to a dataset of this shape. Please help me out if any of you have faced this issue before.
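For illustration, a small sketch of what that join produces (the RDD names and values here are made up):

radiusByCluster = sc.parallelize([(0, 1.5), (1, 2.0)])   # (clusterValue, radius)
countByCluster = sc.parallelize([(0, 100), (1, 40)])      # (clusterValue, number of points)
joined = radiusByCluster.join(countByCluster)             # (clusterValue, (radius, number))
joined.collect()   # [(0, (1.5, 100)), (1, (2.0, 40))]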

I am using Spark 1.1.1.

You can apply any function to your joined RDD using the .map transformation, for instance to divide the number by the radius:

kvw=[("X",(2.0,1.0)),("Y",(3.0,2.0))]
kvwRDD = sc.parallelize(kvw)
kvwRDD.map(lambda (k,(v,w)): (k, w/v))

This is covered in the Spark programming guide: http://spark.apache.org/docs/latest/programming-guide.html#basics. Applied to your joined RDD, the density (number / radius^2) for each cluster is:

densities = joined.map(
    lambda (cluster, (radius, number)): (cluster, number / radius / radius))
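From there, the most densely populated cluster is one more step, for example by reducing on the density value (a sketch, using the densities RDD above; RDD.reduce is available in Spark 1.1.1):

densities.reduce(lambda a, b: a if a[1] > b[1] else b)   # (cluster, density) pair with the highest density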
