
Euclidean distance between two points in PySpark

I have defined a function in PySpark to calculate the Euclidean distance between my centroids and a set of points.

def dist(x):
  b = {'d1':distance.euclidean((6,8),x),'d2':distance.euclidean((1,2),x),'d3':distance.euclidean((5,5),x)}
  def get_key(val):
    for key, value in b.items():
      if val == value:
        return key
  print(get_key(min(b.values())))

My points are as follows

data = [(3.023, 5.138), (3.075, 4.989), (2.321, 5.35), (3.328, 4.944), (3.195, 5.186)]

My objective is to feed all of these points into my function and get the nearest centroid for each point. A hypothetical example of the output I am expecting looks like this:

[((3.023, 5.138),d1),
 ((3.075, 4.989),d1),
 ((2.321, 5.35),d2),
 ((3.328, 4.944),d1),
 ((3.195, 5.186),d3)]

When I feed individual points into this function it works perfectly. However, when I try to do this for multiple points using a lambda function, I get None instead of the centroids.

data.map(lambda x:(x,dist((x)))).take(5)

Out[17]: [((3.023, 5.138), None),
 ((3.075, 4.989), None),
 ((2.321, 5.35), None),
 ((3.328, 4.944), None),
 ((3.195, 5.186), None)]

What am I doing wrong here? I would appreciate some help.

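Your function dist doesn't return anything. It calls print, which writes the key to standard output and itself returns None, so dist implicitly returns None. That None is what ends up as the second element of each tuple produced by map.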

Change the print to return and I suspect you will be happier.
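As a rough illustration of that change, here is a minimal sketch of a version of dist that returns the label of the nearest centroid. A few assumptions: the question's `distance` is presumably `scipy.spatial.distance`, and `math.dist` (Python 3.8+) is swapped in here only to keep the example self-contained; the `CENTROIDS` dictionary is a hypothetical name introduced for readability; and a plain list comprehension stands in for the RDD `map` so the snippet runs without a Spark session.

```python
import math

# Centroids and labels copied from the question; CENTROIDS is a hypothetical name.
CENTROIDS = {"d1": (6, 8), "d2": (1, 2), "d3": (5, 5)}


def dist(x):
    """Return the label of the centroid nearest to point x."""
    # math.dist replaces distance.euclidean here (assumption: same Euclidean metric).
    # min over the dict keys, ordered by distance, replaces the get_key lookup.
    return min(CENTROIDS, key=lambda label: math.dist(CENTROIDS[label], x))


data = [(3.023, 5.138), (3.075, 4.989), (2.321, 5.35), (3.328, 4.944), (3.195, 5.186)]

# Plain-Python stand-in for the RDD call. With Spark, assuming a SparkContext
# named sc, the equivalent would be:
#   sc.parallelize(data).map(lambda x: (x, dist(x))).take(5)
result = [(x, dist(x)) for x in data]
print(result)
```

Because dist now returns a value, each tuple carries a centroid label instead of None. With these particular centroids, all five sample points come out nearest to d3.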
