Euclidean Distance between two points on Pyspark

Question

I have defined a function in PySpark to calculate the Euclidean distance between my centroids and a set of points I have.

from scipy.spatial import distance

def dist(x):
  b = {'d1':distance.euclidean((6,8),x),'d2':distance.euclidean((1,2),x),'d3':distance.euclidean((5,5),x)}
  def get_key(val):
    for key, value in b.items():
      if val == value:
        return key
  print(get_key(min(b.values())))

My points are as follows:

data = [(3.023, 5.138), (3.075, 4.989), (2.321, 5.35), (3.328, 4.944), (3.195, 5.186)]

My objective is to feed all these points into my function and get the nearest centroid for each point. A hypothetical example of the output I am expecting looks like this:

[((3.023, 5.138),d1),
 ((3.075, 4.989),d1),
 ((2.321, 5.35),d2),
 ((3.328, 4.944),d1),
 ((3.195, 5.186),d3)]

When I feed individual points into this function it works perfectly. However, when I try to do this for multiple points using a lambda function, I get None instead of the centroids.

data.map(lambda x:(x,dist((x)))).take(5)

Out[17]: [((3.023, 5.138), None),
 ((3.075, 4.989), None),
 ((2.321, 5.35), None),
 ((3.328, 4.944), None),
 ((3.195, 5.186), None)]

What am I doing wrong here? I would appreciate some help.

Answer 1

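Your function dist doesn't return anything. It calls print, which writes the key to standard output but has no return value of its own. A Python function that finishes without a return statement implicitly returns None, so that None is what map places in each tuple.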

Change the print to return and I suspect you will be happier.
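
As a rough sketch of what the corrected function could look like: the version below returns the label instead of printing it. It is not the asker's exact code. The name nearest_centroid and the CENTROIDS dictionary are made up for illustration, and math.dist (Python 3.8+) stands in for scipy's distance.euclidean so the snippet runs without extra packages. A plain list comprehension is used in place of the Spark map so it can be tried locally.

```python
import math

# Centroids copied from the question; the dictionary name is illustrative.
CENTROIDS = {'d1': (6, 8), 'd2': (1, 2), 'd3': (5, 5)}

def nearest_centroid(point):
    """Return the label of the centroid closest to `point`."""
    # min() iterates over the labels and ranks them by Euclidean distance,
    # so no reverse lookup from distance back to key is needed.
    return min(CENTROIDS, key=lambda label: math.dist(CENTROIDS[label], point))

data = [(3.023, 5.138), (3.075, 4.989), (2.321, 5.35),
        (3.328, 4.944), (3.195, 5.186)]

# Local stand-in for the Spark map: pair each point with its nearest label.
labelled = [(p, nearest_centroid(p)) for p in data]
```

Assuming the points live in an RDD (for example one created with sc.parallelize(data)), the same function should drop into the original call as rdd.map(lambda p: (p, nearest_centroid(p))).take(5), since the lambda now receives a real return value rather than None.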
