
Creating a two-dimensional array with Spark (pyspark)

Working with Python 2.7 in Spark, I have two lists of two-dimensional points. List set_a has n points and list set_b has m points, and each point is represented by a two-element list (its x and y coordinates):

set_a = [[x1, y1], [x2, y2], ..., [xn, yn]]
set_b = [[x1, y1], [x2, y2], ..., [xm, ym]]

I would like to build an n*m matrix M where element M[i][j] holds the distance between the point with index i in set_a and the point with index j in set_b. This is not the Euclidean distance: I have my own personal_distance_function(point_a, point_b) that I would like to use to build M.

In pure Python 2.7 I'm currently doing something like this:

M = [[0.0] * len(set_b) for _ in range(len(set_a))]  # pre-allocate the n*m matrix
for i in range(len(set_a)):
    for j in range(len(set_b)):
        M[i][j] = personal_distance_function(set_a[i], set_b[j])

... but since I need to do this with pyspark, do you have any suggestions on how to do it using SparkContext?

First, you need to convert your lists into DataFrames:

>>> df_a = spark.createDataFrame(set_a, ['a_x', 'a_y'])
>>> df_b = spark.createDataFrame(set_b, ['b_x', 'b_y'])
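For a concrete run, here is a small hypothetical data set that could stand in for set_a and set_b above (any numeric (x, y) pairs work the same way, one row per point):

>>> # Hypothetical sample points, for illustration only
>>> set_a = [[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]]  # n = 3 points
>>> set_b = [[1.0, 1.0], [2.0, 2.0]]              # m = 2 points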

Then you need to create a UDF (user-defined function) to register your function with Spark:

>>> from pyspark.sql.functions import udf, struct
>>> from pyspark.sql.types import DoubleType
>>> dist = udf(personal_distance_function, DoubleType())
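Note that personal_distance_function must already be defined when udf() is called. Each struct() argument reaches it as a pyspark.sql.Row, which can be indexed like a two-element list, so a function written for [x, y] pairs works unchanged. Since the question doesn't show the real function, here is a placeholder (a Manhattan distance) with the right shape:

>>> # Placeholder for the unspecified personal_distance_function:
>>> # each argument is a Row built by struct(), indexable like [x, y]
>>> def personal_distance_function(point_a, point_b):
...     return float(abs(point_a[0] - point_b[0]) + abs(point_a[1] - point_b[1]))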

Finally, you can use plain Spark code to cross-join the two DataFrames and apply the distance function to each pair:

>>> df_a.crossJoin(df_b) \
...     .withColumn('dist', dist(struct('a_x', 'a_y'), struct('b_x', 'b_y'))) \
...     .show()
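This shows every (a, b) pair with its distance, but the cross join does not carry the original list positions, so its output cannot be turned back into M[i][j] directly. One way to recover the matrix the question asks for is to attach an index column to each DataFrame before joining, then collect the result and fill a plain Python matrix. A sketch, where the index column names i and j are my own choice:

>>> from pyspark.sql import Row
>>> df_a = spark.createDataFrame(
...     [Row(i=i, a_x=p[0], a_y=p[1]) for i, p in enumerate(set_a)])
>>> df_b = spark.createDataFrame(
...     [Row(j=j, b_x=p[0], b_y=p[1]) for j, p in enumerate(set_b)])
>>> rows = df_a.crossJoin(df_b) \
...     .withColumn('dist', dist(struct('a_x', 'a_y'), struct('b_x', 'b_y'))) \
...     .collect()
>>> M = [[0.0] * len(set_b) for _ in range(len(set_a))]
>>> for r in rows:
...     M[r['i']][r['j']] = r['dist']

Keep in mind that collect() pulls all n*m distances back to the driver, which is fine for moderate sizes; for very large inputs you would keep working with the joined DataFrame instead.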
