
String matching function between two columns using Levenshtein distance in PySpark

I am trying to compare pairs of names by converting the Levenshtein distance between them to a matching coefficient, such as:

coef = 1 - levenshtein(str1, str2) / max(length(str1), length(str2))

However, when I implement it in PySpark using withColumn(), I get errors when computing the max() function. Both numpy.max and pyspark.sql.functions.max throw errors. Any idea?

from pyspark.sql.functions import col, length, levenshtein

valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['firstname','id'])

test_compare = TableA.withColumnRenamed('firstname', 'firstname2').withColumnRenamed('id', 'id2').crossJoin(TableA)
# this line fails: Python's built-in max() cannot compare two Column objects
test_compare.withColumn("distance_firstname", levenshtein('firstname', 'firstname2') / max(length(col('firstname')), length(col('firstname2'))))

max is an aggregate function. To take the greater of two values per row you want greatest, also from pyspark.sql.functions:

from pyspark.sql.functions import col, length, greatest, levenshtein
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['firstname','id'])

test_compare = TableA.withColumnRenamed('firstname', 'firstname2').withColumnRenamed('id', 'id2').crossJoin(TableA)
test_compare.withColumn("distance_firstname", levenshtein('firstname', 'firstname2') / greatest(length(col('firstname')), length(col('firstname2')))).show()
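The expression above gives only the distance term; the question's coefficient also needs the `1 -` part. The formula can be sanity-checked in plain Python before wiring it into Spark. A minimal sketch, where `levenshtein` is an ordinary dynamic-programming edit-distance helper (not the Spark function) and `matching_coef` is a hypothetical name for the coefficient:

```python
def levenshtein(s1, s2):
    """Classic dynamic-programming edit distance between two strings."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep s2 as the shorter string
    prev = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            # deletion, insertion, or substitution (free if chars match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def matching_coef(a, b):
    """1 - levenshtein / max(len): 1.0 for identical strings, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0  # both empty: treat as a perfect match
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(matching_coef("Ninja", "Ninja"))      # identical -> 1.0
print(matching_coef("kitten", "sitting"))   # distance 3, max length 7 -> 4/7
```

On the Spark side, the same coefficient can be expressed with `pyspark.sql.functions.lit`, e.g. `lit(1) - levenshtein('firstname', 'firstname2') / greatest(length(col('firstname')), length(col('firstname2')))`.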
