
Use external library in pandas_udf in pyspark

Is it possible to use an external library like textdistance inside a pandas_udf? I tried, and I get this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I have tried with Spark version 2.3.1.

You can package textdistance together with your own code (use setup.py and bdist_egg to build an egg file) and pass the resulting package with the --py-files option when you run Spark.

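By the way, the error message doesn't seem to be related to textdistance at all. pandas raises it whenever a whole Series is used where a single True/False value is expected, for example in an if statement.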
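A scalar pandas_udf receives each column as a whole pandas Series rather than one value at a time, so a function written for two strings has to be applied pair by pair. The sketch below shows that on plain pandas Series, without Spark. It assumes textdistance may not be installed, so the standard library's difflib.SequenceMatcher stands in for textdistance.ratcliff_obershelp; the names similarity and similarity_series are hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(s1, s2):
    # Stand-in (assumption) for textdistance.ratcliff_obershelp(s1, s2):
    # difflib's ratio is a similar "gestalt pattern matching" score in [0, 1].
    return SequenceMatcher(None, s1, s2).ratio()


def similarity_series(col1: pd.Series, col2: pd.Series) -> pd.Series:
    # Apply the scalar function to each pair of values and return a Series
    # of the same length, which is what a scalar pandas_udf must return.
    scores = [similarity(a, b) for a, b in zip(col1, col2)]
    return pd.Series(scores, index=col1.index, dtype="float64")


word1 = pd.Series(["spark", "pandas"])
word2 = pd.Series(["spark", "panda"])

# Using a whole Series as a single boolean reproduces the error from the question.
try:
    bool(word1 == word2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

result = similarity_series(word1, word2)
```

In Spark, a function shaped like similarity_series is what you would wrap with pandas_udf and a DoubleType return type; the scalar similarity function on its own is only suitable for a regular UDF.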

You can use a Spark UDF, for example to implement the Ratcliff-Obershelp function:

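import textdistance
from pyspark.sql.types import FloatType

def my_ro(s1, s2):
    # Ratcliff-Obershelp score for two strings, as a plain Python float
    return float(textdistance.ratcliff_obershelp(s1, s2))

# Register the function so it can be called from Spark SQL
spark.udf.register("my_ro", my_ro, FloatType())

# Assumes spark_df is available as a table or temporary view
spark.sql("SELECT word1, word2, my_ro(word1, word2) AS ro FROM spark_df") \
    .show(100, False)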
