Spark UDF on Python function
I have created a Python function for translating short strings using the GCP Translate API. The code does something like this:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def translateString(inputString, targetLanguage, apiKey):
    baseUrl = "https://translation.googleapis.com/language/translate/v2?key="
    q = "&q="
    gcpKey = apiKey
    target = "&target="
    sentence = str(inputString)
    # Finalize request url
    url = baseUrl + gcpKey + q + sentence + target + targetLanguage
    # Send request with exponential back-off in case of errors from exceeding API quota limits
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=100)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    response = session.get(url, timeout=120)
    if response.status_code == 200:
        data = response.json()
        translatedStr = data["data"]["translations"][0]["translatedText"]
        return str(translatedStr)
    else:
        return "Error with code: " + str(response.status_code)
```
```python
udfTrans = F.udf(translateString, StringType())
apiKey = *********
dfTempNo = dfToProcess.withColumn("TRANSLATED_FIELD", udfTrans(lit(dfToProcess.FIELD_TO_PROCESS), lit("no"), lit(apiKey)))
```
This works great when looping through a pd.DataFrame and storing the return values as we go! But now I need to apply this function to a spark.DataFrame so the work can be distributed, so I created the UDF `udfTrans = F.udf(translateString, StringType())` shown above so that it can be applied to a string column in a spark.DataFrame.
When I run the UDF via `dfTempNo = dfToProcess.withColumn("TRANSLATED_FIELD", udfTrans(lit(dfToProcess.FIELD_TO_PROCESS), lit("no"), lit(apiKey)))` it returns no errors, but takes forever to run when dfToProcess has more than 1 row.
I am unsure if I have misunderstood how UDFs are applied to columns in a spark.DataFrame. Is it even possible to apply a function like this to a spark.DataFrame using a UDF, or am I better off doing this in Python/Pandas?
Python udfs cannot be parallelised like this, because your executor needs to call back to the driver for the execution of your udf. This unfortunately means that your udf is going to be blocking for each row and is essentially serial in its execution.
This can be solved more efficiently using different approaches. As your function is heavily IO bound (more specifically network bound), you could look at something like a ThreadPool implementation, storing your output in a dict, then calling SparkContext.parallelize() on your dict and going from there.
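A minimal sketch of the ThreadPool approach, using `multiprocessing.pool.ThreadPool` from the standard library. The `translate` function here is a hypothetical stand-in so the sketch is self-contained; in practice you would call your `translateString(sentence, "no", apiKey)` instead:

```python
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for the real network call; replace with
# translateString(sentence, "no", apiKey) in practice.
def translate(sentence):
    return sentence.upper()

sentences = ["hello", "world", "spark"]

# Run the network-bound calls concurrently; threads are fine here
# because the work is IO bound, not CPU bound.
with ThreadPool(processes=8) as pool:
    results = pool.map(translate, sentences)

# pool.map preserves input order, so we can zip back up into a dict.
translations = dict(zip(sentences, results))

# With a SparkSession available, the dict can then be handed to Spark:
# rdd = spark.sparkContext.parallelize(list(translations.items()))
# df = rdd.toDF(["FIELD_TO_PROCESS", "TRANSLATED_FIELD"])
```

You could then join this small translated DataFrame back onto dfToProcess.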
Alternatively, you could write your udf in Scala, as it will automatically be parallel in execution.
Alternatively alternatively, have a look at https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf as pandas udfs can be vectorized. Hope this helps!
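To illustrate the shape of that: a pandas udf wraps a function that receives a whole batch of values as a `pd.Series` and returns a `pd.Series`, instead of being called once per row. The batch function below is a hypothetical placeholder (it just upper-cases); the commented-out lines show how it would be registered with Spark:

```python
import pandas as pd

# The function a pandas udf wraps: takes a pd.Series of strings and
# returns a pd.Series of results, processing a whole batch at once.
def translate_batch(col: pd.Series) -> pd.Series:
    # Hypothetical per-batch work; in practice you could issue one
    # batched Translate API request for the whole Series here.
    return col.str.upper()

# With Spark available this becomes a vectorized UDF:
# from pyspark.sql.functions import pandas_udf
# from pyspark.sql.types import StringType
# translateUdf = pandas_udf(translate_batch, StringType())
# dfTempNo = dfToProcess.withColumn("TRANSLATED_FIELD", translateUdf("FIELD_TO_PROCESS"))

out = translate_batch(pd.Series(["hei", "verden"]))
```

The win over a plain udf is that serialization between the JVM and Python happens once per batch (via Arrow) rather than once per row, and you can batch your API calls accordingly.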