Spark UDF on Python function
I have created a Python function for translating short strings using the GCP Translate API. The code does something like this:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def translateString(inputString, targetLanguage, apiKey):
    baseUrl = "https://translation.googleapis.com/language/translate/v2?key="
    q = "&q="
    gcpKey = apiKey
    target = "&target="
    sentence = str(inputString)
    # Finalize request url
    url = baseUrl + gcpKey + q + sentence + target + targetLanguage
    # Send request with exponential back-off in case of errors from exceeding API quota limits
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=100)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    response = session.get(url, timeout=120)
    if response.status_code == 200:
        data = response.json()
        translatedStr = data["data"]["translations"][0]["translatedText"]
        return str(translatedStr)
    else:
        return "Error with code: " + str(response.status_code)
```
```python
udfTrans = F.udf(translateString, StringType())
apiKey = *********
dfTempNo = dfToProcess.withColumn("TRANSLATED_FIELD", udfTrans(lit(dfToProcess.FIELD_TO_PROCESS), lit("no"), lit(apiKey)))
```
This works great when looping through a pd.DataFrame and storing the return values as we go! But now I need to apply this function to a spark.DataFrame so the work can be distributed, so I created the UDF `udfTrans = F.udf(translateString, StringType())` shown above so that it can be applied to a string column in a spark.DataFrame.
When I run the UDF via `dfTempNo = dfToProcess.withColumn("TRANSLATED_FIELD", udfTrans(lit(dfToProcess.FIELD_TO_PROCESS), lit("no"), lit(apiKey)))` it returns no errors, but takes forever to run when dfToProcess has more than 1 row.
I am unsure if I have misunderstood how UDFs are applied to columns in a spark.DataFrame. Is it even possible to apply a function like this to a spark.DataFrame using a UDF, or am I better off doing this in Python/Pandas?
Python udfs cannot be parallelised like this, because your executor needs to call back to the driver for the execution of your udf. This unfortunately means that your udf is going to be blocking for each row and is essentially serial in its execution.
This can be solved more efficiently using different approaches. As your function is heavily IO bound (more specifically network bound), you could look at something like a ThreadPool implementation, storing your output in a dict, then calling SparkContext.parallelize() on your dict and going from there.
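A minimal sketch of the ThreadPool approach, using `multiprocessing.pool.ThreadPool` from the standard library. The `translate` function here is a hypothetical stand-in so the sketch is self-contained; in practice you would call your `translateString(sentence, "no", apiKey)` instead:

```python
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for the real network call; replace with
# translateString(sentence, "no", apiKey) in practice.
def translate(sentence):
    return sentence.upper()

sentences = ["hello", "world", "spark"]

# Run the network-bound calls concurrently; threads are fine here
# because the work is IO bound, not CPU bound.
with ThreadPool(processes=8) as pool:
    results = pool.map(translate, sentences)

# pool.map preserves input order, so we can zip back up into a dict.
translations = dict(zip(sentences, results))

# With a SparkSession available, the dict can then be handed to Spark:
# rdd = spark.sparkContext.parallelize(list(translations.items()))
# df = rdd.toDF(["FIELD_TO_PROCESS", "TRANSLATED_FIELD"])
```

You could then join this small translated DataFrame back onto dfToProcess.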
Alternatively, you could write your udf in Scala, as it will automatically be parallel in execution.
Alternatively alternatively, have a look at https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf as pandas udfs can be vectorized. Hope this helps!
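To illustrate the shape of that: a pandas udf wraps a function that receives a whole batch of values as a `pd.Series` and returns a `pd.Series`, instead of being called once per row. The batch function below is a hypothetical placeholder (it just upper-cases); the commented-out lines show how it would be registered with Spark:

```python
import pandas as pd

# The function a pandas udf wraps: takes a pd.Series of strings and
# returns a pd.Series of results, processing a whole batch at once.
def translate_batch(col: pd.Series) -> pd.Series:
    # Hypothetical per-batch work; in practice you could issue one
    # batched Translate API request for the whole Series here.
    return col.str.upper()

# With Spark available this becomes a vectorized UDF:
# from pyspark.sql.functions import pandas_udf
# from pyspark.sql.types import StringType
# translateUdf = pandas_udf(translate_batch, StringType())
# dfTempNo = dfToProcess.withColumn("TRANSLATED_FIELD", translateUdf("FIELD_TO_PROCESS"))

out = translate_batch(pd.Series(["hei", "verden"]))
```

The win over a plain udf is that serialization between the JVM and Python happens once per batch (via Arrow) rather than once per row, and you can batch your API calls accordingly.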