AWS Comprehend + Pyspark UDF = Error: can't pickle SSLContext objects

Question

When applying a Pyspark UDF that calls an AWS API, I get the error:

PicklingError: Could not serialize object: TypeError: can't pickle SSLContext objects

The code is:

import pyspark.sql.functions as sqlf
import boto3

comprehend = boto3.client('comprehend', region_name='us-east-1')

def detect_sentiment(text):
  response = comprehend.detect_sentiment(Text=text, LanguageCode='pt')
  return response["SentimentScore"]["Positive"]

detect_sentiment_udf = sqlf.udf(detect_sentiment)

test = df.withColumn("Positive", detect_sentiment_udf(df.Conversa))

Where df.Conversa contains short, simple strings. How can I solve this? Or what could be an alternative approach?

Answer 1

Create the comprehend boto3 client inside the detect_sentiment function definition.

Answer 2

When Spark ships your udf to the executors, it serializes (pickles) the function together with everything it references, and this whole context needs to be serializable. The boto client is NOT serializable, so you need to create it within your udf call.
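To see why the client cannot travel with the udf, here is a minimal sketch that uses only the standard library. It assumes the SSLContext named in the error message behaves like any other SSLContext; the helper name try_pickle is hypothetical and not part of Spark or boto3.

```python
import pickle
import ssl

def try_pickle(obj):
    """Return None if obj pickles cleanly, else the exception that was raised."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as exc:
        return exc

# A plain string, like the values in df.Conversa, pickles fine.
ok = try_pickle("short simple string")

# An SSLContext does not, which is what the PicklingError complains about.
err = try_pickle(ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT))
print(type(err).__name__, err)
```

Any object that holds such a context, directly or through nested attributes, fails in the same way once Spark tries to serialize the udf's closure.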

If you use an object's method as the udf, as below, you will get the same error, because self is serialized along with the method. To fix it, expose the client through a property instead of storing it on the instance.

import pyspark.sql.functions as sqlf
import boto3

class Foo:
    # Don't do this: a client stored on the instance is pickled together
    # with self and generates the error when the udf is serialized.
    # def __init__(self):
    #     self.client = boto3.client('comprehend', region_name='us-east-1')

    # Do this instead: the property builds the client on access,
    # so nothing unpicklable is stored on self.
    @property
    def client(self):
        return boto3.client('comprehend', region_name='us-east-1')

    def my_udf(self, text):
        response = self.client.detect_sentiment(Text=text, LanguageCode='pt')
        return response["SentimentScore"]["Positive"]

    def add_sentiment_column(self, df):
        detect_sentiment_udf = sqlf.udf(self.my_udf)
        return df.withColumn("Positive", detect_sentiment_udf(df.Conversa))

@johnhill2424's answer will fix the problem in your case:

import pyspark.sql.functions as sqlf
import boto3

def detect_sentiment(text):
  comprehend = boto3.client('comprehend', region_name='us-east-1')
  response = comprehend.detect_sentiment(Text=text, LanguageCode='pt')
  return response["SentimentScore"]["Positive"]

detect_sentiment_udf = sqlf.udf(detect_sentiment)

test = df.withColumn("Positive", detect_sentiment_udf(df.Conversa))
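Note that the version above builds a new client for every row, which may be slow on large DataFrames. One possible alternative, sketched here and not taken from the answers, is to create the client once per partition with mapPartitions. The names make_client, score_partition and client_factory are hypothetical; the sketch assumes the question's df with its Conversa column and an active SparkSession.

```python
def make_client():
    # Imported lazily so the client is only ever built on the executor.
    import boto3
    return boto3.client('comprehend', region_name='us-east-1')

def score_partition(rows, client_factory=make_client):
    """Score every row of one partition using a single Comprehend client."""
    client = client_factory()  # created once per partition, never pickled
    for row in rows:
        text = row['Conversa']
        response = client.detect_sentiment(Text=text, LanguageCode='pt')
        yield (text, float(response['SentimentScore']['Positive']))

# Usage on the cluster (assumes the question's df):
# test = df.rdd.mapPartitions(score_partition).toDF(['Conversa', 'Positive'])
```

The client_factory parameter exists only so the function can be exercised locally with a stand-in client; on the cluster the default is used.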

Answer 1: 2021-04-06 19:16:40
Answer 2: 2021-08-09 15:56:20
