AWS Comprehend + Pyspark UDF = Error: can't pickle SSLContext objects
When applying a PySpark UDF that calls the AWS API, the following error occurs:

PicklingError: Could not serialize object: TypeError: can't pickle SSLContext objects
The code is:

    import pyspark.sql.functions as sqlf
    import boto3

    comprehend = boto3.client('comprehend', region_name='us-east-1')

    def detect_sentiment(text):
        response = comprehend.detect_sentiment(Text=text, LanguageCode='pt')
        return response["SentimentScore"]["Positive"]

    detect_sentiment_udf = sqlf.udf(detect_sentiment)

    test = df.withColumn("Positive", detect_sentiment_udf(df.Conversa))
where df.Conversa contains short, simple strings. How can I fix this, or is there an alternative approach?
Move the comprehend boto3 client into the detect_sentiment function definition.
When your UDF is called, it receives its entire enclosing context, and that context must be serializable. The boto client is not serializable, so you need to create it inside the UDF.
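The root cause can be reproduced outside Spark entirely: a boto3 client holds an `ssl.SSLContext` internally, and such objects refuse to be pickled. A minimal illustration in plain Python (no boto3 or Spark required):

    import pickle
    import ssl

    # Pickling an SSLContext fails with the same TypeError that
    # Spark surfaces when it tries to serialize the UDF's closure.
    ctx = ssl.create_default_context()
    try:
        pickle.dumps(ctx)
    except TypeError as e:
        print("pickle failed:", e)

This is why moving the client construction inside the function works: the closure Spark pickles then contains only the boto3 module reference, and each executor builds its own client locally.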
You will get the same error if you use a method of an object as a UDF, as shown below. To fix it, expose the client as a property:
    class Foo:
        def __init__(self):
            # this will generate an error when the udf is called
            self.client = boto3.client('comprehend', region_name='us-east-1')

        # do this instead
        @property
        def client(self):
            return boto3.client('comprehend', region_name='us-east-1')

        def my_udf(self, text):
            response = self.client.detect_sentiment(Text=text, LanguageCode='pt')
            return response["SentimentScore"]["Positive"]

        def add_sentiment_column(self, df):
            detect_sentiment_udf = sqlf.udf(self.my_udf)
            return df.withColumn("Positive", detect_sentiment_udf(df.Conversa))
@johnhill2424's answer solves the problem:
    import pyspark.sql.functions as sqlf
    import boto3

    def detect_sentiment(text):
        comprehend = boto3.client('comprehend', region_name='us-east-1')
        response = comprehend.detect_sentiment(Text=text, LanguageCode='pt')
        return response["SentimentScore"]["Positive"]

    detect_sentiment_udf = sqlf.udf(detect_sentiment)
    test = df.withColumn("Positive", detect_sentiment_udf(df.Conversa))