Problem applying a UDF to a column of a PySpark DataFrame

Question

My goal is to clean the data in a column of a PySpark DataFrame. I wrote a function to do the cleaning.

def preprocess(text):
    text = text.lower()
    text = text.strip()
    text = re.compile('<.*?>').sub('', text)  # strip HTML tags
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)  # punctuation -> space
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # drop bracketed numbers such as [12]
    text = re.sub(r'[^\w\s]', '', text.lower().strip())  # drop remaining non-word characters
    text = re.sub(r'\d', ' ', text)  # digits -> space
    text = re.sub(r'\s+', ' ', text)
    return text

 

#LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()

stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    text = [i for i in text.split() if not i in stop_words]
    return text
 
# Helper that maps NLTK part-of-speech tags to WordNet POS constants
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
# Tokenize the sentence
def lemmatizer(string):
    word_pos_tags = nltk.pos_tag(word_tokenize(string)) # Get position tags
    a=[wl.lemmatize(tag[0], get_wordnet_pos(tag[1])) for idx, tag in enumerate(word_pos_tags)] # Map the position tag and lemmatize the word/token
    return " ".join(a)

#Final Function
def finalpreprocess(string):
    return lemmatizer(' '.join(remove_stopwords(preprocess(string))))

The functions seem to work fine when I test them. When I run

text = 'Ram and Bheem are buddies. They (both) like <b>running</b>. They got better at it over the weekend'

print(finalpreprocess(text))

I see exactly the result I want.

ram bheem buddy like run get well weekend

However, when I try to apply finalpreprocess() to a column in a PySpark DataFrame, I get an error. This is what I did.

udf_txt_clean = udf(lambda x: finalpreprocess(x),StringType())
df.withColumn("cleaned_text",udf_txt_clean(col("reason"))).select("reason","cleaned_text").show(10,False)

Then I get the error:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 473, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object

Here is what I have tried so far. finalpreprocess() uses three different functions: preprocess(), remove_stopwords(), and lemmatizer(). I changed udf_txt_clean to call each of them in turn, like this:

udf_txt_clean = udf(lambda x: preprocess(x),StringType())
udf_txt_clean = udf(lambda x: remove_stopwords(x),StringType())

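These two run fine, but

udf_txt_clean = udf(lambda x: lemmatizer(x),StringType())

is the one that gives me the error. I don't understand why this function fails while the other two do not. From my limited understanding, Spark is having trouble pickling this function, but I can't see why it is trying to pickle it in the first place, or whether there is a workaround.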

Answer 1

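It would help if the example were more reproducible next time; recreating it took some time. No worries though, I have a solution here.

First, cloudpickle is the mechanism Spark uses to ship a function from the driver to the workers. The function is pickled and then sent to the workers for execution. So something you are using cannot be pickled. For a quick test, you can use: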

import cloudpickle
cloudpickle.dumps(x)

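where x is the thing you are testing; the call succeeds only if x is cloudpickle-able. In this case, I tried a few objects and found that wordnet is not serializable. You can test that with: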

cloudpickle.dumps(wordnet)

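That reproduces the problem. You can work around it by importing the thing that cannot be pickled inside the function that uses it. Here is an end-to-end example.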

import re
import string
import nltk  # needed for nltk.pos_tag in lemmatizer()
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType,IntegerType,StringType

def preprocess(text):
    text = text.lower()
    text = text.strip()
    text = re.compile('<.*?>').sub('', text)  # strip HTML tags
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)  # punctuation -> space
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # drop bracketed numbers such as [12]
    text = re.sub(r'[^\w\s]', '', text.lower().strip())  # drop remaining non-word characters
    text = re.sub(r'\d', ' ', text)  # digits -> space
    text = re.sub(r'\s+', ' ', text)
    return text


#LEMMATIZATION
# Initialize the lemmatizer
wl = WordNetLemmatizer()

stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    text = [i for i in text.split() if not i in stop_words]
    return text
 
def lemmatizer(string):
    from nltk.corpus import wordnet
    def get_wordnet_pos(tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN
    word_pos_tags = nltk.pos_tag(word_tokenize(string)) # Get position tags
    a=[wl.lemmatize(tag[0], get_wordnet_pos(tag[1])) for idx, tag in enumerate(word_pos_tags)] # Map the position tag and lemmatize the word/token
    return " ".join(a)

#Final Function
def finalpreprocess(string):
    return lemmatizer(' '.join(remove_stopwords(preprocess(string))))

spark = SparkSession.builder.getOrCreate()
text = 'Ram and Bheem are buddies. They (both) like <b>running</b>. They got better at it over the weekend'
test = pd.DataFrame({"test": [text]})
sdf = spark.createDataFrame(test)
udf_txt_clean = udf(lambda x: finalpreprocess(x),StringType())
sdf.withColumn("cleaned_text",udf_txt_clean(col("test"))).select("test","cleaned_text").show(10,False)
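The "try each object until one fails to pickle" step described above can be sketched with the standard library alone. This is only an illustration under some assumptions: it uses the built-in pickle module as a stand-in for cloudpickle, the helper names pickling_error and find_unpicklable are made up for this example, and a threading.RLock stands in for whatever unpicklable object the real code references (it is the same kind of object named in the traceback).

```python
import pickle
import threading


def pickling_error(obj):
    """Return None if obj can be pickled, otherwise a short error description."""
    try:
        pickle.dumps(obj)
    except Exception as exc:  # pickle may raise TypeError, PicklingError, AttributeError, ...
        return f"{type(exc).__name__}: {exc}"
    return None


def find_unpicklable(candidates):
    """Check every named object and keep only the ones that fail to pickle."""
    failures = {}
    for name, obj in candidates.items():
        error = pickling_error(obj)
        if error is not None:
            failures[name] = error
    return failures


# Hypothetical stand-ins for the objects a UDF might reference.
candidates = {
    "stop_words": {"a", "an", "the"},  # plain set of strings: picklable
    "lock": threading.RLock(),         # lock objects cannot be pickled
}

failures = find_unpicklable(candidates)
print(failures)
```

In the real session you would pass the globals your UDF touches (for example wl, stop_words, and wordnet) and swap pickle.dumps for cloudpickle.dumps, since cloudpickle is what Spark actually uses. Anything that shows up in the result is a candidate for being imported or created inside the function instead.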
