pyspark: StopWordsRemover with user-defined functions (UDF)

I have a dataframe consisting of texts and their languages:

sf = spark.createDataFrame([
    ('eng', "I saw the red balloon"),
    ('eng', 'She was drinking tea from a black mug'),
    ('ger','Er ging heute sehr weit'),
    ('ger','Ich habe dich seit hundert Jahren nicht mehr gesehen')
], ["lang", "text"])
sf.show()

Output:

+----+--------------------+
|lang|                text|
+----+--------------------+
| eng|I saw the red bal...|
| eng|She was drinking ...|
| ger|Er ging heute seh...|
| ger|Ich habe dich sei...|
+----+--------------------+

I want to remove the stop words from each text, using the list that matches its language. For this I created a dictionary:

from pyspark.ml.feature import StopWordsRemover

ger_stopwords = StopWordsRemover.loadDefaultStopWords("german")
eng_stopwords = StopWordsRemover.loadDefaultStopWords("english")
stopwords = {'eng':eng_stopwords,
            'ger':ger_stopwords}

Now I don't understand how to apply these stop words to col('text') using a UDF. A plain transform() does not suit me here, because a single StopWordsRemover takes only one stop word list, while I need a different list per row depending on lang.

I do not know exactly how to use StopWordsRemover, but based on what you did and on the documentation, I can offer this solution (without UDFs): filter the dataframe by language, run one StopWordsRemover per language, and union the results.

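from functools import reduce

from pyspark.ml.feature import StopWordsRemover
from pyspark.sql import functions as F

# One StopWordsRemover per language, applied to that language's rows only,
# then all per-language results are unioned back into a single dataframe.
df = reduce(
    lambda a, b: a.union(b),  # union() replaces the deprecated unionAll()
    (
        StopWordsRemover(
            inputCol="splitted_text", outputCol="words", stopWords=value
        ).transform(
            # StopWordsRemover expects an array column, so split the text first
            sf.where(F.col("lang") == key).withColumn(
                "splitted_text", F.split("text", " ")
            )
        )
        for key, value in stopwords.items()
    ),
)

df.show(truncate=False)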
+----+----------------------------------------------------+--------------------------------------------------------------+--------------------------------------+
|lang|text                                                |splitted_text                                                 |words                                 |
+----+----------------------------------------------------+--------------------------------------------------------------+--------------------------------------+
|eng |I saw the red balloon                               |[I, saw, the, red, balloon]                                   |[saw, red, balloon]                   |
|eng |She was drinking tea from a black mug               |[She, was, drinking, tea, from, a, black, mug]                |[drinking, tea, black, mug]           |
|ger |Er ging heute sehr weit                             |[Er, ging, heute, sehr, weit]                                 |[ging, heute, weit]                   |
|ger |Ich habe dich seit hundert Jahren nicht mehr gesehen|[Ich, habe, dich, seit, hundert, Jahren, nicht, mehr, gesehen]|[seit, hundert, Jahren, mehr, gesehen]|
+----+----------------------------------------------------+--------------------------------------------------------------+--------------------------------------+
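
Since the question asked specifically about a UDF, here is a minimal sketch of that route as well. It is not taken from the documentation. It assumes the text can be tokenized by splitting on single spaces and that matching should be case-insensitive, like StopWordsRemover's default. The names STOPWORDS, STOP_SETS, remove_stopwords and make_remove_stopwords_udf are made up for illustration, and the tiny word lists are stand-ins so the function can be tried without Spark; in the real job you would build the sets from the stopwords dictionary in the question.

```python
# Stand-in stop word lists (assumption, kept tiny so the sketch runs without Spark).
# In the real job, use the `stopwords` dict built with loadDefaultStopWords.
STOPWORDS = {
    "eng": ["i", "the", "she", "was", "from", "a"],
    "ger": ["er", "sehr", "ich", "habe", "dich", "nicht"],
}

# Lowercased sets give fast lookups and case-insensitive matching.
STOP_SETS = {
    lang: frozenset(word.lower() for word in words)
    for lang, words in STOPWORDS.items()
}


def remove_stopwords(lang, text):
    """Split text on spaces and drop the stop words registered for lang."""
    if text is None:
        return None
    stop = STOP_SETS.get(lang, frozenset())  # unknown language: remove nothing
    return [word for word in text.split(" ") if word.lower() not in stop]


def make_remove_stopwords_udf():
    """Wrap the plain function as a Spark UDF (pyspark is imported lazily)."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    return F.udf(remove_stopwords, ArrayType(StringType()))


# Usage on the question's dataframe (needs a running Spark session):
# sf.withColumn("words", make_remove_stopwords_udf()("lang", "text")).show(truncate=False)
```

A Python UDF processes rows one at a time outside the JVM, so on large data the filter-and-union version above will probably be faster. The UDF version is mainly convenient when there are many languages and one pass over the dataframe is preferred.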
