Convert PySpark DataFrame column with list in StringType to ArrayType
Pyspark turning list of string into an ArrayType()
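I'm new to pyspark and could use some guidance. I'm working with some text data, and ultimately I want to get rid of words that occur either too rarely or too often across the whole corpus.
The data looks like this, with each row containing one sentence: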
+--------------------+
| cleaned|
+--------------------+
|China halfway com...|
|MCI overhaul netw...|
|script kiddy join...|
|look Microsoft Mo...|
|Americans appear ...|
|Oil Eases Venezue...|
|Americans lose be...|
|explosion Echo Na...|
|Bush tackle refor...|
|jail olympic pool...|
|coyote sign RW Jo...|
|home pc key Windo...|
|bomb defuse Blair...|
|Livermore need ...|
|hat ring fast Wi ...|
|Americans dutch s...|
|Insect Vibrations...|
|Britain sleepwalk...|
|Ron Regan Jr Kind...|
|IBM buy danish fi...|
+--------------------+
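So basically I use split() from pyspark.sql.functions to split the strings, then count the occurrences of each word, come up with some criteria, and build a list of the words that need to be removed.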
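The counting step described above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `corpus_word_counts` and `words_outside_range` and the thresholds `min_count` and `max_count` are hypothetical, since the post does not state the actual criteria.

```python
def words_outside_range(word_counts, min_count, max_count):
    """Return (sorted) the words whose corpus count is below min_count or above max_count."""
    return sorted(
        word for word, count in word_counts.items()
        if count < min_count or count > max_count
    )


def corpus_word_counts(df, text_col="cleaned"):
    """Count word occurrences over a DataFrame holding one sentence per row."""
    # Imported here so the helper above stays usable without a Spark install.
    from pyspark.sql import functions as F

    counted = (
        df.select(F.explode(F.split(F.col(text_col), " ")).alias("word"))
        .where(F.col("word") != "")  # repeated spaces produce empty tokens
        .groupBy("word")
        .count()
    )
    # Collecting assumes the vocabulary is small enough for driver memory.
    return {row["word"]: row["count"] for row in counted.collect()}


# Hypothetical usage, with made-up thresholds:
# list_of_words_to_get_rid = words_outside_range(corpus_word_counts(df), 2, 1000)
```

The thresholds are placeholders; whatever criteria are chosen, the result is a plain Python list that can be passed to the removal step.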
Then I use the following functions:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import *

def remove_stop_words(list_of_tokens, list_of_stopwords):
    '''
    A very simple function that takes in a list of word tokens and then gets rid of words that are in stopwords list
    '''
    return [token for token in list_of_tokens if token not in list_of_stopwords]

def udf_remove_stop_words(list_of_stopwords):
    '''
    creates a udf that takes in a list of stop words and passes them onto remove_stop_words
    '''
    return udf(lambda x: remove_stop_words(x, list_of_stopwords))

wordsNoStopDF = splitworddf.withColumn('removed', udf_remove_stop_words(list_of_words_to_get_rid)(col('split')))
where list_of_words_to_get_rid is the list of words I'm trying to get rid of. The input to this pipeline looks like this:
+--------------------+
| split|
+--------------------+
|[China, halfway, ...|
|[MCI, overhaul, n...|
|[script, kiddy, j...|
|[look, Microsoft,...|
|[Americans, appea...|
|[Oil, Eases, Vene...|
|[Americans, lose,...|
|[explosion, Echo,...|
|[Bush, tackle, re...|
|[jail, olympic, p...|
+--------------------+
only showing top 10 rows
and the output looks like the following, with the corresponding schema:
+--------------------+--------------------+
| split| removed|
+--------------------+--------------------+
|[China, halfway, ...|[China, halfway, ...|
|[MCI, overhaul, n...|[MCI, overhaul, n...|
|[script, kiddy, j...|[script, join, fo...|
|[look, Microsoft,...|[look, Microsoft,...|
|[Americans, appea...|[Americans, appea...|
|[Oil, Eases, Vene...|[Oil, Eases, Vene...|
|[Americans, lose,...|[Americans, lose,...|
|[explosion, Echo,...|[explosion, Echo,...|
|[Bush, tackle, re...|[Bush, tackle, re...|
|[jail, olympic, p...|[jail, olympic, p...|
|[coyote, sign, RW...|[coyote, sign, Jo...|
|[home, pc, key, W...|[home, pc, key, W...|
|[bomb, defuse, Bl...|[bomb, defuse, Bl...|
|[Livermore, , , n...|[Livermore, , , n...|
|[hat, ring, fast,...|[hat, ring, fast,...|
|[Americans, dutch...|[Americans, dutch...|
|[Insect, Vibratio...|[tell, Good, Time...|
|[Britain, sleepwa...|[Britain, big, br...|
|[Ron, Regan, Jr, ...|[Ron, Jr, Guy, , ...|
|[IBM, buy, danish...|[IBM, buy, danish...|
+--------------------+--------------------+
root
 |-- split: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- removed: string (nullable = true)
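So my question is: how do I turn the removed column into an array like split? I was hoping to use explode to count word occurrences, but I can't seem to figure out how. I tried using regexp_replace to strip the brackets and then splitting the string with , as the pattern, but that only seemed to add a bracket to the removed column.
Is there some change I can make to the functions I'm using so that they return an array of strings, like the split column?
Any guidance here would be greatly appreciated!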
You haven't defined a return type for your UDF. It defaults to StringType, which is why your removed column is a string. You can add the return type like this:
from pyspark.sql import types as T
udf(lambda x: remove_stop_words(x, list_of_stopwords), T.ArrayType(T.StringType()))
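Put together with the functions from the question, the fix might look like the sketch below. Converting the stop words to a `frozenset` and passing `None` through unchanged are additions that are not in the original post; they are assumptions meant to speed up lookups and avoid errors on null rows.

```python
def remove_stop_words(list_of_tokens, stopwords):
    """Drop tokens found in stopwords; keep order and repeated tokens."""
    if list_of_tokens is None:  # null array in the source column
        return None
    return [token for token in list_of_tokens if token not in stopwords]


def udf_remove_stop_words(list_of_stopwords):
    """Build a UDF whose declared return type is array<string>."""
    # Imported here so remove_stop_words can be exercised without Spark.
    from pyspark.sql import functions as F, types as T

    stopwords = frozenset(list_of_stopwords)  # O(1) membership checks
    return F.udf(
        lambda tokens: remove_stop_words(tokens, stopwords),
        T.ArrayType(T.StringType()),
    )


# Usage with the names from the question:
# wordsNoStopDF = splitworddf.withColumn(
#     'removed', udf_remove_stop_words(list_of_words_to_get_rid)(F.col('split')))
```

With the return type declared, `printSchema()` should report `removed` as an array of strings, so `explode` can be applied to it directly.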
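You can change the UDF's return type. However, I'd suggest not using any UDF to remove the words in list_of_words_to_get_rid from the array-typed split column, since you can simply use the Spark built-in function array_except.
Here is an example: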
import pyspark.sql.functions as F

df = spark.createDataFrame([("a simple sentence containing some words",)], ["cleaned"])
list_of_words_to_get_rid = ["some", "a"]

wordsNoStopDF = df.withColumn(
    "split",
    F.split("cleaned", " ")
).withColumn(
    "removed",
    F.array_except(
        F.col("split"),
        F.array(*[F.lit(w) for w in list_of_words_to_get_rid])
    )
).drop("cleaned")

wordsNoStopDF.show(truncate=False)
#+----------------------------------------------+-------------------------------------+
#|split                                         |removed                              |
#+----------------------------------------------+-------------------------------------+
#|[a, simple, sentence, containing, some, words]|[simple, sentence, containing, words]|
#+----------------------------------------------+-------------------------------------+
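One caveat worth checking before relying on array_except: it returns its result without duplicates, so a word repeated within a sentence is kept only once, which would skew a later explode-based word count. A minimal sketch of an alternative that keeps repeats, assuming Spark 3.1 or later (where `pyspark.sql.functions.filter` is available); `filter_tokens` and `with_removed_column` are hypothetical helper names, not part of either answer.

```python
def filter_tokens(tokens, words_to_drop):
    """Plain-Python statement of the intended result: drop listed words, keep repeats and order."""
    drop = set(words_to_drop)
    return [token for token in tokens if token not in drop]


def with_removed_column(df, words_to_drop, src="split", dst="removed"):
    """Add column dst: the array in src without the listed words, duplicates preserved."""
    # Imported here so filter_tokens can be exercised without Spark.
    from pyspark.sql import functions as F

    return df.withColumn(
        dst,
        # Higher-order filter over the array; no Python UDF involved.
        F.filter(F.col(src), lambda token: ~token.isin(list(words_to_drop))),
    )


# Usage with the names from the example above:
# wordsNoStopDF = with_removed_column(df.withColumn("split", F.split("cleaned", " ")),
#                                     list_of_words_to_get_rid)
```

Like array_except, this stays inside Spark's built-in functions, so it avoids the serialization cost of a Python UDF.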