Pyspark turning list of string into an ArrayType()
I am a bit of a novice with PySpark and I could use some guidance. I'm working with some text data, and ultimately I want to get rid of words that either don't appear often enough in the entire corpus, or appear too often.

The data looks something like this, with each row containing a sentence:
+--------------------+
| cleaned|
+--------------------+
|China halfway com...|
|MCI overhaul netw...|
|script kiddy join...|
|look Microsoft Mo...|
|Americans appear ...|
|Oil Eases Venezue...|
|Americans lose be...|
|explosion Echo Na...|
|Bush tackle refor...|
|jail olympic pool...|
|coyote sign RW Jo...|
|home pc key Windo...|
|bomb defuse Blair...|
|Livermore need ...|
|hat ring fast Wi ...|
|Americans dutch s...|
|Insect Vibrations...|
|Britain sleepwalk...|
|Ron Regan Jr Kind...|
|IBM buy danish fi...|
+--------------------+
So essentially I split the strings using split() from pyspark.sql.functions, then count the occurrences of each word, come up with some criteria, and create a list of words that need to be deleted. I then use the following functions:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import *

def remove_stop_words(list_of_tokens, list_of_stopwords):
    '''
    A very simple function that takes in a list of word tokens and then gets rid of words that are in the stopwords list
    '''
    return [token for token in list_of_tokens if token not in list_of_stopwords]

def udf_remove_stop_words(list_of_stopwords):
    '''
    Creates a udf that takes in a list of stop words and passes them on to remove_stop_words
    '''
    return udf(lambda x: remove_stop_words(x, list_of_stopwords))

wordsNoStopDF = splitworddf.withColumn('removed', udf_remove_stop_words(list_of_words_to_get_rid)(col('split')))
where list_of_words_to_get_rid is a list of words I'm trying to get rid of, and the input to this pipeline looks as follows:
+--------------------+
| split|
+--------------------+
|[China, halfway, ...|
|[MCI, overhaul, n...|
|[script, kiddy, j...|
|[look, Microsoft,...|
|[Americans, appea...|
|[Oil, Eases, Vene...|
|[Americans, lose,...|
|[explosion, Echo,...|
|[Bush, tackle, re...|
|[jail, olympic, p...|
+--------------------+
only showing top 10 rows
and the output looks like the following, with the corresponding schema:
+--------------------+--------------------+
| split| removed|
+--------------------+--------------------+
|[China, halfway, ...|[China, halfway, ...|
|[MCI, overhaul, n...|[MCI, overhaul, n...|
|[script, kiddy, j...|[script, join, fo...|
|[look, Microsoft,...|[look, Microsoft,...|
|[Americans, appea...|[Americans, appea...|
|[Oil, Eases, Vene...|[Oil, Eases, Vene...|
|[Americans, lose,...|[Americans, lose,...|
|[explosion, Echo,...|[explosion, Echo,...|
|[Bush, tackle, re...|[Bush, tackle, re...|
|[jail, olympic, p...|[jail, olympic, p...|
|[coyote, sign, RW...|[coyote, sign, Jo...|
|[home, pc, key, W...|[home, pc, key, W...|
|[bomb, defuse, Bl...|[bomb, defuse, Bl...|
|[Livermore, , , n...|[Livermore, , , n...|
|[hat, ring, fast,...|[hat, ring, fast,...|
|[Americans, dutch...|[Americans, dutch...|
|[Insect, Vibratio...|[tell, Good, Time...|
|[Britain, sleepwa...|[Britain, big, br...|
|[Ron, Regan, Jr, ...|[Ron, Jr, Guy, , ...|
|[IBM, buy, danish...|[IBM, buy, danish...|
+--------------------+--------------------+
root
|-- split: array (nullable = true)
| |-- element: string (containsNull = true)
|-- removed: string (nullable = true)
So my question is: how do I turn the column removed into an array like split? I'm hoping to use explode to count word occurrences, but I can't seem to quite figure out what to do. I've tried using regexp_replace to get rid of the brackets and then splitting the string with , as the pattern, but that seems to only add a bracket to the removed column.

Is there some change I can make to the functions I'm using to have them return an array of strings like the column split?

Any guidance here would be greatly appreciated!
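For reference, the plain-Python core of the helper does what I expect on ordinary lists outside Spark, so the behavior change seems to happen somewhere on the Spark side:

```python
# Quick sanity check of the stop-word filter outside Spark.
def remove_stop_words(list_of_tokens, list_of_stopwords):
    # Keep only tokens that are not in the stop-word list.
    return [token for token in list_of_tokens if token not in list_of_stopwords]

tokens = ["China", "halfway", "commit", "the"]
stopwords = ["the", "commit"]
print(remove_stop_words(tokens, stopwords))  # ['China', 'halfway']
```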
You haven't defined a return type for your UDF, which is StringType by default; that's why your removed column is a string. You can add a return type like so:
from pyspark.sql import types as T
udf(lambda x: remove_stop_words(x, list_of_stopwords), T.ArrayType(T.StringType()))
You can change the return type of your UDF. However, I'd suggest NOT using a UDF at all to remove the words in list_of_words_to_get_rid from the array-type column split, as you can simply use the Spark built-in function array_except.
Here's an example:
import pyspark.sql.functions as F
df = spark.createDataFrame([("a simple sentence containing some words",)], ["cleaned"])
list_of_words_to_get_rid = ["some", "a"]
wordsNoStopDF = df.withColumn(
    "split",
    F.split("cleaned", " ")
).withColumn(
    "removed",
    F.array_except(
        F.col("split"),
        F.array(*[F.lit(w) for w in list_of_words_to_get_rid])
    )
).drop("cleaned")
wordsNoStopDF.show(truncate=False)
#+----------------------------------------------+-------------------------------------+
#|split |removed |
#+----------------------------------------------+-------------------------------------+
#|[a, simple, sentence, containing, some, words]|[simple, sentence, containing, words]|
#+----------------------------------------------+-------------------------------------+