Regex pattern to remove numeric values from words in PySpark

Question

I am working with a PySpark DataFrame that has a column of words (array<string> type). What regex pattern would strip the digits out of each word and drop the entries that consist only of digits?

+---+----------------------------------------------+
|id |    words                                     |
+---+----------------------------------------------+
|564|[fhbgtrj5, 345gjhg, ghth578ghu, 5897, fhrfu44]|
+---+----------------------------------------------+

Expected output:

+---+----------------------------------------------+
|id |words                                         |
+---+----------------------------------------------+
|564|[fhbgtrj, gjhg, ghthghu, fhrfu]               |
+---+----------------------------------------------+

Please help.

Answer 1

You can use transform together with regexp_replace to strip the digits from each element, then use array_remove to drop the empty strings left behind by elements that consisted only of digits.

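from pyspark.sql import functions as F

# Strip digits from every element, then drop the elements left empty.
df2 = df.withColumn(
    'words',
    F.expr("array_remove(transform(words, x -> regexp_replace(x, '[0-9]', '')), '')")
)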

df2.show(truncate=False)
+---+-------------------------------+
|id |words                          |
+---+-------------------------------+
|564|[fhbgtrj, gjhg, ghthghu, fhrfu]|
+---+-------------------------------+
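
To check the pattern without starting a Spark session, the same two steps (remove digits, then discard empty strings) can be mirrored in plain Python. This is only a sketch of the logic, not part of the Spark answer: the helper name strip_digits and the sample list are assumptions made for illustration, and it uses Python's re module rather than Spark's regex engine, although '[0-9]' behaves the same in both.

```python
import re

# Same pattern as in the Spark expression: match any single ASCII digit.
DIGITS = re.compile(r"[0-9]")

def strip_digits(words):
    """Remove digits from each word, then drop words that end up empty."""
    stripped = [DIGITS.sub("", w) for w in words]
    return [w for w in stripped if w != ""]

# Sample row taken from the question's DataFrame.
sample = ["fhbgtrj5", "345gjhg", "ghth578ghu", "5897", "fhrfu44"]
print(strip_digits(sample))
# ['fhbgtrj', 'gjhg', 'ghthghu', 'fhrfu']
```

The entry "5897" becomes an empty string after substitution, which is why the second filtering step is needed; in the Spark version array_remove(..., '') plays that role.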
