Regex pattern to remove numeric values from words in PySpark
I am working with a PySpark DataFrame that has a column words of type array<string>. What regex pattern should I use to strip the digits out of each word and to drop the entries that consist only of digits?
+---+----------------------------------------------+
|id |words                                         |
+---+----------------------------------------------+
|564|[fhbgtrj5, 345gjhg, ghth578ghu, 5897, fhrfu44]|
+---+----------------------------------------------+
Expected output:
+---+----------------------------------------------+
|id |words                                         |
+---+----------------------------------------------+
|564|[fhbgtrj, gjhg, ghthghu, fhrfu]               |
+---+----------------------------------------------+
Please help.
You can use transform together with regexp_replace to strip the digits from each element, then use array_remove to drop the empty strings left behind by entries that consisted only of digits. Both transform and array_remove require Spark 2.4 or later.
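from pyspark.sql import functions as F

# Strip digits from every element, then drop the empty strings
# produced by entries that were all digits.
df2 = df.withColumn(
    'words',
    F.expr("array_remove(transform(words, x -> regexp_replace(x, '[0-9]', '')), '')")
)
df2.show(truncate=False)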
+---+-------------------------------+
|id |words                          |
+---+-------------------------------+
|564|[fhbgtrj, gjhg, ghthghu, fhrfu]|
+---+-------------------------------+
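To see what the expression does element by element, here is a minimal plain-Python sketch of the same logic that runs without Spark. The helper name strip_digits is hypothetical and is not part of any PySpark API; it only mirrors the two steps (regexp_replace inside transform, then array_remove).

```python
import re

# Same pattern as in the Spark expression: any single ASCII digit.
DIGITS = re.compile(r"[0-9]")


def strip_digits(words):
    """Hypothetical helper mirroring array_remove(transform(...), '')."""
    # Step 1 (transform + regexp_replace): delete every digit from each word.
    cleaned = [DIGITS.sub("", w) for w in words]
    # Step 2 (array_remove): drop the words that became empty strings.
    return [w for w in cleaned if w != ""]


words = ["fhbgtrj5", "345gjhg", "ghth578ghu", "5897", "fhrfu44"]
print(strip_digits(words))
# ['fhbgtrj', 'gjhg', 'ghthghu', 'fhrfu']
```

The entry 5897 becomes an empty string after the digits are removed, which is why the second step is needed to get the expected four-element array.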