Remove repeated punctuation from a PySpark DataFrame

Question

I need to collapse repeated punctuation marks so that each run of the same mark keeps only a single occurrence.

For example: !!!! -> !
             !!$$ -> !$

I have a dataset that looks like this:

from pyspark.sql import functions as F

temp = spark.createDataFrame([
    (0, "This is Spark!!!!"),
    (1, "I wish Java could use case classes!!##"),
    (2, "Data science is  cool#$@!"),
    (3, "Machine!!$$")
], ["id", "words"])

+---+--------------------------------------+
|id |words                                 |
+---+--------------------------------------+
|0  |This is Spark!!!!                     |
|1  |I wish Java could use case classes!!##|
|2  |Data science is  cool#$@!             |
|3  |Machine!!$$                           |
+---+--------------------------------------+

I tried a regex that removes specific punctuation marks:

df2 = temp.select(
    [F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)

But this does not work. Can anyone tell me how to achieve this in PySpark?

Below is the desired output.

    id  words
0   0   This is Spark!
1   1   I wish Java could use case classes!#
2   2   Data science is cool#$@!
3   3   Machine!$

Answer 1

I wrote this solution in Scala because I do not have a Python environment available, but the PySpark version is almost the same.

First, I declared an array of the symbols you want to collapse:

val puncs = Array("!", "@", "#", "$", "%", "^", "&")

Then I loop through these symbols and modify the dataset:

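// ds1 must be declared with `var` so that it can be reassigned inside the loop.
// The replacement is backslash-escaped as well, because a bare "$" in a
// replacement string is read by the Java regex engine as a group reference.
puncs.foreach(punc => {
  ds1 = ds1
    .withColumn("words", regexp_replace(col("words"), s"(\\${punc}+)", s"\\${punc}"))
})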

The final result:

+---+------------------------------------+
|id |words                               |
+---+------------------------------------+
|0  |This is Spark!                      |
|1  |I wish Java could use case classes!#|
+---+------------------------------------+

I cannot think of a better solution for now, because each run has to be replaced by the same character it is made of.
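Since the question asks for PySpark, the per-symbol loop above could be ported roughly as follows. This is only a sketch: the names PUNCS, collapse_rules and collapse_runs are made up for illustration and do not come from the answer. Spark evaluates both the pattern and the replacement with Java's regex engine, where "$" in a replacement introduces a group reference, so the sketch backslash-escapes the symbol on both sides.

```python
PUNCS = ["!", "@", "#", "$", "%", "^", "&"]


def collapse_rules(puncs=PUNCS):
    """Return one (pattern, replacement) pair per symbol, in Java regex syntax."""
    # A backslash in front of a non-alphanumeric character makes it literal,
    # both in the pattern and in the replacement string.
    return [("\\" + p + "+", "\\" + p) for p in puncs]


def collapse_runs(df, column, puncs=PUNCS):
    """Apply every rule to `column` of a Spark DataFrame, one pass per symbol."""
    # Imported here so that collapse_rules can be used without Spark installed.
    from pyspark.sql import functions as F

    for pattern, replacement in collapse_rules(puncs):
        df = df.withColumn(column, F.regexp_replace(column, pattern, replacement))
    return df
```

With the DataFrame from the question, the call would look like collapse_runs(temp, "words"). Each symbol adds one more withColumn step to the plan, which is the cost of handling the symbols one at a time.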

Answer 2

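You can use a regex with a backreference: ([!$#]) captures one symbol, \1+ matches one or more further copies of that same symbol, and $1 writes the captured symbol back once.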

df2 = temp.select('id',
    F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))

Add more characters between the [] as needed.
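To avoid hand-escaping symbols when extending the brackets, the character class can be generated. The sketch below is an illustration rather than part of the answer: repeated_punct_pattern and collapse are hypothetical names, and it uses Python's own re module so the pattern can be checked locally without a Spark session. Note that Python writes the group reference in the replacement as \1, whereas Spark's regexp_replace expects $1.

```python
import re
import string


def repeated_punct_pattern(chars=string.punctuation):
    """Build a regex matching two or more consecutive copies of one symbol."""
    # re.escape keeps symbols such as ], ^, - and \ literal inside the brackets.
    return "([" + re.escape(chars) + r"])\1+"


def collapse(text, pattern=repeated_punct_pattern()):
    """Replace each run of an identical symbol with a single copy of it."""
    return re.sub(pattern, r"\1", text)


samples = [
    "This is Spark!!!!",
    "I wish Java could use case classes!!##",
    "Data science is  cool#$@!",
    "Machine!!$$",
]
for s in samples:
    print(collapse(s))
```

The same pattern string should be usable on the Spark side, for example F.regexp_replace('words', repeated_punct_pattern(), '$1'), on the assumption that the escaped class is read the same way by Java's regex engine.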

Answer timestamps: Answer 1, 2022-07-22 14:06:54; Answer 2, 2022-07-22 14:14:09.
