
Remove the repeated punctuation from pyspark dataframe

I need to remove repeated punctuation and keep only a single occurrence of each repeated symbol.

For example: !!!! -> !
             !!$$ -> !$

I have a dataset that looks like the one below:

temp = spark.createDataFrame([
    (0, "This is Spark!!!!"),
    (1, "I wish Java could use case classes!!##"),
    (2, "Data science is  cool#$@!"),
    (3, "Machine!!$$")
], ["id", "words"])

+---+--------------------------------------+
|id |words                                 |
+---+--------------------------------------+
|0  |This is Spark!!!!                     |
|1  |I wish Java could use case classes!!##|
|2  |Data science is  cool#$@!             |
|3  |Machine!!$$                           |
+---+--------------------------------------+

I tried a regex that removes specific punctuation characters, shown below:

df2 = temp.select(
    [F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)

But the above is not working. Can anyone tell me how to achieve this in PySpark?

Below is the desired output.

    id  words
0   0   This is Spark!
1   1   I wish Java could use case classes!#
2   2   Data science is cool#$@!
3   3   Machine!$

I wrote this solution in Scala because I do not have a Python environment, but the PySpark version is almost the same.

First, I declared an array of the symbols you want to collapse:

val puncs = Array("!", "@", "#", "$", "%", "^", "&")

Then I loop through these symbols and update the dataset on each pass:

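import java.util.regex.{Matcher, Pattern}

// ds1 must be declared as a var for the reassignment below to compile.
// Pattern.quote and Matcher.quoteReplacement keep symbols such as "$" literal.
puncs.foreach(punc => {
  ds1 = ds1
    .withColumn("words", regexp_replace(col("words"), s"${Pattern.quote(punc)}+", Matcher.quoteReplacement(punc)))
})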

The final result:

+---+------------------------------------+
|id |words                               |
+---+------------------------------------+
|0  |This is Spark!                      |
|1  |I wish Java could use case classes!#|
+---+------------------------------------+

I cannot think of a better solution for now, because each run has to be replaced by the same character it is made of.
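Since the question asks for PySpark, the same per-symbol loop might be ported roughly as follows. This is a sketch rather than a tested port: `collapse_rules` and `collapse_with_spark` are hypothetical helper names, it assumes Python 3.7+ and an existing DataFrame with a `words` column, and it assumes Spark's `regexp_replace` follows Java replacement syntax, where a literal `$` has to be written as `\$`.

```python
import re

PUNCS = ["!", "@", "#", "$", "%", "^", "&"]  # same symbols as the Scala array


def collapse_rules(symbols):
    """Build one (pattern, replacement) pair per symbol (hypothetical helper)."""
    rules = []
    for s in symbols:
        pattern = re.escape(s) + "+"  # one or more of this symbol
        # Java-style replacement: backslash and "$" are escaped so they stay literal
        replacement = s.replace("\\", "\\\\").replace("$", "\\$")
        rules.append((pattern, replacement))
    return rules


def collapse_with_spark(df, column="words", symbols=PUNCS):
    """Apply each rule in turn to the given column (assumes pyspark is installed)."""
    from pyspark.sql import functions as F  # imported lazily so collapse_rules works without Spark

    for pattern, replacement in collapse_rules(symbols):
        df = df.withColumn(column, F.regexp_replace(column, pattern, replacement))
    return df
```

Calling `collapse_with_spark(temp)` on the DataFrame from the question would be the intended usage. Each symbol adds one `withColumn` step, so a long symbol list means a long chain of projections.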

You can use this regex:

df2 = temp.select('id',
    F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))

Add more characters between the [] as needed.
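To avoid editing the character class by hand, the pattern could be assembled from a list of symbols. The sketch below uses Python's built-in `re` module only to check the pattern locally against the sample strings from the question; `build_collapse_pattern` is a hypothetical helper, and Python writes the back-reference in the replacement as `\1`, whereas the Spark call above uses `$1`.

```python
import re


def build_collapse_pattern(symbols):
    """Return a regex matching a run of two or more of the same symbol (hypothetical helper)."""
    char_class = "".join(re.escape(s) for s in symbols)
    return "([" + char_class + r"])\1+"


pattern = build_collapse_pattern("!@#$%^&")

samples = [
    "This is Spark!!!!",
    "I wish Java could use case classes!!##",
    "Data science is  cool#$@!",
    "Machine!!$$",
]
# Python's re.sub uses \1 for the captured symbol; Spark's regexp_replace uses $1.
collapsed = [re.sub(pattern, r"\1", s) for s in samples]
```

If the local check looks right, the same pattern string could presumably be passed to `F.regexp_replace('words', pattern, '$1')`. That relies on the assumption that the escapes produced by `re.escape` are also valid in Java regex, which is worth verifying on a small DataFrame first.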
