Remove the repeated punctuation from pyspark dataframe
I need to collapse each run of repeated punctuation so that only a single occurrence remains.
For example: !!!! -> !
!!$$ -> !$
I have a dataset that looks like this:
temp = spark.createDataFrame([
    (0, "This is Spark!!!!"),
    (1, "I wish Java could use case classes!!##"),
    (2, "Data science is cool#$@!"),
    (3, "Machine!!$$")
], ["id", "words"])
+---+--------------------------------------+
|id |words                                 |
+---+--------------------------------------+
|0  |This is Spark!!!!                     |
|1  |I wish Java could use case classes!!##|
|2  |Data science is cool#$@!              |
|3  |Machine!!$$                           |
+---+--------------------------------------+
I tried a regex that removes specific punctuation characters:
df2 = temp.select(
    [F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)
But the above does not work. Can anyone tell me how to achieve this in PySpark?
Below is the desired output.
id words
0 0 This is Spark!
1 1 I wish Java could use case classes!#
2 2 Data science is cool#$@!
3 3 Machine!$
I am writing the solution in Scala since I do not have a Python environment, but it is almost the same.
First, I declare an array of the symbols you want to collapse:
val puncs = Array("!", "@", "#", "$", "%", "^", "&")
Then I loop through these and modify the dataset:
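puncs.foreach(punc => {
  // ds1 must be declared as a var; quoteReplacement keeps "$" literal in the replacement
  ds1 = ds1
    .withColumn("words", regexp_replace(col("words"), s"(\\${punc}+)",
      java.util.regex.Matcher.quoteReplacement(punc)))
})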
The final result:
+---+------------------------------------+
|id |words                               |
+---+------------------------------------+
|0  |This is Spark!                      |
|1  |I wish Java could use case classes!#|
+---+------------------------------------+
I cannot think of a better solution for now, because each run has to be replaced with the same character it consists of.
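For readers without a Scala setup, the per-character loop above can be sketched in plain Python with the standard `re` module. This is only an illustration of the idea on ordinary strings, not Spark code; the function name `collapse_each` and the list `PUNCS` are made up for this example.

```python
import re

# Hypothetical list mirroring the Scala array above; extend as needed.
PUNCS = ["!", "@", "#", "$", "%", "^", "&"]

def collapse_each(text, puncs=PUNCS):
    """Collapse every run of one punctuation mark into a single copy."""
    for punc in puncs:
        # re.escape makes characters such as "$" and "^" literal in the pattern.
        # A function is used as the replacement so the character is inserted
        # as-is, with no special handling of the replacement string.
        text = re.sub(re.escape(punc) + "+", lambda m, p=punc: p, text)
    return text

if __name__ == "__main__":
    for row in ["This is Spark!!!!", "Machine!!$$", "Data science is cool#$@!"]:
        print(collapse_each(row))
```

In PySpark the same loop could presumably be written by reassigning the DataFrame with `withColumn` and `F.regexp_replace` once per character; that mapping is an assumption and is not exercised here.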
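You can use this regex.
from pyspark.sql import functions as F
# ([!$#]) captures one punctuation mark, \1+ matches its repeats, $1 keeps a single copy
df2 = temp.select('id',
    F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))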
Add more characters between the [] to suit your needs.
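As a quick way to check the backreference pattern against the sample rows without a Spark session, here is a minimal sketch using Python's `re` module. It is an assumption that the pattern behaves the same way under Spark's Java regex engine for these characters; note that Python writes the group reference in the replacement as `\1`, where the Spark call above uses `$1`. The names `PUNCTUATION` and `collapse_repeats` are hypothetical.

```python
import re

# Hypothetical list of punctuation to collapse; extend as needed.
PUNCTUATION = ["!", "@", "#", "$", "%", "^", "&"]

# Build the character class from the list. re.escape keeps characters
# such as "^" literal when they appear inside the brackets.
char_class = "[" + "".join(re.escape(p) for p in PUNCTUATION) + "]"

# (class) captures one punctuation mark; \1+ matches one or more repeats of it.
pattern = re.compile("(" + char_class + r")\1+")

def collapse_repeats(text):
    """Replace each run of one repeated punctuation mark with a single copy."""
    return pattern.sub(r"\1", text)

if __name__ == "__main__":
    rows = [
        "This is Spark!!!!",
        "I wish Java could use case classes!!##",
        "Data science is cool#$@!",
        "Machine!!$$",
    ]
    for row in rows:
        print(collapse_repeats(row))
```

Because the backreference only matches repeats of the same character, a mixed sequence such as `#$@!` is left unchanged, which matches the desired output in the question.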