PySpark DataFrame: removing duplicates in an array column

Question

I want to remove some duplicated words from a column of a PySpark DataFrame.

My Spark version:

  2.4.5

Python 3 code:

  from pyspark.sql import functions as F

  test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
  # Wrap the string in an array because the original large DataFrame has an array-typed column.
  t3 = test_df.withColumn("text", F.array("text"))

  t4 = t3.withColumn('text', F.expr("transform(text, x -> lower(x))"))
  t5 = t4.withColumn('text', F.array_distinct("text"))
  t5.show(1, 120)

But I get:

 +--------------------------------------------------------+
 |                                                    text|
 +--------------------------------------------------------+
 |  [i like this book and this book be downloaded on line]|
 +--------------------------------------------------------+

I need to remove

 book and this

It seems that array_distinct cannot filter them out?

Thanks.

Answer 1

You can use the Spark SQL functions lcase, split, array_distinct and array_join, called through F.expr.

For example: F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")

Here is the working code:

import pyspark.sql.functions as F

# df is assumed to have a string column named "text" (e.g. test_df from the question)
(df
 .withColumn("text_new",
    F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')"))
 .show(truncate=False))

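Explanation:

Here you first convert everything to lowercase with lcase(text), then split the string on spaces into an array with split(text, ' '), which produces

[i, like, this, book, and, this, book, be, downloaded, on, line]

You then pass it to array_distinct, which produces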

[i, like, this, book, and, be, downloaded, on, line]

Finally, join the elements back together with spaces using array_join:

i like this book and be downloaded on line
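
The three steps above can be mirrored in plain Python to check what the expression should return for a given string. This is only a sketch of the logic, not Spark code, and dedupe_words is a hypothetical helper name.

```python
def dedupe_words(text: str) -> str:
    """Plain-Python mirror of array_join(array_distinct(split(lcase(text), ' ')), ' ')."""
    words = text.lower().split(" ")        # lcase, then split on a single space
    distinct = list(dict.fromkeys(words))  # keep the first occurrence of each word, in order
    return " ".join(distinct)              # join the remaining words with spaces


print(dedupe_words("I like this Book and this book be DOWNLOADED on line"))
# i like this book and be downloaded on line
```

Note that every later repeat of a word is dropped, not only adjacent repeats, which matches what array_distinct does in the answer.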

Solution 1
0 2020-09-15 09:17:47