Separate string by white space in pyspark

I have a column with search queries represented as strings. I want to split every string into separate words.

Let's say I have this data frame:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true") \
    .option("delimiter", "\t") \
    .option("inferSchema", "true") \
    .csv("/content/drive/MyDrive/my_data.txt")

# collect all queries per user into one array
data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))

# drop duplicate queries within each array
data = data.withColumn("New_Data", F.array_distinct("Query"))

Z = data.drop(data.Query)

+------+------------------------+
|AnonID|            New_Data    |
+------+------------------------+
|   142|[Big House, Green frog] |
+------+------------------------+

And I want output like this:

+------+--------------------------+
|AnonID|            New_Data      |
+------+--------------------------+
|   142|[Big, House, Green, frog] |
+------+--------------------------+

I have searched older posts, but I could only find solutions that put each word into a separate column, which is not what I want.

To split each string in the array into separate words, you can use the explode and split functions in Spark. The exploded words can then be gathered back into an array per group with collect_list.

from pyspark.sql.functions import explode, split, collect_list

# one row per phrase, then one row per word
data = data.withColumn("Phrase", explode(data["New_Data"]))
data = data.withColumn("Word", explode(split(data["Phrase"], " ")))
data = data.groupBy("AnonID").agg(collect_list("Word").alias("New_Data"))
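To make the semantics concrete, here is a plain-Python sketch (not Spark code) of what exploding, splitting, and regrouping computes for a single AnonID group; the function name is chosen for illustration only:

```python
def explode_split_collect(queries):
    """Plain-Python analog of explode + split + collect_list on one group."""
    words = []
    for phrase in queries:          # explode: one element per array entry
        for word in phrase.split(" "):  # split: break the phrase on spaces
            words.append(word)      # collect_list: gather words back into an array
    return words

print(explode_split_collect(["Big House", "Green frog"]))
# → ['Big', 'House', 'Green', 'frog']
```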

Alternatively, you can do the collect_list first, then use the transform function to split each array element, flatten the nested arrays, and finally apply array_distinct. Please check out the code and output below.

df = spark.createDataFrame([[142, "Big House"], [142, "Big Green Frog"]],
                           ["AnonID", "Query"])

import pyspark.sql.functions as F

data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))

# split each phrase into words, flatten the nested arrays, drop duplicates
data.withColumn(
    "Query",
    F.array_distinct(F.flatten(F.transform(data["Query"], lambda x: F.split(x, " "))))
).show(2, False)

+------+-------------------------+
|AnonID|Query                    |
+------+-------------------------+
|142   |[Big, House, Green, Frog]|
+------+-------------------------+
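The transform → flatten → array_distinct chain operates on each row's array without exploding it into extra rows. As a hedged sketch, it computes the equivalent of this plain Python per array (function name is illustrative, not a Spark API):

```python
def split_flatten_distinct(queries):
    """Plain-Python analog of transform(split) + flatten + array_distinct."""
    # transform(..., split): split each query into a list of words
    nested = [q.split(" ") for q in queries]
    # flatten: concatenate the per-query word lists into one list
    flat = [word for words in nested for word in words]
    # array_distinct: drop duplicates, keeping first-seen order
    seen, out = set(), []
    for word in flat:
        if word not in seen:
            seen.add(word)
            out.append(word)
    return out

print(split_flatten_distinct(["Big House", "Big Green Frog"]))
# → ['Big', 'House', 'Green', 'Frog']
```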
