Separate string by white space in pyspark

I have a column with search queries represented as strings. I want to split every string into separate words.

Let's say I have this data frame:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true") \
    .option("delimiter", "\t") \
    .option("inferSchema", "true") \
    .csv("/content/drive/MyDrive/my_data.txt")

# collect all queries per user into one array
data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))

# drop duplicate queries within each array
data = data.withColumn("New_Data", F.array_distinct("Query"))

Z = data.drop(data.Query)

+------+------------------------+
|AnonID|            New_Data    |
+------+------------------------+
|   142|[Big House, Green frog] |
+------+------------------------+

And I want output like this:

+------+--------------------------+
|AnonID|            New_Data      |
+------+--------------------------+
|   142|[Big, House, Green, frog] |
+------+--------------------------+

I have searched older posts, but I could only find solutions that put each word into a separate column, which is not what I want.

To split each string in the array into separate words, you can use the explode and split functions in Spark. The exploded words can then be gathered back into an array per group with collect_list.

from pyspark.sql.functions import explode, split, collect_list

# one row per phrase, then one row per word
data = data.withColumn("Phrase", explode(data["New_Data"]))
data = data.withColumn("Word", explode(split(data["Phrase"], " ")))
data = data.groupBy("AnonID").agg(collect_list("Word").alias("New_Data"))
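To make the semantics concrete, here is a plain-Python sketch (not Spark code) of what exploding, splitting, and regrouping computes for a single AnonID group; the function name is chosen for illustration only:

```python
def explode_split_collect(queries):
    """Plain-Python analog of explode + split + collect_list on one group."""
    words = []
    for phrase in queries:          # explode: one element per array entry
        for word in phrase.split(" "):  # split: break the phrase on spaces
            words.append(word)      # collect_list: gather words back into an array
    return words

print(explode_split_collect(["Big House", "Green frog"]))
# → ['Big', 'House', 'Green', 'frog']
```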

Alternatively, you can do the collect_list first, then use the transform function to split each array element, flatten the nested arrays, and finally apply array_distinct. Please check out the code and output below.

df = spark.createDataFrame([[142, "Big House"], [142, "Big Green Frog"]],
                           ["AnonID", "Query"])

import pyspark.sql.functions as F

data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))

# split each phrase into words, flatten the nested arrays, drop duplicates
data.withColumn(
    "Query",
    F.array_distinct(F.flatten(F.transform(data["Query"], lambda x: F.split(x, " "))))
).show(2, False)

+------+-------------------------+
|AnonID|Query                    |
+------+-------------------------+
|142   |[Big, House, Green, Frog]|
+------+-------------------------+
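The transform → flatten → array_distinct chain operates on each row's array without exploding it into extra rows. As a hedged sketch, it computes the equivalent of this plain Python per array (function name is illustrative, not a Spark API):

```python
def split_flatten_distinct(queries):
    """Plain-Python analog of transform(split) + flatten + array_distinct."""
    # transform(..., split): split each query into a list of words
    nested = [q.split(" ") for q in queries]
    # flatten: concatenate the per-query word lists into one list
    flat = [word for words in nested for word in words]
    # array_distinct: drop duplicates, keeping first-seen order
    seen, out = set(), []
    for word in flat:
        if word not in seen:
            seen.add(word)
            out.append(word)
    return out

print(split_flatten_distinct(["Big House", "Big Green Frog"]))
# → ['Big', 'House', 'Green', 'Frog']
```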
