
Separate string by white space in pyspark

I have a column of search queries stored as strings. I want to split every query string into its individual words.

Let's say I have this data frame:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Tab-delimited file with a header row
df = spark.read.option("header", "true") \
    .option("delimiter", "\t") \
    .option("inferSchema", "true") \
    .csv("/content/drive/MyDrive/my_data.txt")

# Collect all queries per user, then drop duplicate queries
data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))
data = data.withColumn("New_Data", F.array_distinct("Query"))

Z = data.drop("Query")

+------+------------------------+
|AnonID|            New_Data    |
+------+------------------------+
|   142|[Big House, Green frog] |
+------+------------------------+

And I want output like this:

+------+--------------------------+
|AnonID|            New_Data      |
+------+--------------------------+
|   142|[Big, House, Green, frog] |
+------+--------------------------+

I searched older posts, but I could only find answers that split each word into a separate column, which is not what I want.

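split works on a string column, not on an array, so first explode the array into one row per query, then split each query on spaces and explode again to get one row per word. After that, group by AnonID and gather the words back into an array with collect_list. The array function is not an aggregate, so it cannot be used inside agg for this. Note that collect_list does not guarantee the order of the collected words.

from pyspark.sql import functions as F

# One row per query, then one row per word
words = (data
    .select("AnonID", F.explode("New_Data").alias("Query"))
    .select("AnonID", F.explode(F.split("Query", " ")).alias("Word")))

# Gather the words back into a single array per user
data = words.groupBy("AnonID").agg(F.collect_list("Word").alias("New_Data"))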

You can run collect_list first, use transform to split each array element into words, flatten the resulting nested array, and finally apply array_distinct. The code and its output are shown below.

import pyspark.sql.functions as F

df = spark.createDataFrame([[142, "Big House"], [142, "Big Green Frog"]], ["AnonID", "Query"])

data = df.groupBy("AnonID").agg(F.collect_list("Query").alias("Query"))

# F.transform with a Python lambda requires Spark 3.1 or later
data.withColumn(
    "Query",
    F.array_distinct(F.flatten(F.transform(data["Query"], lambda x: F.split(x, " "))))
).show(2, False)

+------+-------------------------+
|AnonID|Query                    |
+------+-------------------------+
|142   |[Big, House, Green, Frog]|
+------+-------------------------+
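
To make the three steps easier to follow, here is a minimal plain-Python sketch of what the transform, flatten and array_distinct chain does to one user's list of queries. It is only a model for reasoning about the result, not Spark code: the helper name split_flatten_distinct is hypothetical, and it assumes array_distinct keeps the first occurrence of each word, which is what the output above shows in practice.

```python
def split_flatten_distinct(queries):
    """Plain-Python model of array_distinct(flatten(transform(queries, split)))."""
    # transform + split: each query string becomes a list of words
    nested = [query.split(" ") for query in queries]
    # flatten: merge the nested lists into a single flat list
    flat = [word for words in nested for word in words]
    # array_distinct: drop duplicates, keeping the first occurrence of each word
    return list(dict.fromkeys(flat))


print(split_flatten_distinct(["Big House", "Big Green Frog"]))
# ['Big', 'House', 'Green', 'Frog']
```

The comparison is case-sensitive, so "frog" and "Frog" would both be kept. If queries may contain repeated spaces, splitting on a single space yields empty strings; since Spark's split takes a regular expression, a pattern such as "\\s+" is one way to avoid that.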
