How to create N duplicated rows in PySpark DataFrame?

I have the following PySpark DataFrame df:

itemid  eventid    timestamp     timestamp_end   n
134     30         2016-07-02    2016-07-09      2
134     32         2016-07-03    2016-07-10      2
125     32         2016-07-10    2016-07-17      1

I want to convert this DataFrame into the following one:

itemid  eventid    timestamp_start   timestamp     timestamp_end
134     30         2016-07-02        2016-07-02    2016-07-09
134     32         2016-07-02        2016-07-03    2016-07-09
134     30         2016-07-03        2016-07-02    2016-07-10
134     32         2016-07-03        2016-07-03    2016-07-10
125     32         2016-07-10        2016-07-10    2016-07-17

Basically, for each unique value of itemid, I need to take every timestamp in the group and put it into a new column timestamp_start. Thus, each row within an itemid group should be duplicated n times, where n is the number of records in the group. Hopefully I explained it clearly.

This is my initial DataFrame in PySpark:

from pyspark.sql.functions import col, expr

df = (
    sc.parallelize([
        (134, 30, "2016-07-02", "2016-07-09", 2),
        (134, 32, "2016-07-03", "2016-07-10", 2),
        (125, 32, "2016-07-10", "2016-07-17", 1),
    ]).toDF(["itemid", "eventid", "timestamp", "timestamp_end", "n"])
    .withColumn("timestamp", col("timestamp").cast("timestamp"))
    .withColumn("timestamp_end", col("timestamp_end").cast("timestamp"))
)

So far I managed to copy rows n times:

# array_repeat(n, int(n)) builds an array of n copies of n, and explode
# turns each array element into its own row, repeating the row n times.
new_df = df.withColumn("n", expr("explode(array_repeat(n, int(n)))"))
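
As a side note: if n were not already stored as a column, a minimal sketch (my own variant, assuming the same df without the n column) could derive it as the group size with a window count first:

from pyspark.sql.functions import count, expr, lit
from pyspark.sql import Window

# Hypothetical variant: compute n as the number of rows per itemid group,
# then repeat each row that many times.
w = Window.partitionBy("itemid")
new_df = (
    df.withColumn("n", count(lit(1)).over(w))
      .withColumn("n", expr("explode(array_repeat(n, int(n)))"))
)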

But how can I create timestamp_start as shown in the example above?

Thanks.

IIUC, you can use the Window function collect_list to gather all (timestamp, timestamp_end) pairs in a group into a list, and then use the Spark SQL builtin function inline/inline_outer to explode the resulting array of structs (inline_outer, unlike inline, still yields a row of NULLs when the array is empty or NULL):

from pyspark.sql.functions import collect_list, expr
from pyspark.sql import Window

w1 = Window.partitionBy('itemid')

df.withColumn(
    'timestamp_range',
    # collect all (timestamp_start, timestamp_end) structs in the itemid group
    collect_list(expr("(timestamp as timestamp_start, timestamp_end)")).over(w1)
).selectExpr(
    'itemid',
    'eventid',
    'timestamp',
    # explode the array of structs into one row per struct
    'inline_outer(timestamp_range)'
).show()
+------+-------+----------+---------------+-------------+
|itemid|eventid| timestamp|timestamp_start|timestamp_end|
+------+-------+----------+---------------+-------------+
|   134|     30|2016-07-02|     2016-07-02|   2016-07-09|
|   134|     30|2016-07-02|     2016-07-03|   2016-07-10|
|   134|     32|2016-07-03|     2016-07-02|   2016-07-09|
|   134|     32|2016-07-03|     2016-07-03|   2016-07-10|
|   125|     32|2016-07-10|     2016-07-10|   2016-07-17|
+------+-------+----------+---------------+-------------+

Here, timestamp_range is a collect_list of the following named_struct (in Spark SQL syntax):

(timestamp as timestamp_start, timestamp_end)

which is the same as the following:

named_struct('timestamp_start', timestamp, 'timestamp_end', timestamp_end)
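
For reference, a minimal sketch of the same logic using only the DataFrame API (my own variant, not from the original answer): F.struct builds the same struct, and explode replaces inline_outer; the intermediate column name r is arbitrary.

from pyspark.sql import functions as F
from pyspark.sql import Window

w1 = Window.partitionBy('itemid')

result = (
    df.withColumn(
        'timestamp_range',
        # same struct as above, built with F.struct instead of SQL syntax
        F.collect_list(
            F.struct(F.col('timestamp').alias('timestamp_start'),
                     F.col('timestamp_end'))
        ).over(w1)
    )
    # one output row per collected struct, then flatten its fields
    .withColumn('r', F.explode('timestamp_range'))
    .select('itemid', 'eventid', 'timestamp',
            'r.timestamp_start', 'r.timestamp_end')
)
result.show()

Both versions are equivalent; the selectExpr/inline_outer form simply explodes and flattens the struct fields in a single step.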
