I have the following PySpark DataFrame df:
itemid  eventid  timestamp   timestamp_end  n
134     30       2016-07-02  2016-07-09     2
134     32       2016-07-03  2016-07-10     2
125     32       2016-07-10  2016-07-17     1
I want to convert this DataFrame into the following one:
itemid  eventid  timestamp_start  timestamp   timestamp_end
134     30       2016-07-02       2016-07-02  2016-07-09
134     32       2016-07-02       2016-07-03  2016-07-09
134     30       2016-07-03       2016-07-02  2016-07-10
134     32       2016-07-03       2016-07-03  2016-07-10
125     32       2016-07-10       2016-07-10  2016-07-17
Basically, for each unique value of itemid, I need to take every timestamp in the group and put it into a new column timestamp_start. Thus, each row within an itemid group should be duplicated n times, where n is the number of records in that group. Hopefully I explained it clearly.
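To make the transformation concrete, here is a plain-Python sketch of the same per-group logic (no Spark involved; the variable names rows and result are my own). Each row contributes its (timestamp, timestamp_end) as the (timestamp_start, timestamp_end) for every row in the same itemid group:

```python
from itertools import groupby

# Input rows: (itemid, eventid, timestamp, timestamp_end); rows with the
# same itemid are assumed to be adjacent, as in the example above.
rows = [
    (134, 30, "2016-07-02", "2016-07-09"),
    (134, 32, "2016-07-03", "2016-07-10"),
    (125, 32, "2016-07-10", "2016-07-17"),
]

result = []
for itemid, grp in groupby(rows, key=lambda r: r[0]):
    grp = list(grp)
    # Row A supplies (timestamp_start, timestamp_end);
    # row B supplies (eventid, timestamp).
    for _, _, ts_start, ts_end in grp:
        for _, eventid, ts, _ in grp:
            result.append((itemid, eventid, ts_start, ts, ts_end))
```

Running this yields exactly the five rows of the desired output, which is why each group of size n expands to n * n rows.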
This is my initial DataFrame in PySpark:
from pyspark.sql.functions import col, expr

df = (
    sc.parallelize([
        (134, 30, "2016-07-02", "2016-07-09", 2),
        (134, 32, "2016-07-03", "2016-07-10", 2),
        (125, 32, "2016-07-10", "2016-07-17", 1),
    ]).toDF(["itemid", "eventid", "timestamp", "timestamp_end", "n"])
    .withColumn("timestamp", col("timestamp").cast("timestamp"))
    .withColumn("timestamp_end", col("timestamp_end").cast("timestamp"))
)
So far I managed to copy each row n times:

new_df = df.withColumn("n", expr("explode(array_repeat(n, int(n)))"))
But how can I create timestamp_start as shown in the example above?
Thanks.
IIUC, you can use the Window function collect_list to gather all (timestamp, timestamp_end) pairs in a group, and then use the Spark SQL builtin function inline/inline_outer to explode the resulting array of structs:
from pyspark.sql.functions import collect_list, expr
from pyspark.sql import Window
w1 = Window.partitionBy('itemid')
df.withColumn(
    'timestamp_range',
    collect_list(expr("(timestamp as timestamp_start, timestamp_end)")).over(w1)
).selectExpr(
    'itemid',
    'eventid',
    'timestamp',
    'inline_outer(timestamp_range)'
).show()
+------+-------+----------+---------------+-------------+
|itemid|eventid| timestamp|timestamp_start|timestamp_end|
+------+-------+----------+---------------+-------------+
| 134| 30|2016-07-02| 2016-07-02| 2016-07-09|
| 134| 30|2016-07-02| 2016-07-03| 2016-07-10|
| 134| 32|2016-07-03| 2016-07-02| 2016-07-09|
| 134| 32|2016-07-03| 2016-07-03| 2016-07-10|
| 125| 32|2016-07-10| 2016-07-10| 2016-07-17|
+------+-------+----------+---------------+-------------+
Where timestamp_range is a collect_list of the following named_struct (in Spark SQL syntax):
(timestamp as timestamp_start, timestamp_end)
which is the same as following:
named_struct('timestamp_start', timestamp, 'timestamp_end', timestamp_end)