Pyspark: Rename a dictionary key which is within a DataFrame column

Question

After some processing, I get a DataFrame in which one column holds a dictionary. I now want to rename the keys of that dictionary: "_1" to "product_id" and "_2" to "timestamp".

Here is the code of the processing:

df1 = data.select("user_id","product_id","timestamp_gmt").rdd.map(lambda x: (x[0], (x[1],x[2]))).groupByKey()\
.map(lambda x:(x[0], list(x[1]))).toDF()\
.withColumnRenamed('_1', 'user_id')\
.withColumnRenamed('_2', 'purchase_info')

Here is the result:

Answer 1

Spark 2.0+

Use collect_list and struct:

from pyspark.sql.functions import collect_list, struct, col

df = sc.parallelize([
    (1, 100, "2012-01-01 00:00:00"),
    (1, 200, "2016-04-04 00:00:01")
]).toDF(["user_id","product_id","timestamp_gmt"])

pi = (collect_list(struct(col("product_id"), col("timestamp_gmt")))
    .alias("purchase_info"))

df.groupBy("user_id").agg(pi)

Spark < 2.0

Use Rows:

(df
    .select("user_id", struct(col("product_id"), col("timestamp_gmt")))
    .rdd.groupByKey()
    .mapValues(list)  # materialize each group as a list so toDF can infer an array
    .toDF(["user_id", "purchase_info"]))

This is arguably more elegant, but it should have a similar effect to replacing the function you pass to map with the following (Row is imported from pyspark.sql):

lambda x: (x[0], Row(product_id=x[1], timestamp_gmt=x[2]))

On a side note, these are not dictionaries (MapType) but structs (StructType).

solution1
3 ACCEPTED 2016-05-26 02:00:18
