PySpark: Renaming dictionary keys in a DataFrame column

Question

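After some processing, I end up with a DataFrame in which one column holds a dictionary. I now want to change the keys of the dictionary in that column: from "_1" to "product_id" and from "_2" to "timestamp".

This is the processing code: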

df1 = data.select("user_id","product_id","timestamp_gmt").rdd.map(lambda x: (x[0], (x[1],x[2]))).groupByKey()\
.map(lambda x:(x[0], list(x[1]))).toDF()\
.withColumnRenamed('_1', 'user_id')\
.withColumnRenamed('_2', 'purchase_info')

The result looks like this:

Answer 1

Spark 2.0+

Use collect_list and struct:

from pyspark.sql.functions import collect_list, struct, col

df = sc.parallelize([
    (1, 100, "2012-01-01 00:00:00"),
    (1, 200, "2016-04-04 00:00:01")
]).toDF(["user_id","product_id","timestamp_gmt"])

pi = (collect_list(struct(col("product_id"), col("timestamp_gmt")))
    .alias("purchase_info"))

df.groupBy("user_id").agg(pi)

Spark < 2.0

Use Rows:

(df
    .select("user_id", struct(col("product_id"), col("timestamp_gmt")))
    .rdd.groupByKey()
    .toDF(["user_id", "purchase_info"]))

This is arguably more elegant, but it should have a similar effect to replacing the function you pass to map with:

from pyspark.sql import Row  # needed for the named fields below
lambda x: (x[0], Row(product_id=x[1], timestamp_gmt=x[2]))

As a side note, these are not dictionaries (MapType) but structs (StructType).
