
Pyspark: Rename a dictionary key which is within a DataFrame column

After some processing I end up with a DataFrame that has a dictionary-like structure inside one of its columns. Now I want to rename the keys of that structure: from "_1" to "product_id", and from "_2" to "timestamp".

Here is the processing code:

df1 = data.select("user_id","product_id","timestamp_gmt").rdd.map(lambda x: (x[0], (x[1],x[2]))).groupByKey()\
.map(lambda x:(x[0], list(x[1]))).toDF()\
.withColumnRenamed('_1', 'user_id')\
.withColumnRenamed('_2', 'purchase_info')

The result looks like this:
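For context, the schema at this point typically looks roughly like the sketch below (types assumed from the sample data in the answer, nullability omitted). The nested tuple fields get the default names _1 and _2, which is exactly what the question wants to rename:

df1.printSchema()
# root
#  |-- user_id: long
#  |-- purchase_info: array
#  |    |-- element: struct
#  |    |    |-- _1: long
#  |    |    |-- _2: string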

Spark 2.0+

Use collect_list with struct:

from pyspark.sql.functions import collect_list, struct, col

df = sc.parallelize([
    (1, 100, "2012-01-01 00:00:00"),
    (1, 200, "2016-04-04 00:00:01")
]).toDF(["user_id","product_id","timestamp_gmt"])

pi = (collect_list(struct(col("product_id"), col("timestamp_gmt")))
    .alias("purchase_info"))

df.groupBy("user_id").agg(pi)
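A quick way to verify is to inspect the aggregated schema; because the struct is built from the named columns, its fields already carry the desired names and no renaming is needed (output sketched approximately, nullability omitted):

result = df.groupBy("user_id").agg(pi)
result.printSchema()
# root
#  |-- user_id: long
#  |-- purchase_info: array
#  |    |-- element: struct
#  |    |    |-- product_id: long
#  |    |    |-- timestamp_gmt: string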

Spark < 2.0

Use Rows:

(df
    .select("user_id", struct(col("product_id"), col("timestamp_gmt")))
    .rdd.groupByKey()
    .toDF(["user_id", "purchase_info"]))

This is arguably more elegant, but it has a similar effect to replacing the function you pass to map with:

lambda x: (x[0], Row(product_id=x[1], timestamp_gmt=x[2]))
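Putting that together, a minimal sketch of the map-based variant of the original pipeline (it assumes the same `data` DataFrame from the question):

from pyspark.sql import Row

df1 = (data.select("user_id", "product_id", "timestamp_gmt")
    .rdd
    # Row gives the nested values proper field names instead of _1/_2
    .map(lambda x: (x[0], Row(product_id=x[1], timestamp_gmt=x[2])))
    .groupByKey()
    .map(lambda x: (x[0], list(x[1])))
    .toDF(["user_id", "purchase_info"]))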

Note that these are not dictionaries (MapType) but structs (StructType).
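In practice that mostly matters for how you access the values: struct fields are referenced by name with dot notation rather than looked up like dictionary keys. A small sketch, reusing df and pi from the Spark 2.0+ example above:

from pyspark.sql.functions import col, explode

grouped = df.groupBy("user_id").agg(pi)

(grouped
    # one row per (product_id, timestamp_gmt) struct
    .select("user_id", explode("purchase_info").alias("p"))
    # struct fields are addressed by name
    .select("user_id", col("p.product_id"), col("p.timestamp_gmt"))
    .show())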

