How to have a nested structure with reduceByKey (pyspark)?

I'm working with Spark (pyspark) on a data set which I want to partition based on 3 values and write back to S3. The data set looks like below -

customerId, productId, createDate

I would like to partition this data by customerId, then productId, then createDate. So when I write this partitioned data to S3, it should have the below structure -

customerId=1
  productId='A1'
    createDate=2019-10
    createDate=2019-11
    createDate=2019-12
  productId='A2'
    createDate=2019-10
    createDate=2019-11
    createDate=2019-12
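
(As a side note, this key=value folder layout is exactly what Spark's DataFrame writer produces when partitioning on those columns. Below is a minimal sketch, assuming the data is already loaded into a DataFrame named df; the S3 bucket and path are placeholders.)

# sketch only: partitioned write producing customerId=.../productId=.../createDate=... folders
df.write \
  .partitionBy("customerId", "productId", "createDate") \
  .mode("overwrite") \
  .json("s3://my-bucket/output")   # placeholder bucket/path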

Below is the code that I'm using to create the partition.

import json
rdd = sc.textFile("data.json").map(json.loads)   # sc is the SparkContext; parse each JSON line into a dict
rdd.map(lambda r: (r["customerId"], r["productId"], r["createDate"])).distinct() \
   .map(lambda r: (r[0], ([r[1]], [r[2]]))).reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])).collect()

[('1', (['A1', 'A2'], ['2019-12', '2019-11', '2019-10', '2019-12', '2019-11', '2019-10']))]

This code gives me a flat structure, not the nested one I mentioned. Is it possible to transform it the way I describe? Any pointers are highly appreciated.

First, read your JSON file into a DataFrame.

# read the JSON file directly into a DataFrame
df = spark.read.json("/data.json")
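
Note that spark.read.json expects one JSON object per line by default. A hypothetical data.json consistent with the columns in the question might look like this (illustrative values only):

{"customerId": "1", "productId": "A1", "createDate": "2019-10"}
{"customerId": "1", "productId": "A2", "createDate": "2019-11"}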

Then use groupby and collect_list to get the desired format.

import pyspark.sql.functions as func
df.groupby('customerId', 'productId').agg(func.collect_list('createDate')).collect()
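
The call above yields one row per (customerId, productId) pair with the dates collected into a list. If a second level of nesting is wanted (products per customer, with the dates nested inside each product), the same aggregation can be applied twice. The sketch below assumes df has the three columns from the question; the column aliases createDates and products are made-up names:

import pyspark.sql.functions as func

nested = (df.groupby('customerId', 'productId')
            .agg(func.collect_list('createDate').alias('createDates'))   # dates nested per product
            .groupby('customerId')
            .agg(func.collect_list(func.struct('productId', 'createDates')).alias('products')))   # products nested per customer
nested.show(truncate=False)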
