Generate multiple rows from a single row in Spark
I have some data in Cassandra with the following data model:
transaction_id : uuid
start_date: timestamp
end_date: timestamp
PRIMARY KEY(transaction_id)
Now I want to transform this data into something like this:
aggregation_date : timestamp
number_of_active_transaction_0 : int
number_of_active_transaction_1 : int
number_of_active_transaction_2 : int
...
number_of_open_transaction_23 : int
PRIMARY KEY((aggregation_date))
Currently I have a function that takes the start and end dates and returns a tuple of transaction_date (just the date part) and a 24-element array holding 1 for each hour in which the transaction was active. I map the original RDD into a PairRDD with transaction_date (just the date part) as the key and the array as the value. I then reduce by key, adding the arrays element by element, to get the desired output.
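The reduce step described above amounts to an element-wise sum of the hourly arrays that share a key. A minimal sketch of that step follows, using plain Scala collections as a stand-in for the pair RDD so it runs without a cluster. The names `HourlySum`, `addHourly`, and `sumByKey` are hypothetical and not part of the original code.

```scala
object HourlySum {
  // Hypothetical helper: element-wise sum of two hourly arrays of equal length.
  def addHourly(a: Array[Int], b: Array[Int]): Array[Int] =
    a.zip(b).map { case (x, y) => x + y }

  // Local stand-in for a reduce by key: group rows by key, then sum each group's arrays.
  def sumByKey[K](rows: Seq[(K, Array[Int])]): Map[K, Array[Int]] =
    rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(addHourly) }
}
```

With Spark, the same `addHourly` function could presumably be passed straight to `reduceByKey` on the pair RDD.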
The problem is that some transactions start late at night and finish after midnight. In those cases I want my function to return 2 rows, so that each such transaction contributes 2 rows to the resulting RDD.
Spark version: 1.2.2
API used is Scala
Spark Cassandra connector version 1.2.2
You will likely want to use flatMap. With flatMap you can output multiple (including zero) elements for each input.
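To illustrate, here is a minimal sketch of a function that returns one row per calendar day a transaction touches, so a transaction crossing midnight yields 2 rows. It makes several assumptions that are not in the question: timestamps are non-negative epoch milliseconds interpreted in UTC, the end time is treated as inclusive, the key is a day index (days since the epoch) rather than a date object, and the names `SplitByDay` and `perDayActivity` are hypothetical.

```scala
object SplitByDay {
  val HourMs: Long = 3600000L
  val DayMs: Long = 24 * HourMs

  // Hypothetical helper: returns (dayIndex, hourlyFlags) for each UTC day between
  // startMs and endMs. The result is empty when endMs falls on an earlier day than startMs.
  def perDayActivity(startMs: Long, endMs: Long): Seq[(Long, Array[Int])] = {
    val firstDay = startMs / DayMs
    val lastDay = endMs / DayMs
    (firstDay to lastDay).map { day =>
      val dayStart = day * DayMs
      // Clip the transaction to the boundaries of this day.
      val from = math.max(startMs, dayStart)
      val to = math.min(endMs, dayStart + DayMs - 1)
      val firstHour = ((from - dayStart) / HourMs).toInt
      val lastHour = ((to - dayStart) / HourMs).toInt
      // 1 for every hour slot the transaction overlaps on this day, 0 otherwise.
      val flags = Array.tabulate(24)(h => if (h >= firstHour && h <= lastHour) 1 else 0)
      (day, flags)
    }
  }
}
```

Given an RDD of (start, end) pairs, something like `rdd.flatMap { case (s, e) => SplitByDay.perDayActivity(s, e) }` should then produce the pair RDD to reduce by key, since `flatMap` accepts any function that returns a collection per input element.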
However, you also mention that you are performing a reduce on the key. If it is during this phase that you need to output multiple elements, you can produce a list during the reduceByKey and then do an identity flatMap, which will flatten all the results.
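As a rough illustration of the identity flatMap, the sketch below uses a plain Scala `Seq` as a stand-in for the reduced pair RDD. The sample data and the names `IdentityFlatMap`, `reducedByKey`, and `flattened` are made up for the example.

```scala
object IdentityFlatMap {
  // Stand-in for the result of a reduce that concatenates lists: one list of rows per key.
  val reducedByKey: Seq[(String, List[Int])] =
    Seq("a" -> List(1, 2), "b" -> Nil, "c" -> List(3))

  // Keep only the values, then flatten them with an identity flatMap.
  // Keys with an empty list contribute no elements.
  val flattened: Seq[Int] = reducedByKey.map(_._2).flatMap(identity)
}
```

On a pair RDD the equivalent would presumably be `rdd.reduceByKey(_ ++ _).values.flatMap(identity)`.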