
Generate multiple rows from single row in Spark

I have some data in Cassandra with the following data model:

transaction_id : uuid
start_date: timestamp
end_date: timestamp
PRIMARY KEY(transaction_id)

Now I want to transform this data into something like:

aggregation_date : timestamp
number_of_active_transaction_0 : int
number_of_active_transaction_1 : int
number_of_active_transaction_2 : int
...
number_of_active_transaction_23 : int
PRIMARY KEY((aggregation_date))

Currently I have created a function which takes the start and end dates and returns a tuple of the transaction_date (just the date part) and a 24-element array holding a 1 for each hour the transaction was active. I then map the original RDD into a PairRDD with the transaction_date as key and the array as value. After this I perform a reduce by key, summing the individual elements of the arrays to get the desired output.
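The described function can be sketched as follows in plain Scala (the names `activeHours` and the element-wise sum are assumptions, not the asker's actual code; in Spark the last comment's `reduceByKey` would run on the PairRDD):

```scala
import java.time.{LocalDate, LocalDateTime}

// Hypothetical sketch: for a transaction contained within one day, return
// the date plus a 24-slot array with 1 for each hour it was active.
def activeHours(start: LocalDateTime, end: LocalDateTime): (LocalDate, Array[Int]) = {
  val buckets = Array.fill(24)(0)
  (start.getHour to end.getHour).foreach(h => buckets(h) = 1)
  (start.toLocalDate, buckets)
}

// The per-day arrays would then be summed element-wise during the reduce,
// e.g. in Spark: pairRdd.reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })
```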

Now the problem is that there are instances where a transaction starts late in the night and completes after midnight. In such cases I want to return 2 rows from my function, so that for every such transaction I get 2 rows in the returned RDD.

Spark version: 1.2.2
API used: Scala
Spark Cassandra connector version: 1.2.2

You will likely want to use flatMap; with flatMap you can output multiple (including zero) elements for each input.
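A minimal sketch of that approach, assuming a helper named `explodeByDay` (not from the original post): emit one (date, 24-hour activity array) pair per calendar day the transaction touches, so a transaction crossing midnight yields two rows. In Spark this function would be the argument to `rdd.flatMap`.

```scala
import java.time.{LocalDate, LocalDateTime}

// One output pair per day the transaction spans; hours outside the
// transaction's span on a given day stay 0.
def explodeByDay(start: LocalDateTime, end: LocalDateTime): Seq[(LocalDate, Array[Int])] =
  Iterator.iterate(start.toLocalDate)(_.plusDays(1))
    .takeWhile(!_.isAfter(end.toLocalDate))
    .map { day =>
      val firstHour = if (day == start.toLocalDate) start.getHour else 0
      val lastHour  = if (day == end.toLocalDate) end.getHour else 23
      val buckets   = Array.fill(24)(0)
      (firstHour to lastHour).foreach(h => buckets(h) = 1)
      (day, buckets)
    }
    .toSeq

// With an RDD of (start, end) timestamps this would be used as:
//   rdd.flatMap { case (s, e) => explodeByDay(s, e) }
```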

However, you also mention you are performing a reduce on the key. If it is during this phase that you need to output multiple elements, you can produce a list during the reduceByKey and then do an identity flatMap afterwards, which will flatten all the results.
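The list-then-flatten pattern can be illustrated on plain Scala collections (the data and names here are made up; the Spark equivalent is in the comment):

```scala
// Group values into a List per key, standing in for a reduceByKey whose
// reduce function concatenates lists instead of producing a single value.
val byKey: Map[String, List[Int]] =
  Seq("a" -> 1, "a" -> 2, "b" -> 3)
    .groupBy(_._1)
    .map { case (k, pairs) => k -> pairs.map(_._2).toList }

// The identity flatMap then turns each (key, List) back into flat rows.
// In Spark: rdd.mapValues(List(_)).reduceByKey(_ ++ _)
//              .flatMap { case (k, vs) => vs.map(k -> _) }
val flattened: List[(String, Int)] =
  byKey.toList.flatMap { case (k, vs) => vs.map(k -> _) }
```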


