
Aggregate S3 data for Spark operation

I have data in S3 that's being written there with a directory structure as follows: YYYY/MM/DD/HH

I am trying to write a program which will take in a start and end date, aggregate the data between those dates, and then convert it to a Spark RDD to perform operations on.

Is there a clean way to do this aggregation without iterating over every combination of YYYY/MM/DD/HH and without having to walk the tree to find where it's safe to use wildcards?

I.e., if the input is start=2014/01/01/00 and end=2016/02/01/00, do I have to access each bucket individually (i.e. 2014/*/*/*, 2015/*/*/*, 2016/1/*/*)?

Are you using DataFrames? If that is the case, you can leverage the API for loading data from multiple sources. Take a look at this documentation: https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.DataFrameReader

def load(paths: String*): DataFrame

The above method supports multiple source paths.
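
Below is a minimal sketch of how this could look. It assumes Spark 2.x's SparkSession (on 1.6 you would call sqlContext.read instead), JSON-formatted data, and a placeholder bucket name "my-bucket"; the hourlyPaths helper is hypothetical and simply enumerates one prefix per hour between the start and end timestamps:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

import org.apache.spark.sql.{DataFrame, SparkSession}

object S3RangeLoad {
  // Build one S3 prefix per hour between start and end (inclusive),
  // following the YYYY/MM/DD/HH layout from the question.
  def hourlyPaths(bucket: String, start: LocalDateTime, end: LocalDateTime): Seq[String] = {
    val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
    Iterator.iterate(start)(_.plusHours(1))
      .takeWhile(!_.isAfter(end))
      .map(ts => s"s3a://$bucket/${ts.format(fmt)}")
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-range-load").getOrCreate()

    // Placeholder bucket name and the date range from the question.
    val paths = hourlyPaths(
      "my-bucket",
      LocalDateTime.of(2014, 1, 1, 0, 0),
      LocalDateTime.of(2016, 2, 1, 0, 0))

    // load(paths: String*) reads all the given prefixes in a single call.
    // "json" is an assumption -- use whatever format the data is actually stored in.
    val df: DataFrame = spark.read.format("json").load(paths: _*)

    // Drop down to an RDD if you need low-level operations, as in the question.
    val rdd = df.rdd
    println(rdd.count())
  }
}
```

Passing an explicit list of prefixes this way sidesteps the wildcard question entirely: Spark reads every path it is given in one job, so there is no need to work out where wildcards are safe.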

The discussion in this Stack Overflow thread might also be helpful: Reading multiple files from S3 in Spark by date period
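
If you would rather stay at the RDD level, as the question mentions, SparkContext.textFile also accepts a comma-separated list of paths, so the same generated prefixes can be joined and passed in one call. A rough sketch, reusing the hypothetical hourlyPaths helper and placeholder bucket from the previous example (in the spark-shell you would use the existing sc instead of building a new context):

```scala
import java.time.LocalDateTime
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-range-rdd"))

// Reuses the hypothetical hourlyPaths helper and placeholder bucket name above.
val paths = S3RangeLoad.hourlyPaths(
  "my-bucket",
  LocalDateTime.of(2014, 1, 1, 0, 0),
  LocalDateTime.of(2016, 2, 1, 0, 0))

// textFile accepts a comma-separated list of paths (wildcards also work here).
val lines = sc.textFile(paths.mkString(","))
println(lines.count())
```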
