
Aggregate S3 data for Spark operation

I have data in S3 that's being written there with a directory structure as follows: YYYY/MM/DD/HH

I am trying to write a program which will take in a start and end date, aggregate the data between those dates, and then convert it to a Spark RDD to perform operations on.

Is there a clean way to do this aggregation without iterating over every combination of YYYY/MM/DD/HH and without having to walk the tree to find where it's safe to use wildcards?

I.e., if the input is start=2014/01/01/00 and end=2016/02/01/00, do I have to access each bucket individually (i.e. 2014/*/*/*, 2015/*/*/*, 2016/1/*/*)?

Are you using DataFrames? If that is the case, you can leverage the API for loading data from multiple sources. Take a look at this documentation: https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.DataFrameReader

def load(paths: String*): DataFrame

The above method supports multiple source paths.
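
Below is a minimal sketch of how this could look. It assumes Spark 2.x's SparkSession (on 1.6 you would call sqlContext.read instead), JSON-formatted data, and a placeholder bucket name "my-bucket"; the hourlyPaths helper is hypothetical and simply enumerates one prefix per hour between the start and end timestamps:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

import org.apache.spark.sql.{DataFrame, SparkSession}

object S3RangeLoad {
  // Build one S3 prefix per hour between start and end (inclusive),
  // following the YYYY/MM/DD/HH layout from the question.
  def hourlyPaths(bucket: String, start: LocalDateTime, end: LocalDateTime): Seq[String] = {
    val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
    Iterator.iterate(start)(_.plusHours(1))
      .takeWhile(!_.isAfter(end))
      .map(ts => s"s3a://$bucket/${ts.format(fmt)}")
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-range-load").getOrCreate()

    // Placeholder bucket name and the date range from the question.
    val paths = hourlyPaths(
      "my-bucket",
      LocalDateTime.of(2014, 1, 1, 0, 0),
      LocalDateTime.of(2016, 2, 1, 0, 0))

    // load(paths: String*) reads all the given prefixes in a single call.
    // "json" is an assumption -- use whatever format the data is actually stored in.
    val df: DataFrame = spark.read.format("json").load(paths: _*)

    // Drop down to an RDD if you need low-level operations, as in the question.
    val rdd = df.rdd
    println(rdd.count())
  }
}
```

Passing an explicit list of prefixes this way sidesteps the wildcard question entirely: Spark reads every path it is given in one job, so there is no need to work out where wildcards are safe.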

The discussion in this Stack Overflow thread might also be helpful: Reading multiple files from S3 in Spark by date period
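
If you would rather stay at the RDD level, as the question mentions, SparkContext.textFile also accepts a comma-separated list of paths, so the same generated prefixes can be joined and passed in one call. A rough sketch, reusing the hypothetical hourlyPaths helper and placeholder bucket from the previous example (in the spark-shell you would use the existing sc instead of building a new context):

```scala
import java.time.LocalDateTime
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-range-rdd"))

// Reuses the hypothetical hourlyPaths helper and placeholder bucket name above.
val paths = S3RangeLoad.hourlyPaths(
  "my-bucket",
  LocalDateTime.of(2014, 1, 1, 0, 0),
  LocalDateTime.of(2016, 2, 1, 0, 0))

// textFile accepts a comma-separated list of paths (wildcards also work here).
val lines = sc.textFile(paths.mkString(","))
println(lines.count())
```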
