
How to read only the latest 7 days of CSV files from an S3 bucket

I am trying to figure out how to read only the latest 7 days of files from a folder in an S3 bucket using Spark Scala.

The directory structure is as follows.

Assume that for today's date (Date_1) we have 2 clients, each with 1 CSV file:

Source/Date_1/Client_1/sample_1.csv

Source/Date_1/Client_2/sample_1.csv

Tomorrow a new folder will be created, and we will get the following:

Source/Date_2/Client_1/sample_1.csv

Source/Date_2/Client_2/sample_1.csv

Source/Date_2/Client_3/sample_1.csv

Source/Date_2/Client_4/sample_1.csv

NOTE: we expect data for new clients to be added on any date.

Likewise, on the 7th day we could have:

Source/Date_7/Client_1/sample_1.csv

Source/Date_7/Client_2/sample_1.csv

Source/Date_7/Client_3/sample_1.csv

Source/Date_7/Client_4/sample_1.csv

So when the 8th day's data arrives, we need to exclude the Date_1 folder from the read.

How can we do this while reading the CSV files from the S3 bucket using Spark Scala? I am currently reading the whole "source/*" folder so that we do not miss any client that is added on any day.

There are various ways to do this. One of them is to extract the date from the file path and filter on it so that only the last 7 days are kept.
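The idea of pulling the date out of the path and comparing it with a cutoff can be sketched in plain Python, without Spark. This is only an illustration: the helper names `date_key` and `keep_recent` are hypothetical, and it assumes the date folders are named in yyyyMMdd form (for example `20240105`), which the question does not state.

```python
from datetime import date, timedelta


def date_key(path, src_prefix):
    """Return the first path segment after src_prefix as an integer.

    Assumes that segment is a date folder named yyyyMMdd.
    """
    relative = path[len(src_prefix):] if path.startswith(src_prefix) else path
    return int(relative.split("/")[0])


def keep_recent(paths, src_prefix, today, days=7):
    """Keep the paths whose date folder is on or after (today - days)."""
    cutoff = int((today - timedelta(days=days)).strftime("%Y%m%d"))
    return [p for p in paths if date_key(p, src_prefix) >= cutoff]


# Made-up sample paths that follow the layout from the question
paths = [
    "Source/20240101/Client_1/sample_1.csv",
    "Source/20240105/Client_2/sample_1.csv",
    "Source/20240108/Client_3/sample_1.csv",
]
recent = keep_recent(paths, "Source/", today=date(2024, 1, 9))
```

With `today` fixed at 2024-01-09, the cutoff is 20240102, so the file under `20240101` is dropped and the other two are kept.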

Below is a code snippet for PySpark; the same can be implemented in Spark with Scala.

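>>> from datetime import datetime, timedelta
>>> from pyspark.sql.functions import col, input_file_name, lit, regexp_replace, split

>>> # Cutoff: the date 7 days before today, as a yyyyMMdd integer
>>> lastDate = datetime.now() - timedelta(days=7)
>>> lastDate = int(lastDate.strftime('%Y%m%d'))

>>> # Source path; the date folders are assumed to be named yyyyMMdd
>>> srcPath = "s3://<bucket-name>/.../Source/"
>>> # Read every client folder under every date folder, then take the first
>>> # path segment after srcPath as the Date column
>>> df1 = spark.read.option("header", "true").csv(srcPath + "*/*").withColumn("Date", split(regexp_replace(input_file_name(), srcPath, ""), "/")[0].cast("long"))
>>> # Keep only rows from files whose date folder is on or after the cutoff
>>> df2 = df1.filter(col("Date") >= lit(lastDate))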

A few things might change in your final implementation: the index value [0] might differ if the path structure is different, and the condition >= could be > depending on the requirement.
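A different technique from the filter approach is to build the list of date folders up front and read only those paths. The sketch below is a rough illustration, not part of the original answer: `recent_date_globs` is a hypothetical helper, and it again assumes yyyyMMdd folder names. It also makes the window boundary explicit, since `range(days)` yields exactly `days` calendar dates with today included.

```python
from datetime import date, timedelta


def recent_date_globs(src_path, today, days=7):
    """Build one glob per date folder for the `days` most recent dates.

    Today is included, so days=7 covers today and the 6 days before it.
    Assumes the date folders are named yyyyMMdd.
    """
    return [
        f"{src_path}{(today - timedelta(days=offset)).strftime('%Y%m%d')}/*"
        for offset in range(days)
    ]


globs = recent_date_globs("Source/", today=date(2024, 1, 9))
# globs[0] is the glob for today and globs[-1] is the oldest date in the window.
# Hypothetical use in PySpark, which accepts a list of paths:
#   df = spark.read.option("header", "true").csv(globs)
```

The `/*` after each date matches every client folder, so newly added clients are still picked up. A date folder that does not exist in the bucket may cause the read to fail, so the list may need to be checked against the bucket first.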
