I am trying to figure out how to read only the last 7 days' files from a folder in an S3 bucket using Spark with Scala.
Directory structure:
Assume for today's date (Date_1) we have 2 clients, each with one CSV file:
Source/Date_1/Client_1/sample_1.csv
Source/Date_1/Client_2/sample_1.csv
Tomorrow a new folder will be generated and we will get the following:
Source/Date_2/Client_1/sample_1.csv
Source/Date_2/Client_2/sample_1.csv
Source/Date_2/Client_3/sample_1.csv
Source/Date_2/Client_4/sample_1.csv
NOTE: we expect new client data to be added on any date.
Likewise, on the 7th day we can have:
Source/Date_7/Client_1/sample_1.csv
Source/Date_7/Client_2/sample_1.csv
Source/Date_7/Client_3/sample_1.csv
Source/Date_7/Client_4/sample_1.csv
So, when we get the 8th day's data, we need to exclude the Date_1 folder from the read.
How can we do this while reading the CSV files from the S3 bucket using Spark with Scala? I am currently reading the whole "Source/*" folder so that we don't miss any client that gets added on any day.
There are various ways to do it. One of them is shown below: extract the date from the file path and filter on the last 7 days.
Below is a code snippet for PySpark; the same approach can be implemented in Spark with Scala.
from datetime import datetime, timedelta
from pyspark.sql.functions import col, input_file_name, lit, regexp_replace, split

# Cut-off: the date 7 days before today, as a yyyyMMdd number
lastDate = datetime.now() + timedelta(days=-7)
lastDate = int(lastDate.strftime('%Y%m%d'))

# Source path
srcPath = "s3://<bucket-name>/.../Source/"

# Read every date/client folder, then derive the Date column from the file path
df1 = (spark.read.option("header", "true").csv(srcPath + "*/*")
       .withColumn("Date", split(regexp_replace(input_file_name(), srcPath, ""), "/")[0].cast("long")))

# Keep only the last 7 days
df2 = df1.filter(col("Date") >= lit(lastDate))
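To make the path handling concrete, here is a plain-Python sketch of the same extract-and-filter logic, with no Spark required. The bucket name, folder dates, and the pinned "today" are made up for illustration; it assumes the date folders are named as yyyyMMdd numbers:

```python
from datetime import datetime, timedelta

# Hypothetical bucket/prefix, for illustration only
srcPath = "s3://my-bucket/Source/"

# Full object keys, as Spark's input_file_name() would return them
paths = [
    "s3://my-bucket/Source/20240101/Client_1/sample_1.csv",
    "s3://my-bucket/Source/20240108/Client_2/sample_1.csv",
    "s3://my-bucket/Source/20240110/Client_3/sample_1.csv",
]

# Mirror the Spark expression: strip the prefix, split on "/", take segment [0]
dates = [int(p.replace(srcPath, "").split("/")[0]) for p in paths]

# Cut-off computed the same way as above, with "today" pinned for a stable example
cutoff = int((datetime(2024, 1, 9) + timedelta(days=-7)).strftime("%Y%m%d"))  # 20240102

# Keep only dates on or after the cut-off
kept = [d for d in dates if d >= cutoff]
print(kept)  # [20240108, 20240110]
```

The first path segment after the prefix is the date folder, which is why the Spark snippet indexes the split result with [0].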
A few things might change in your final implementation: the index value [0] may differ if the path structure is different, and the >= condition can be > depending on the requirement. Also note that this approach assumes the date folder names are numeric (e.g. yyyyMMdd) so they can be cast to long and compared against the cut-off.
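Since the question asks for Spark with Scala, the same approach can be sketched there. This is an untested translation of the PySpark snippet, under the same assumption that the date folders are named yyyyMMdd; adjust srcPath and the split index to your actual layout:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{col, input_file_name, lit, regexp_replace, split}

// Cut-off: 7 days before today, as a yyyyMMdd number
val lastDate = LocalDate.now()
  .minusDays(7)
  .format(DateTimeFormatter.ofPattern("yyyyMMdd"))
  .toLong

// Source path -- replace with your bucket/prefix
val srcPath = "s3://<bucket-name>/.../Source/"

// Read every date/client folder, then derive the Date column from the file path
val df1 = spark.read
  .option("header", "true")
  .csv(srcPath + "*/*")
  .withColumn("Date",
    split(regexp_replace(input_file_name(), srcPath, ""), "/")(0).cast("long"))

// Keep only the last 7 days
val df2 = df1.filter(col("Date") >= lit(lastDate))
```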