
Reading the latest S3 file in Spark Structured Streaming

I have Spark Structured Streaming code that reads JSON files from an S3 bucket and writes them back to S3. Input file path format:

val inputPath = s3://<path>/2022-08-26

Output file path format:

val outputPath = s3://<path>/2022-08-26

Code:

val spark = SparkSession.builder().appName("raw_data").enableHiveSupport().getOrCreate()

val df = spark.readStream.option("startingPosition","earliest").schema(LogSchema).json(inputPath)


val query = df.
      writeStream.
      outputMode("append").
      partitionBy("day").
      format("parquet").
      option("path", "s3://<path>/raw_data/data/").
      option("checkpointLocation", "s3://<path>/raw_data/checkpoint/").
      trigger(Trigger.ProcessingTime("300 seconds")).
      start()

Issues I'm facing:

  1. We want to read only the latest files received in the S3 bucket, partitioned by the current day (not older ones).
  2. Files written to the S3 bucket should be partitioned by the current day.

Please help me resolve the above issues.

You can get this behaviour by using Hive on top of S3: create a Hive table whose location points to S3 and partition it by day/date. You can then read and write based on the Hive table partition; in your case that partition will be the current date.

You can refer to this article: https://medium.com/analytics-vidhya/hive-using-s3-and-scala-af5524302758
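A minimal sketch of the idea (the `dailyPath` helper, the `raw_data` table name, and the `payload` column are hypothetical; the `<path>` placeholders are kept from the question and must be replaced with real bucket paths):

```scala
import java.time.LocalDate

// Build the day-partitioned S3 path implied by the question's layout,
// e.g. s3://<path>/2022-08-26
def dailyPath(base: String, date: LocalDate): String = s"$base/$date"

// Today's partition, computed at job start
val today = LocalDate.now()
val inputPath = dailyPath("s3://<path>", today)

// Hive DDL (submit via spark.sql on a Hive-enabled SparkSession) that
// registers a day-partitioned external table over the S3 output location.
// Schema is an assumption -- adapt it to your JSON structure.
val ddl =
  """CREATE EXTERNAL TABLE IF NOT EXISTS raw_data (payload STRING)
    |PARTITIONED BY (day STRING)
    |STORED AS PARQUET
    |LOCATION 's3://<path>/raw_data/data/'""".stripMargin
```

With the table registered, reads can be restricted to the current partition with `spark.table("raw_data").where(s"day = '$today'")`, and the streaming write in the question already lays out files under `day=...` directories via `partitionBy("day")`, so new partitions only need `MSCK REPAIR TABLE raw_data` (or `ALTER TABLE ... ADD PARTITION`) to become visible to Hive.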
