
Reading the latest S3 file in Spark Structured Streaming

I have Spark Structured Streaming code that reads JSON files from an S3 bucket and writes them back to S3. Input file path format:

val inputPath = s3://<path>/2022-08-26

Output file path format:

val outputPath = s3://<path>/2022-08-26

Code:

val spark = SparkSession.builder().appName("raw_data").enableHiveSupport().getOrCreate()

val df = spark.readStream.option("startingPosition","earliest").schema(LogSchema).json(inputPath)


val query = df.
      writeStream.
      outputMode("append").
      partitionBy("day").
      format("parquet").
      option("path", "s3://<path>/raw_data/data/").
      option("checkpointLocation", "s3://<path>/raw_data/checkpoint/").
      trigger(Trigger.ProcessingTime("300 seconds")).
      start()

Issues I'm facing:

  1. We want to read only the latest files received in the S3 bucket, partitioned by the current day (not older ones).
  2. Files written to the S3 bucket should be partitioned by the current day.

Please help me resolve the above issues.

You can get this behaviour by using Hive on top of S3: create a Hive table whose location points to S3 and partition it by day/date. You can then read and write based on the Hive table partition; in your case that partition will be the current date.

You can refer to this article: https://medium.com/analytics-vidhya/hive-using-s3-and-scala-af5524302758
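A minimal sketch of the idea (the `dailyPath` helper, the `raw_data` table name, and the `payload` column are hypothetical; the `<path>` placeholders are kept from the question and must be replaced with real bucket paths):

```scala
import java.time.LocalDate

// Build the day-partitioned S3 path implied by the question's layout,
// e.g. s3://<path>/2022-08-26
def dailyPath(base: String, date: LocalDate): String = s"$base/$date"

// Today's partition, computed at job start
val today = LocalDate.now()
val inputPath = dailyPath("s3://<path>", today)

// Hive DDL (submit via spark.sql on a Hive-enabled SparkSession) that
// registers a day-partitioned external table over the S3 output location.
// Schema is an assumption -- adapt it to your JSON structure.
val ddl =
  """CREATE EXTERNAL TABLE IF NOT EXISTS raw_data (payload STRING)
    |PARTITIONED BY (day STRING)
    |STORED AS PARQUET
    |LOCATION 's3://<path>/raw_data/data/'""".stripMargin
```

With the table registered, reads can be restricted to the current partition with `spark.table("raw_data").where(s"day = '$today'")`, and the streaming write in the question already lays out files under `day=...` directories via `partitionBy("day")`, so new partitions only need `MSCK REPAIR TABLE raw_data` (or `ALTER TABLE ... ADD PARTITION`) to become visible to Hive.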
