
Always read latest folder from S3 bucket in Spark

Below is what my S3 bucket folder structure looks like:

s3://s3bucket/folder1/morefolders/$folder_which_I_want_to_pick_latest/

$folder_which_I_want_to_pick_latest - this folder carries an incrementing number for every new folder that comes in, like randomnumber_timestamp.

Is there a way I can automate this process by always reading the most recent folder in S3 from Spark in Scala?

The best way to work with that kind of "behavior" is to structure your data as a partitioned layout, like year=2020/month=02/day=12, where every partition is a folder (in the aws-console). That way you can use a simple filter in Spark to determine the latest one. (More info: https://www.datio.com/iaas/understanding-the-data-partitioning-technique/)
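A minimal sketch of that filter, assuming the data were rewritten under the partitioned layout above (the base path and the Parquet format are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("read-latest-partition").getOrCreate()

// Spark discovers year/month/day as partition columns from the folder names.
val df = spark.read.parquet("s3://s3bucket/folder1/morefolders/")

// Find the newest (year, month, day) triple among the partitions that exist.
val latest = df.select("year", "month", "day")
  .distinct()
  .orderBy(col("year").desc, col("month").desc, col("day").desc)
  .first()

// Keep only the rows from that partition; partition pruning means only the
// matching folder is actually read.
val latestDf = df.filter(
  col("year") === latest.get(0) &&
    col("month") === latest.get(1) &&
    col("day") === latest.get(2)
)
```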

However, if you are not allowed to re-structure your bucket, the solution could be costly if you don't have a specific identifier and/or reference you can use to calculate your newest folder. Remember that in S3 you don't have a concept of folders, only object keys (the / separators are what the AWS console visualizes as folders), so calculating the highest incremental id in $folder_which_I_want_to_pick_latest will eventually have to check all the objects stored under the prefix, and every object request in S3 costs. More info: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html.
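If you do have to list the bucket anyway, here is a sketch using the Hadoop FileSystem API that Spark already ships with; the assumption that the trailing part of randomnumber_timestamp is a numeric timestamp you can sort on is mine:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pick-latest-prefix").getOrCreate()

val base = new Path("s3://s3bucket/folder1/morefolders/")
val fs = base.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Each listStatus call translates into S3 LIST requests under the prefix;
// this is where the per-request cost mentioned above comes from.
// Assumes folder names look like <random>_<epochMillis>; throws if none exist.
val latest = fs.listStatus(base)
  .filter(_.isDirectory)
  .map(_.getPath)
  .maxBy(_.getName.split('_').last.toLong)

val df = spark.read.parquet(latest.toString)
```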

Here's one option. Consider writing a Lambda function that either runs on a schedule (say, if you know your uploads always happen between 1pm and 4pm) or is triggered by an S3 object upload (so it runs for every object uploaded to folder1/morefolders/).

The Lambda would write the relevant part(s) of the S3 object prefix into a simple DynamoDB table. The client that needs to know the latest prefix would read it from DynamoDB.
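A sketch of the read side in Scala with the AWS SDK v2; the table name latest_prefix, its key attribute id, and the prefix attribute are all hypothetical and would have to match whatever the Lambda writes:

```scala
import java.util.Collections
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, GetItemRequest}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-latest-from-ddb").getOrCreate()

// Hypothetical table "latest_prefix": one item per watched location, keyed by
// "id", with the Lambda overwriting the "prefix" attribute on every upload.
val dynamo = DynamoDbClient.create()
val request = GetItemRequest.builder()
  .tableName("latest_prefix")
  .key(Collections.singletonMap("id", AttributeValue.builder().s("folder1").build()))
  .build()

val prefix = dynamo.getItem(request).item().get("prefix").s()

// One cheap GetItem instead of listing the whole bucket.
val df = spark.read.parquet(s"s3://s3bucket/folder1/morefolders/$prefix/")
```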
