
Thread safe file rename in Amazon Web Services S3

I need to move/rename an object in AWS S3 storage.

All the solutions I have found require a copy followed by a delete. However, this leaves a short time where both files exist, which I do not believe would be thread safe.

Is there a way to do this in a thread safe manner?

The code is in Scala using the Java AWS SDK.
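For context, the copy-then-delete "rename" the question describes can be sketched as follows. This is a minimal model, with a plain map standing in for an S3 bucket (the key names are illustrative); with the real SDK the two marked lines would be `copyObject` and `deleteObject` calls, and the point is the window between them where both keys exist.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of an S3 bucket: S3 has no rename operation, so a "rename"
// is a copy followed by a delete. Between the two calls, both keys exist.
public class CopyThenDelete {
    static final Map<String, byte[]> bucket = new HashMap<>();

    static void rename(String src, String dst) {
        bucket.put(dst, bucket.get(src)); // copy: both keys now exist
        bucket.remove(src);               // delete: the window closes here
    }

    public static void main(String[] args) {
        bucket.put("incoming/a.csv", "data".getBytes());
        rename("incoming/a.csv", "processing/a.csv");
        System.out.println(bucket.containsKey("processing/a.csv")); // true
        System.out.println(bucket.containsKey("incoming/a.csv"));   // false
    }
}
```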

EDIT: Rob, thanks for the reply. I believe I understand what that code is doing, but it makes me think I asked the wrong question.

Rather than asking about specific AWS functionality, let me describe what I am trying to accomplish.

I have an S3 directory that is regularly receiving files from an outside source. I have multiple processes that need to 'process' those files, and each file should only be processed once.

In the past, as a cheap way of handling this, I've used a rename to either move the file or mark it as processing. If the rename succeeded, then the process knew it 'owned' the file and would continue processing. If it failed because the source file did not exist, then it would try the next file in the directory.
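The reason this trick works on a filesystem is that rename is an atomic check-and-set: exactly one competitor wins. That guarantee can be modeled in a few lines (the worker names are illustrative), which is what S3's copy+delete fails to provide:

```java
import java.util.concurrent.ConcurrentHashMap;

// The "claim by rename" pattern depends on exactly one competitor winning.
// putIfAbsent is the same atomic check-and-set guarantee in miniature:
// the first worker to claim a key wins, and every later attempt fails.
public class ClaimByRename {
    static final ConcurrentHashMap<String, String> claims = new ConcurrentHashMap<>();

    // Returns true iff this worker successfully claimed the file.
    static boolean tryClaim(String file, String worker) {
        return claims.putIfAbsent(file, worker) == null;
    }

    public static void main(String[] args) {
        System.out.println(tryClaim("a.csv", "w1")); // true: first claim wins
        System.out.println(tryClaim("a.csv", "w2")); // false: already owned
    }
}
```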

What I need is a way, preferably using S3 only, that will allow multiple processes to work on the files while ensuring that each file is processed only once.

In your solution below, since 'find' and 'delete' are separate operations, and delete does not fail if the file does not exist, I'm not sure I see what prevents two processes from (in the worst case) completing every step in lockstep with each other.

File moving may be the wrong solution, and my inexperience with AWS may be preventing me from seeing a better way to accomplish this task.

In the past, as a cheap way of handling this, I've used a rename to either move the file or mark it as processing. If the rename succeeded, then the process knew it 'owned' the file and would continue processing. If it failed because the source file did not exist, then it would try the next file in the directory.

Let me start by pointing out that this technique of using an atomic rename for a thread to acquire exclusive access to process a file works, but it does risk leaving a file unprocessed. Imagine what happens if the thread (or whole server) dies right after the rename. Without a robust way to keep track of which files are not yet complete, and a way to retry them, your system will not be very resilient.

As you note, S3 does not have an atomic rename operation, so your usual technique doesn't work as you desire.

S3 has a nice "notification" feature that can be configured. In your case, you probably want to be notified when a file is created. Notifications can be delivered to SNS, SQS or Lambda; you probably want either SQS or Lambda. With SQS, a message gets added to a queue, which a thread can grab in order to process the file. The SQS model guarantees delivery "at least once" and will retry delivery until the message is deleted (or ages out of the queue). The redeliver-if-not-deleted time is configurable.

Note that it is possible for SQS to deliver the same message multiple times - it errs on the side of over-delivering rather than not delivering a message. If it is OK to double-process a file on a very infrequent basis, then this probably works fine for you. We make extensive use of SQS queues and are happy.
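If even infrequent double-processing is unacceptable, the consumer itself has to be idempotent. A minimal sketch of that, with a plain queue standing in for SQS (a real deployment would use `receiveMessage`/`deleteMessage` and a shared, durable record of processed keys rather than an in-memory set):

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// SQS is at-least-once, so the same notification can arrive twice.
// An idempotent consumer skips keys it has already processed; here the
// "already processed" record is an in-memory set for illustration.
public class IdempotentConsumer {
    // Drains the queue and returns how many keys were actually processed.
    static int drain(Queue<String> queue, Set<String> seen) {
        int processed = 0;
        while (!queue.isEmpty()) {
            String key = queue.poll();
            if (seen.add(key)) { // add() returns false for a duplicate
                processed++;     // real work (fetch + process) goes here
            }
            // with real SQS, deleteMessage would be called here
        }
        return processed;
    }

    public static void main(String[] args) {
        // "incoming/a.csv" is delivered twice, as SQS is allowed to do
        Queue<String> q = new ArrayDeque<>(List.of(
            "incoming/a.csv", "incoming/b.csv", "incoming/a.csv"));
        System.out.println(drain(q, new HashSet<>())); // prints 2
    }
}
```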

I am not familiar with the detailed semantics of the Lambda message processing.

I suggest that you google "S3 Event Notifications" for more details.

Original answer to the original question:

I am not sure the issue is "thread safety" - perhaps more "transactional integrity"?

In any case, you are correct that doing an S3 "atomic" rename is not obvious. I think you have to "pick your poison" - either you have to deal with the fact that 1) you have the old and new copies at the same time, or 2) you have a period of time where you have neither the old nor the new copy.

In either case, a key issue you need to deal with is persisting the fact that you are doing the rename (until the rename is confirmed to be complete). If you have a row in some database that represents the file, then you can persist the state there. The following assumes that you don't want to use anything other than S3 to persist state.

You are going to actually copy the file twice, using a temporary folder for the intermediate copy. You can have separate threads doing each step (looking for files to work on), or a single thread that checks the various conditions and does the remaining steps. In other words, you need to look for renames that were partially done (but that thread failed to finish) and pick up where it left off.

For this example, we are going to rename from A to B and use a temporary folder called tmp.

If you prefer briefly having both copies:

1. Copy A to tmp/A-B (the file name has before and after names in it).
2. Finding tmp/A-B: copy it to B.
3. Finding tmp/A-B, A and B: delete A.
4. Finding tmp/A-B, A is missing and B exists: delete tmp/A-B.
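The four steps above can be sketched as a resumable state machine. This is a minimal model with a map standing in for the bucket (not real SDK calls): each call to `step()` inspects the current state and performs at most the next action, so any process that finds a leftover tmp/A-B marker can finish a rename that another process started and abandoned.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "briefly both copies" protocol against an in-memory map
// standing in for S3. A crashed rename is resumed by calling step() until
// the tmp/A-B marker is gone.
public class RenameBothCopies {
    static void step(Map<String, byte[]> s3, String a, String b) {
        String tmp = "tmp/" + a + "-" + b;
        if (!s3.containsKey(tmp) && s3.containsKey(a)) {
            s3.put(tmp, s3.get(a));   // 1. copy A to tmp/A-B
        } else if (s3.containsKey(tmp) && !s3.containsKey(b)) {
            s3.put(b, s3.get(tmp));   // 2. finding tmp/A-B: copy it to B
        } else if (s3.containsKey(tmp) && s3.containsKey(a) && s3.containsKey(b)) {
            s3.remove(a);             // 3. finding tmp/A-B, A and B: delete A
        } else if (s3.containsKey(tmp) && !s3.containsKey(a) && s3.containsKey(b)) {
            s3.remove(tmp);           // 4. finding tmp/A-B, no A, B exists: delete tmp/A-B
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> s3 = new HashMap<>();
        s3.put("A", "data".getBytes());
        for (int i = 0; i < 4; i++) step(s3, "A", "B"); // four steps finish the rename
        System.out.println(s3.keySet()); // only B remains
    }
}
```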

If you prefer briefly having neither copy:

1. Copy A to tmp/A-B.
2. Finding tmp/A-B and A: delete A.
3. Finding tmp/A-B and A is missing and B is missing: copy tmp/A-B to B.
4. Finding tmp/A-B and A is missing and B exists: delete tmp/A-B.
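The complementary variant differs only in step order: A is deleted before B is written, so there is a window with neither copy visible, but never both. The same in-memory sketch, with only the middle branches changed:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "briefly neither copy" protocol: delete A before
// writing B, so the two names are never visible at the same time.
public class RenameNeitherCopy {
    static void step(Map<String, byte[]> s3, String a, String b) {
        String tmp = "tmp/" + a + "-" + b;
        if (!s3.containsKey(tmp) && s3.containsKey(a)) {
            s3.put(tmp, s3.get(a));   // 1. copy A to tmp/A-B
        } else if (s3.containsKey(tmp) && s3.containsKey(a)) {
            s3.remove(a);             // 2. finding tmp/A-B and A: delete A
        } else if (s3.containsKey(tmp) && !s3.containsKey(a) && !s3.containsKey(b)) {
            s3.put(b, s3.get(tmp));   // 3. finding tmp/A-B, no A, no B: copy tmp/A-B to B
        } else if (s3.containsKey(tmp) && !s3.containsKey(a) && s3.containsKey(b)) {
            s3.remove(tmp);           // 4. finding tmp/A-B, no A, B exists: delete tmp/A-B
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> s3 = new HashMap<>();
        s3.put("A", "data".getBytes());
        for (int i = 0; i < 4; i++) step(s3, "A", "B");
        System.out.println(s3.keySet()); // only B remains
    }
}
```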

No, the S3 API does not support this.
