[英]Parallel batch file download from Amazon S3 using AWS S3 SDK for .NET
Problem: I would like to download 100 files in parallel from AWS S3 using their .NET SDK. 问题:我想使用他们的.NET SDK从AWS S3并行下载100个文件。 The downloaded content should be stored in 100 memory streams (the files are small enough, and I can take it from there). 下载的内容应存储在100个内存流中(文件足够小,我可以从那里获取)。 I am geting confused between Task, IAsyncResult, Parallel.*, and other different approaches in .NET 4.0. 我在Task,IAsyncResult,Parallel。*和.NET 4.0中的其他不同方法之间感到困惑。
If I try to solve the problem myself , off the top of my head I imagine something like this pseudocode: (edited to add types to some variables) 如果我试图自己解决这个问题 ,我会想象这样的伪代码:(编辑为某些变量添加类型)
using Amazon;
using Amazon.S3;
using Amazon.S3.Model;
AmazonS3 _s3 = ...;
IEnumerable<GetObjectRequest> requestObjects = ...;
// Prepare to launch requests
var asyncRequests = from rq in requestObjects
select _s3.BeginGetObject(rq,null,null);
// Launch requests
var asyncRequestsLaunched = asyncRequests.ToList();
// Prepare to finish requests
var responses = from rq in asyncRequestsLaunched
select _s3.EndGetRequest(rq);
// Finish requests
var actualResponses = responses.ToList();
// Fetch data
var data = actualResponses.Select(rp => {
var ms = new MemoryStream();
rp.ResponseStream.CopyTo(ms);
return ms;
});
This code launches 100 requests in parallel, which is good. 此代码并行启动100个请求,这很好。 However, there are two problems: 但是,有两个问题:
So here I start thinking I am heading down the wrong path... 所以在这里我开始想我正在走错路......
Help? 救命?
It's probably easier if you break the operation down into a method that will handle one request asynchronously and then call it 100 times. 如果将操作分解为一个异步处理一个请求然后再调用100次的方法,则可能更容易。
To start, let's identify the final result you want. 首先,让我们确定您想要的最终结果。 Since what you'll be working with is a MemoryStream
it means that you'll want to return a Task<MemoryStream>
from your method. 因为您将使用的是MemoryStream
这意味着您将要从您的方法返回Task<MemoryStream>
。 The signature will look something like this: 签名看起来像这样:
static Task<MemoryStream> GetMemoryStreamAsync(AmazonS3 s3,
GetObjectRequest request)
Because your AmazonS3
object implements the Asynchronous Design Pattern , you can use the FromAsync
method on the TaskFactory
class to generate a Task<T>
from a class that implements the Asynchronous Design Pattern, like so: 因为您的AmazonS3
对象实现了异步设计模式 ,所以您可以使用TaskFactory
类上的FromAsync
方法从实现异步设计模式的类生成Task<T>
,如下所示:
static Task<MemoryStream> GetMemoryStreamAsync(AmazonS3 s3,
GetObjectRequest request)
{
Task<GetObjectResponse> response =
Task.Factory.FromAsync<GetObjectRequest,GetObjectResponse>(
s3.BeginGetObject, s3.EndGetObject, request, null);
// But what goes here?
So you're already in a good place, you have a Task<T>
which you can wait on or get a callback on when the call completes. 所以你已经处在一个好地方,你有一个Task<T>
,你可以等待,或者在通话结束时收到回叫。 However, you need to somehow translate the GetObjectResponse
returned from the call to Task<GetObjectResponse>
into a MemoryStream
. 但是,你需要以某种方式翻译GetObjectResponse
从调用返回的Task<GetObjectResponse>
成MemoryStream
。
To that end, you want to use the ContinueWith
method on the Task<T>
class. 为此,您希望在Task<T>
类上使用ContinueWith
方法 。 Think of it as the asynchronous version of the Select
method on the Enumerable
class , it's just a projection into another Task<T>
except that each time you call ContinueWith
, you are potentially creating a new Task that runs that section of code. 可以把它想象成Enumerable
类的Select
方法的异步版本,它只是对另一个Task<T>
的投影,除了每次调用ContinueWith
,你可能会创建一个运行该段代码的新任务。
With that, your method looks like the following: 有了它,您的方法如下所示:
static Task<MemoryStream> GetMemoryStreamAsync(AmazonS3 s3,
GetObjectRequest request)
{
// Start the task of downloading.
Task<GetObjectResponse> response =
Task.Factory.FromAsync<GetObjectRequest,GetObjectResponse>(
s3.BeginGetObject, s3.EndGetObject, request, null
);
// Translate.
Task<MemoryStream> translation = response.ContinueWith(t => {
using (Task<GetObjectResponse> resp = t ){
var ms = new MemoryStream();
t.Result.ResponseStream.CopyTo(ms);
return ms;
}
});
// Return the full task chain.
return translation;
}
Note that in the above you can possibly call the overload of ContinueWith
passing TaskContinuationOptions.ExecuteSynchronously
, as it appears you are doing minimal work (I can't tell, the responses might be huge ). 请注意,在上面你可以调用ContinueWith
的重载来传递TaskContinuationOptions.ExecuteSynchronously
,因为看起来你做的工作很少(我不知道,响应可能很大 )。 In the cases where you are doing very minimal work where it would be detrimental to start a new task in order to complete the work, you should pass TaskContinuationOptions.ExecuteSynchronously
so that you don't waste time creating new tasks for minimal operations. 如果您正在进行非常小的工作而不必为了完成工作而启动新任务,则应该传递TaskContinuationOptions.ExecuteSynchronously
这样您就不会浪费时间为最少的操作创建新任务。
Now that you have the method that can translate one request into a Task<MemoryStream>
, creating a wrapper that will process any number of them is simple: 现在您已经拥有了可以将一个请求转换为Task<MemoryStream>
,创建一个可以处理任意数量的包装器的包装器很简单:
static Task<MemoryStream>[] GetMemoryStreamsAsync(AmazonS3 s3,
IEnumerable<GetObjectRequest> requests)
{
// Just call Select on the requests, passing our translation into
// a Task<MemoryStream>.
// Also, materialize here, so that the tasks are "hot" when
// returned.
return requests.Select(r => GetMemoryStreamAsync(s3, r)).
ToArray();
}
In the above, you simply take a sequence of your GetObjectRequest
instances and it will return an array of Task<MemoryStream>
. 在上面,您只需获取一系列GetObjectRequest
实例,它将返回一个Task<MemoryStream>
数组。 The fact that it returns a materialized sequence is important. 返回物化序列的事实很重要。 If you don't materialize it before returning, then the tasks will not be created until the sequence is iterated through. 如果在返回之前没有实现它,则在迭代序列之前不会创建任务。
Of course, if you want this behavior, then by all means, just remove the call to .ToArray()
, have the method return IEnumerable<Task<MemoryStream>>
and then the requests will be made as you iterate through the tasks. 当然,如果你想要这种行为,那么无论如何,只需删除对.ToArray()
的调用,让方法返回IEnumerable<Task<MemoryStream>>
,然后在迭代完成任务时进行请求。
From there, you can process them one at a time (using the Task.WaitAny
method in a loop) or wait for all of them to be completed (by calling the Task.WaitAll
method ). 从那里,您可以一次处理一个(在循环中使用Task.WaitAny
方法 )或等待所有这些完成(通过调用Task.WaitAll
方法 )。 An example of the latter would be: 后者的一个例子是:
static IList<MemoryStream> GetMemoryStreams(AmazonS3 s3,
IEnumerable<GetObjectRequest> requests)
{
Task<MemoryStream>[] tasks = GetMemoryStreamsAsync(s3, requests);
Task.WaitAll(tasks);
return tasks.Select(t => t.Result).ToList();
}
Also, it should be mentioned that this is a pretty good fit for the Reactive Extensions framework , as this very well-suited towards an IObservable<T>
implementation. 此外,应该提到的是,这非常适合Reactive Extensions框架 ,因为它非常适合IObservable<T>
实现。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.