
Use Amazon S3 and CloudFront for intelligently caching webpages

I have a website (running within Tomcat on Elastic Beanstalk) that generates artist discographies (a single page per artist). This can be resource intensive, and since the artist pages don't change over a month-long period, I put a CloudFront distribution in front of it.

I thought this would mean no artist request ever had to be served more than once by my server; however, it's not quite as good as that. This post explains that every edge location (Europe, US, etc.) will get a miss the first time it looks up the resource, and that there is a limit to how many resources are kept in the CloudFront cache, so they could be dropped.

So to counter this I have changed my server code to store a copy of the webpage in a bucket within S3 and to check this first when a request comes in. If the artist page already exists in S3, the server retrieves it and returns its contents as the webpage. This greatly reduces the processing, as it only constructs a webpage for a particular artist once.

However:

  1. The request still has to go to the server to check if the artist page exists.
  2. If the artist page exists, then the webpage (and they can sometimes be large, up to 20 MB) is first downloaded to the server, and then the server returns the page.

So I wanted to know if I could improve this. I know you can configure an S3 bucket as a redirect to another website. Is there a per-page way I could get the artist request to go to the S3 bucket and then have it return the page if it exists, or call the server if it does not?

Alternatively, could I get the server to check if the page exists and then redirect to the S3 page, rather than download the page to the server first?

OP says:

they can sometimes be large up-to 20mb

Since the volume of data you serve can be pretty large, I think it is feasible for you to do this in 2 requests instead of one, where you decouple the content generation from the content serving part. The reason to do this is to minimize the amount of time and resources the server spends fetching data from S3 and serving it.

AWS supports pre-signed URLs, which can be valid for a short amount of time; we can try using them here to avoid issues around security etc.
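To make the idea of a short-lived signed URL concrete, here is a minimal sketch using only the Python standard library. It is not the AWS signing algorithm: `SECRET_KEY`, `sign_url`, `is_valid` and the bucket URL are hypothetical names chosen for illustration, and in practice you would ask the AWS SDK to produce the pre-signed URL rather than roll your own.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical shared secret for this illustration; NOT an AWS credential.
SECRET_KEY = b"demo-secret"


def _signature(base_url: str, expires_at: int) -> str:
    """HMAC over the URL and its expiry, so neither can be altered."""
    payload = f"{base_url}|{expires_at}".encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def sign_url(base_url: str, expires_at: int) -> str:
    """Append an expiry timestamp (Unix seconds) and a signature to base_url."""
    query = urlencode({"expires": expires_at, "sig": _signature(base_url, expires_at)})
    return f"{base_url}?{query}"


def is_valid(signed_url: str, now: int) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(signed_url)
    query = parse_qs(parsed.query)
    if "expires" not in query or "sig" not in query:
        return False
    try:
        expires_at = int(query["expires"][0])
    except ValueError:
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, expires_at)
    same = hmac.compare_digest(expected.encode(), query["sig"][0].encode())
    return same and now <= expires_at
```

The point of the sketch is only that the expiry is part of what gets signed, so a client cannot extend the lifetime of a link or reuse the signature for another artist page.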

Currently, your architecture looks something like the diagram below: the client initiates a request, you check whether the requested data exists on S3 and, if so, fetch and serve it; otherwise you generate the content, save it to S3, and serve it:

                           if exists on S3
client --------> server --------------------> fetch from s3 and serve
                    |
                    |else
                    |------> generate content -------> save to S3 and serve

In terms of network resources, you always consume 2X the amount of bandwidth and time here. If the data exists, you first have to pull it from S3 and then serve it to the customer (so it is 2X). If the data doesn't exist, you send it to the customer and to S3 (so again it is 2X).
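The 2X figure can be written out as a small piece of arithmetic. The sketch below counts the bytes that pass through the server for one request, first in the current design and then in a design where the server hands back only an S3 URL; the function names and the 20,000,000-byte page size (roughly the 20 MB worst case from the question) are assumptions for illustration.

```python
PAGE_BYTES = 20_000_000  # roughly the 20 MB worst case mentioned in the question


def server_bytes_current(page_bytes: int, exists_on_s3: bool) -> int:
    """Bytes crossing the server's network interface in the current design."""
    if exists_on_s3:
        return page_bytes + page_bytes  # S3 -> server, then server -> client
    return page_bytes + page_bytes      # server -> client, plus server -> S3


def server_bytes_url_only(page_bytes: int, exists_on_s3: bool) -> int:
    """Bytes through the server when it returns only an S3 URL to the client."""
    if exists_on_s3:
        return 0       # only a short URL is returned; the client reads from S3
    return page_bytes  # a single upload of the generated page to S3


if __name__ == "__main__":
    for exists in (True, False):
        print(
            f"exists_on_s3={exists}: "
            f"current={server_bytes_current(PAGE_BYTES, exists)} bytes, "
            f"url_only={server_bytes_url_only(PAGE_BYTES, exists)} bytes"
        )
```

The size of the URL response itself is ignored here because it is negligible next to a multi-megabyte page.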


Instead, you can try the 2 approaches below. Both assume that you have some base template and that the remaining data can be fetched via AJAX calls, and both bring down that 2X factor in the overall architecture.

  1. Serve the content from S3 only. This calls for changes to the way your product is designed, and hence may not be that easy to integrate.

    Basically, for every incoming request, return the S3 URL for it if the data already exists; otherwise create a task for it in SQS, generate the data and push it to S3. Based on your usage patterns for different artists, you should have an estimate of how long it takes on average to pull the data together, so return a URL that becomes valid after the task's estimated_time_for_completion (T).

    The client waits for time T, and then makes the request to the URL returned earlier. It makes up to, say, 3 attempts to fetch this data in case of failure. In fact, the data already existing on S3 can be thought of as the base case where T = 0.

    In this case, you make 2-4 network requests from the client, but only the first of those requests comes to your server. You transmit the data to S3 once, and only when it doesn't already exist there, and the client always pulls from S3.

                               if exists on S3, return URL
      client --------> server --------------------------------> s3
                          |
                          |else SQS task
                          |---------------> generate content -------> save to S3
                          return pre-computed url

             wait for time `T`
      client -------------------------> s3
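The flow of this first approach can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: a dict plays the role of the S3 bucket, a deque plays the role of the SQS queue, and `handle_request`, `worker_step`, `client_fetch`, `ESTIMATED_SECONDS` and the bucket URL are hypothetical names, not AWS APIs.

```python
from collections import deque
from typing import Callable, Optional, Tuple

s3_objects = {}        # stand-in for the S3 bucket: URL -> page bytes
task_queue = deque()   # stand-in for the SQS queue of artist ids
ESTIMATED_SECONDS = 5.0  # hypothetical average generation time (the "T" above)


def url_for(artist_id: str) -> str:
    """Pre-computed URL where the artist page lives (or will live)."""
    return f"https://example-bucket.s3.amazonaws.com/artists/{artist_id}.html"


def handle_request(artist_id: str) -> Tuple[str, float]:
    """Server side: return (url, T). T is 0 when the page already exists."""
    url = url_for(artist_id)
    if url in s3_objects:
        return url, 0.0
    if artist_id not in task_queue:
        task_queue.append(artist_id)  # generation happens outside this request
    return url, ESTIMATED_SECONDS


def worker_step(generate: Callable[[str], bytes]) -> None:
    """Background worker: take one queued task, generate the page, store it."""
    if task_queue:
        artist_id = task_queue.popleft()
        s3_objects[url_for(artist_id)] = generate(artist_id)


def client_fetch(artist_id: str, sleep: Callable[[float], None],
                 max_attempts: int = 3) -> Optional[bytes]:
    """Client side: one call to the server, then up to max_attempts reads from S3."""
    url, wait = handle_request(artist_id)
    sleep(wait)
    for _ in range(max_attempts):
        page = s3_objects.get(url)  # stands in for an HTTP GET of the URL
        if page is not None:
            return page
        sleep(wait)
    return None
```

Passing `sleep` in as a parameter keeps the sketch testable; a real client would wait with a timer and would also need to decide what to show the user if all attempts fail.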


  2. Check if data already exists, and make the second network call accordingly.

    This is similar to what you currently do when serving data from the server in case it doesn't already exist. Again, we make 2 requests here; however, this time we try to serve the data synchronously from the server when it doesn't exist on S3.

    So, in the first hit, we check if the content has ever been generated previously, and get back either a URL on success or an error message. When successful, the next hit goes to S3.

    If the data doesn't exist on S3, we make a fresh request (to a different POST URL), on receiving which the server computes the data and serves it, while adding an asynchronous task to push it to S3.

                               if exists on S3, return URL
      client --------> server --------------------------------> s3

      client --------> server ---------> generate content -------> serve it
                                               |
                                               |---> add SQS task to push to S3
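A rough sketch of this second approach follows, again with in-memory stand-ins rather than real AWS calls. The names `check`, `generate_and_serve`, `drain_upload_queue`, `client_get` and the bucket URL are hypothetical; a list plays the role of the SQS queue that carries the deferred upload.

```python
from typing import Callable, Optional, Tuple

s3_objects = {}    # stand-in for the S3 bucket: URL -> page bytes
upload_queue = []  # stand-in for the SQS queue of pending uploads


def url_for(artist_id: str) -> str:
    return f"https://example-bucket.s3.amazonaws.com/artists/{artist_id}.html"


def check(artist_id: str) -> Tuple[int, Optional[str]]:
    """First request: (200, url) if the page is on S3, else (404, None)."""
    url = url_for(artist_id)
    if url in s3_objects:
        return 200, url
    return 404, None


def generate_and_serve(artist_id: str, generate: Callable[[str], bytes]) -> bytes:
    """Second request on a miss: build the page, queue the upload, serve it."""
    page = generate(artist_id)
    upload_queue.append((url_for(artist_id), page))  # pushed to S3 later
    return page


def drain_upload_queue() -> None:
    """Asynchronous worker: copy queued pages into S3."""
    while upload_queue:
        url, page = upload_queue.pop(0)
        s3_objects[url] = page


def client_get(artist_id: str, generate: Callable[[str], bytes]) -> bytes:
    """Client side: ask where the page is, then fetch from S3 or from the server."""
    status, url = check(artist_id)
    if status == 200:
        return s3_objects[url]  # stands in for an HTTP GET against S3
    return generate_and_serve(artist_id, generate)
```

In this sketch the page is generated again if a second client asks for it before the upload queue has been drained; whether that matters depends on how long uploads lag behind.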

CloudFront caches redirects, but does not follow them: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#ResponseCustomRedirects .
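For the redirect variant raised in the question (the server checks whether the page exists and redirects to S3 instead of downloading it), the server-side decision could look roughly like the sketch below. The handler returns a plain `(status, headers, body)` tuple rather than using any particular web framework, and `handle_artist_request`, `s3_url` and the bucket URL are hypothetical names.

```python
from typing import Callable, Dict, Tuple

BUCKET_BASE_URL = "https://example-bucket.s3.amazonaws.com"  # hypothetical bucket

Response = Tuple[int, Dict[str, str], bytes]


def s3_url(artist_id: str) -> str:
    return f"{BUCKET_BASE_URL}/artists/{artist_id}.html"


def handle_artist_request(
    artist_id: str,
    exists_on_s3: Callable[[str], bool],
    generate: Callable[[str], bytes],
) -> Response:
    """Redirect to S3 when the page exists; otherwise generate and serve it."""
    if exists_on_s3(artist_id):
        # The body stays empty: the page bytes never pass through the server.
        return 302, {"Location": s3_url(artist_id)}, b""
    return 200, {"Content-Type": "text/html; charset=utf-8"}, generate(artist_id)
```

Since CloudFront caches the redirect response without following it, the browser is the one that follows the `Location` header to S3.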

You did not provide specific numbers, but would it work for you to pregenerate all these pages, put them in S3, and point CloudFront directly at S3?

If it is doable, there are a couple of benefits:

  1. You will decouple content generation from content serving, which will make the system more stable overall.
  2. Performance requirements for the content generator will be much lower, as it can regenerate content as slowly as it likes.

Of course, if you don't know in advance which pages you have to generate, it won't work.

Although I've not done it before, this would be a technique I'd look at.

  • Start by setting up the S3 bucket as you've described, as a "redirect" for a website.

  • Have a look at the S3 Event Handlers. They only fire when an S3 object is created, but you could perhaps try doing a GET to start with and, if it fails, respond with a POST or PUT to that same path, placing a "marker" file there or calling an API that will trigger an event.

https://aws.amazon.com/blogs/aws/s3-event-notification/ http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html

  • Once the event is triggered, either have your server listen via SQS for an event, or move your artist creator code into AWS Lambda, which will feed off SNS.

My only concern is where that GET will be coming from. You don't want anyone hitting your S3 bucket with an invalid POST, or you'd be generating pages all over the place. But I'll leave that as an exercise for the reader.
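As a rough illustration of the event-driven idea, the sketch below picks the artists to generate out of an S3 "object created" notification. It assumes the function receives the S3 notification payload directly (if it arrives wrapped in an SNS or SQS message, it would need unwrapping first), and the `.marker` suffix and the function name `artists_to_generate` are hypothetical conventions, not anything S3 defines.

```python
from typing import List
from urllib.parse import unquote_plus

MARKER_SUFFIX = ".marker"  # hypothetical naming convention for marker files


def artists_to_generate(event: dict) -> List[str]:
    """Return the object keys (minus the marker suffix) that need a page built."""
    keys = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(MARKER_SUFFIX):
            keys.append(key[: -len(MARKER_SUFFIX)])
    return keys
```

The generator (on the server or in Lambda) would then build each page and write it back to the bucket under the key without the marker suffix.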

Why not put a web server like nginx or Apache in front of Tomcat? That is, Tomcat runs on some other port like 8085 and the web server runs on 80. It receives the hits and has its own cache. Then you don't need S3 at all and can go back to your server + CloudFront.

So CloudFront hits your web server; if the page is in its cache, it returns the page directly. Otherwise the request goes to Tomcat.

The cache can be in the same process or in Redis, depending on the total size of the data you need to cache.
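The in-process option can be sketched as a small time-to-live cache. This is an illustration of the idea only, not nginx or Redis configuration; the class name `TTLCache`, its methods and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Keep generated pages in memory and rebuild them after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}  # key -> (expiry, page)

    def get_or_create(self, key: str, factory: Callable[[str], bytes]) -> bytes:
        """Return the cached page, calling factory only on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = factory(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With pages of up to 20 MB each, an unbounded in-memory cache like this one would need a size limit before real use, which is where an external store such as Redis starts to look attractive.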
