
Streaming large files in a java servlet

I am building a java server that needs to scale. One of the servlets will be serving images stored in Amazon S3.

Recently, under load, I ran out of memory in my VM. It happened after I added the code to serve the images, so I'm pretty sure that streaming larger servlet responses is causing my troubles.

My question is: is there any best practice for how to code a java servlet to stream a large (>200k) response back to a browser when it is read from a database or other cloud storage?

I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the Tomcat servlet thread can be re-used. This seems like it would be I/O heavy.

Any thoughts would be appreciated. Thanks.

When possible, you should not store the entire contents of a file to be served in memory. Instead, acquire an InputStream for the data and copy the data to the servlet OutputStream in pieces. For example:

ServletOutputStream out = response.getOutputStream();
InputStream in = [ code to get source input stream ];
String mimeType = [ code to get mimetype of data to be served ];
byte[] bytes = new byte[FILEBUFFERSIZE];
int bytesRead;

response.setContentType(mimeType);

try {
    while ((bytesRead = in.read(bytes)) != -1) {
        out.write(bytes, 0, bytesRead);
    }
} finally {
    // close both streams even if the copy fails
    in.close();
    out.close();
}

I do agree with toby: you should instead "point them to the S3 url."

As for the OOM exception, are you sure it has to do with serving the image data? Let's say your JVM has 256MB of "extra" memory to use for serving image data. With Google's help, "256MB / 200KB" = 1310. For 2GB of "extra" memory (these days a very reasonable amount), over 10,000 simultaneous clients could be supported. Even so, 1300 simultaneous clients is a pretty large number. Is this the type of load you experienced? If not, you may need to look elsewhere for the cause of the OOM exception.

Edit - Regarding:

In this use case the images can contain sensitive data...

When I read through the S3 documentation a few weeks ago, I noticed that you can generate time-expiring keys that can be attached to S3 URLs. So, you would not have to open up the files on S3 to the public. My understanding of the technique is:

  1. Initial HTML page has download links to your webapp
  2. User clicks on a download link
  3. Your webapp generates an S3 URL that includes a key that expires in, let's say, 5 minutes.
  4. Send an HTTP redirect to the client with the URL from step 3 (see the sketch after this list).
  5. The user downloads the file from S3. This works even if the download takes more than 5 minutes - once a download starts it can continue through completion.
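
As a rough sketch of steps 3 and 4, assuming the AWS SDK for Java is available (the class name, bucket, and key below are placeholders for illustration, not anything from the original post):

import java.io.IOException;
import java.util.Date;
import javax.servlet.http.HttpServletResponse;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3RedirectHelper {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Step 3: generate a pre-signed URL whose key expires in 5 minutes.
    // Step 4: redirect the client so the actual download comes from S3.
    public void redirectToObject(HttpServletResponse response,
                                 String bucketName, String objectKey) throws IOException {
        Date expiration = new Date(System.currentTimeMillis() + 5 * 60 * 1000);
        String url = s3.generatePresignedUrl(bucketName, objectKey, expiration).toString();
        response.sendRedirect(url);
    }
}

Because the browser then talks to S3 directly, the servlet thread is tied up only for the redirect, not for the whole download.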

Why wouldn't you just point them to the S3 url? Taking an artifact from S3 and then streaming it through your own server defeats, to me, the purpose of using S3, which is to offload to Amazon the bandwidth and the processing of serving the images.

I've seen a lot of code like john-vasilef's (currently accepted) answer: a tight while loop reading chunks from one stream and writing them to the other stream.

The argument I'd make is against needless code duplication, in favor of using Apache's IOUtils. If you are already using it elsewhere, or if another library or framework you're using already depends on it, it's a single line that is known and well-tested.

In the following code, I'm streaming an object from Amazon S3 to the client in a servlet.

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;

InputStream in = null;
OutputStream out = null;

try {
    in = object.getObjectContent();
    out = response.getOutputStream();
    IOUtils.copy(in, out);
} finally {
    IOUtils.closeQuietly(in);
    IOUtils.closeQuietly(out);
}

6 lines of a well-defined pattern with proper stream closing seems pretty solid.

I agree strongly with both toby and John Vasileff: S3 is great for offloading large media objects if you can tolerate the associated issues. (An instance of my own app does that for 10-1000MB FLVs and MP4s.) E.g.: no partial requests (byte range header), though. One has to handle that 'manually', plus occasional downtime, etc.

If that is not an option, John's code looks good. I have found that a 2k FILEBUFFERSIZE byte buffer is the most efficient in microbenchmarks. Another option might be a shared FileChannel. (FileChannels are thread-safe.)
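
As a rough sketch of that shared-FileChannel idea (the class name and setup here are made up for illustration, and it assumes the content is available as a local file):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import javax.servlet.ServletOutputStream;

public class SharedChannelSource {
    // A single FileChannel can be shared by many request threads as long
    // as every transfer passes an explicit position.
    private final FileChannel channel;

    public SharedChannelSource(String path) throws IOException {
        this.channel = new FileInputStream(path).getChannel();
    }

    public void copyTo(ServletOutputStream out) throws IOException {
        WritableByteChannel target = Channels.newChannel(out);
        long size = channel.size();
        long position = 0;
        // transferTo with an explicit position never moves the channel's
        // own position, so concurrent requests do not interfere.
        while (position < size) {
            position += channel.transferTo(position, size - position, target);
        }
    }
}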

That said, I'd also add that guessing at what caused an out of memory error is a classic optimization mistake. You would improve your chances of success by working with hard metrics.

  1. Place -XX:+HeapDumpOnOutOfMemoryError into your JVM startup parameters, just in case
  2. Use jmap on the running JVM ( jmap -histo <pid> ) under load
  3. Analyze the metrics (the jmap -histo output, or have jhat look at your heap dump). It very well may be that your out-of-memory error is coming from somewhere unexpected.

There are of course other tools out there, but jmap & jhat come with Java 5+ 'out of the box'.

I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the Tomcat servlet thread can be re-used. This seems like it would be I/O heavy.

Ah, I don't think you can do that. And even if you could, it sounds dubious. The Tomcat thread that is managing the connection needs to stay in control. If you are experiencing thread starvation, then increase the number of available threads in ./conf/server.xml. Again, metrics are the way to detect this; don't just guess.

Question: Are you also running on EC2? What are your Tomcat's JVM startup parameters?

toby is right, you should be pointing straight to S3, if you can. If you cannot, the question is a little too vague to give an accurate response: How big is your Java heap? How many streams are open concurrently when you run out of memory?
How big is your read/write buffer (8K is good)?
You are reading 8K from the stream, then writing 8K to the output, right? You are not trying to read the whole image from S3, buffer it in memory, and then send the whole thing at once?

If you use 8K buffers, you could have 1000 concurrent streams going in ~8MB of heap space, so you are definitely doing something wrong...

BTW, I did not pick 8K out of thin air; it is the default size for socket buffers. Send more data, say 1MB, and you will be blocking on the TCP/IP stack while holding a large amount of memory.

If you can structure your files so that the static files are separated out into their own bucket, the fastest performance today can be achieved by using the Amazon S3 CDN, CloudFront.

You have to check two things:

  • Are you closing the stream? Very important
  • Maybe you're giving out stream connections "for free". Each stream is not large, but many streams at the same time can steal all your memory. Create a pool so that you cannot have more than a certain number of streams running at the same time (see the sketch after this list).
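
As a rough sketch of that pooling idea (the limit of 100 and the class name are made-up values for illustration), a java.util.concurrent.Semaphore can cap how many copies run at once:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.Semaphore;

public class StreamLimiter {
    // Hypothetical cap: at most 100 responses being streamed at any moment.
    private static final Semaphore PERMITS = new Semaphore(100);

    public static void copyLimited(InputStream in, OutputStream out) throws IOException {
        try {
            PERMITS.acquire();               // block until a slot is free
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for a stream slot");
        }
        try {
            byte[] buffer = new byte[8 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            PERMITS.release();               // always give the slot back
        }
    }
}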

In addition to what John suggested, you should repeatedly flush the output stream. Depending on your web container, it is possible that it caches parts or even all of your output and flushes it at once (for example, to calculate the Content-Length header). That would burn quite a bit of memory.
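
A minimal sketch of what that looks like in the copy loop (the 64K flush interval is an arbitrary choice for illustration):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class FlushingCopy {
    // Copy in 8K chunks and flush roughly every 64K so the container
    // cannot buffer the whole response in memory.
    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8 * 1024];
        int read;
        long sinceFlush = 0;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            sinceFlush += read;
            if (sinceFlush >= 64 * 1024) {
                out.flush();
                sinceFlush = 0;
            }
        }
        out.flush();  // push out anything still buffered
    }
}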
