
JAI: How do I extract a single page input stream from a multipaged TIFF image container?

I have a component that converts PDF documents to images, one image per page. Since the component uses converters that produce in-memory images, it hits the JVM heap heavily and takes some time to finish conversions.

I'm trying to improve the overall performance of the conversion process, and found a native library with a JNI binding to convert PDFs to TIFFs. That library can only convert a PDF to a single TIFF file (it requires intermediate file system storage and does not even consume streams), so the resulting TIFF file has all converted pages embedded rather than per-page images on the file system. The native library makes the conversion itself drastically faster, but there is a real bottleneck: since I need a source-page to destination-page conversion, I now have to extract every page from the result file and write each of them elsewhere. A simple and naive approach with RenderedImages:

final SeekableStream seekableStream = new FileSeekableStream(tempFile);
final ImageDecoder imageDecoder = ImageCodec.createImageDecoder("tiff", seekableStream, null);
...
//                                               V--- heap is wasted here
final RenderedImage renderedImage = imageDecoder.decodeAsRenderedImage(pageNumber);
// ... do the rest stuff ...

What I would really like is to extract the input stream of a specific page from the TIFF container file (tempFile) and redirect it elsewhere without storing it as an in-memory image. I imagine an approach similar to container processing, where I seek to a specific entry and extract data from it (something like ZIP file processing). But I couldn't find anything like that in ImageDecoder, or my expectations are wrong and I'm missing something important here...

Is it possible to extract TIFF container page input streams using the JAI API or third-party alternatives? Thanks in advance.

I could be wrong, but I don't think JAI supports splitting TIFFs without decoding the files to in-memory images. And, sorry for promoting my own library, but I think it does exactly what you need (the main part of the solution used to split TIFFs was contributed by a third party).

By using the TIFFUtilities class from com.twelvemonkeys.contrib.tiff, you should be able to split your multi-page TIFF into multiple single-page TIFFs like this:

TIFFUtilities.split(tempFile, new File("output"));

No decoding of the images is done; each IFD is only split into a separate file, and the streams are written with corrected offsets and byte counts.
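To illustrate why no decoding is needed: in a classic TIFF, each page is described by an IFD (image file directory), and the IFDs form a linked list starting from the file header. The sketch below is not the library's code. It is a minimal, standalone illustration with hypothetical names (TiffIfdWalker, ifdOffsets, sampleTwoPageTiff) that walks that chain to locate each page without touching pixel data. It assumes a classic (non-BigTIFF) file that fits in memory, and it does not copy strips or rewrite offsets, which a real splitter must also do.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: walks the IFD chain of a classic TIFF without decoding pixels. */
final class TiffIfdWalker {

    /** Returns the file offset of every IFD (one per page), in chain order. */
    static List<Long> ifdOffsets(byte[] tiff) {
        ByteBuffer buf = ByteBuffer.wrap(tiff);
        // Bytes 0-1 are the byte-order mark: "II" = little-endian, "MM" = big-endian.
        if (tiff[0] == 'I' && tiff[1] == 'I') {
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (tiff[0] == 'M' && tiff[1] == 'M') {
            buf.order(ByteOrder.BIG_ENDIAN);
        } else {
            throw new IllegalArgumentException("Not a TIFF: bad byte-order mark");
        }
        // Bytes 2-3 hold the magic number 42 for classic TIFF.
        if (buf.getShort(2) != 42) {
            throw new IllegalArgumentException("Not a classic TIFF: bad magic number");
        }
        List<Long> offsets = new ArrayList<>();
        // Bytes 4-7 hold the offset of the first IFD.
        long offset = buf.getInt(4) & 0xFFFFFFFFL;
        while (offset != 0) {
            if (offsets.contains(offset)) {
                throw new IllegalArgumentException("Cyclic IFD chain");
            }
            offsets.add(offset);
            int pos = (int) offset;
            int entryCount = buf.getShort(pos) & 0xFFFF;
            // Each entry is 12 bytes; the 4 bytes after the last entry
            // hold the offset of the next IFD (0 terminates the chain).
            offset = buf.getInt(pos + 2 + entryCount * 12) & 0xFFFFFFFFL;
        }
        return offsets;
    }

    /** Builds a tiny two-IFD, little-endian TIFF skeleton (no image data) for demonstration. */
    static byte[] sampleTwoPageTiff() {
        ByteBuffer buf = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        // Header (8 bytes): "II", magic 42, first IFD at offset 8.
        buf.put((byte) 'I').put((byte) 'I').putShort((short) 42).putInt(8);
        // IFD at offset 8: one entry (tag 256 ImageWidth, type 3 SHORT, count 1, value 1).
        buf.putShort((short) 1);
        buf.putShort((short) 256).putShort((short) 3).putInt(1).putInt(1);
        buf.putInt(26); // next IFD starts at offset 26
        // IFD at offset 26: same single entry, then 0 to end the chain.
        buf.putShort((short) 1);
        buf.putShort((short) 256).putShort((short) 3).putInt(1).putInt(1);
        buf.putInt(0);
        return buf.array();
    }

    public static void main(String[] args) {
        List<Long> offsets = ifdOffsets(sampleTwoPageTiff());
        System.out.println("Pages: " + offsets.size() + ", IFD offsets: " + offsets);
    }
}
```

Because the page boundaries are recoverable from these offsets alone, a splitter can copy each page's directory and data to a new file and only has to fix up the offsets stored inside it.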

Files will be named output/0001.tif, output/0002.tif etc. If you need more control over the output name or have other requirements, you can easily modify the code. The code comes with a BSD-style license.
