JAI: How do I extract a single page input stream from a multi-page TIFF image container?
I have a component that converts PDF documents to images, one image per page. Since the component uses converters that produce in-memory images, it puts heavy pressure on the JVM heap and takes some time to finish conversions.
I'm trying to improve the overall performance of the conversion process, and found a native library with a JNI binding that converts PDFs to TIFFs. That library can only convert a PDF to a single TIFF file (it requires intermediate file system storage and does not even accept streams), so the resulting TIFF file has all converted pages embedded, rather than per-page images on the file system.
Using the native library makes the overall conversion drastically faster, but there is a real bottleneck: since I need a source-page to destination-page conversion, I now have to extract every page from the result file and write each of them elsewhere. A simple and naive approach with RenderedImages:
final SeekableStream seekableStream = new FileSeekableStream(tempFile);
final ImageDecoder imageDecoder = createImageDecoder("tiff", seekableStream, null);
...
// V--- heap is wasted here
final RenderedImage renderedImage = imageDecoder.decodeAsRenderedImage(pageNumber);
// ... do the rest stuff ...
Actually, I would really like to just extract a specific page's input stream from the TIFF container file (tempFile) and redirect it elsewhere without storing it as an in-memory image. I imagine an approach similar to container processing, where I seek to a specific entry and extract data from it (something like ZIP file processing). But I couldn't find anything like that in ImageDecoder, or my expectations are probably wrong and I'm just missing something important here...
Is it possible to extract TIFF container page input streams using the JAI API, or perhaps third-party alternatives? Thanks in advance.
I could be wrong, but I don't think JAI supports splitting TIFFs without decoding the files to in-memory images. And, sorry for promoting my own library, but I think it does exactly what you need (the main part of the solution used to split TIFFs was contributed by a third party).
By using the TIFFUtilities class from com.twelvemonkeys.contrib.tiff, you should be able to split your multi-page TIFF into multiple single-page TIFFs like this:
TIFFUtilities.split(tempFile, new File("output"));
No decoding of the images is done; each IFD is only split into a separate file, and the streams are written with corrected offsets and byte counts.
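To illustrate why no decoding is needed, here is a minimal sketch of how the pages of a TIFF can be located by following the IFD chain alone. It is not the TwelveMonkeys implementation: the class name TiffIfdWalker and the method ifdOffsets are hypothetical, it uses only the JDK, it assumes a classic (non-BigTIFF) file, and it only lists where each page's IFD starts rather than copying any strip or tile data.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of JAI or TwelveMonkeys).
public final class TiffIfdWalker {

    private TiffIfdWalker() {
    }

    /** Returns the file offset of every IFD (one per page) in a classic TIFF file. */
    public static List<Long> ifdOffsets(Path tiff) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(tiff.toFile(), "r")) {
            // Header: 2-byte byte-order mark, 2-byte magic number 42, 4-byte offset of the first IFD.
            byte[] header = new byte[8];
            file.readFully(header);

            ByteOrder order;
            if (header[0] == 'I' && header[1] == 'I') {
                order = ByteOrder.LITTLE_ENDIAN;
            } else if (header[0] == 'M' && header[1] == 'M') {
                order = ByteOrder.BIG_ENDIAN;
            } else {
                throw new IOException("Not a TIFF file: bad byte-order mark");
            }

            ByteBuffer headerBuffer = ByteBuffer.wrap(header).order(order);
            if (headerBuffer.getShort(2) != 42) {
                throw new IOException("Not a classic TIFF file: magic number is not 42");
            }

            List<Long> offsets = new ArrayList<>();
            long offset = headerBuffer.getInt(4) & 0xFFFFFFFFL;
            byte[] scratch = new byte[4];

            // An offset of 0 marks the end of the IFD chain.
            while (offset != 0) {
                if (offsets.contains(offset)) {
                    throw new IOException("Corrupt TIFF: IFD chain loops back on itself");
                }
                offsets.add(offset);

                // Each IFD starts with a 2-byte entry count.
                file.seek(offset);
                file.readFully(scratch, 0, 2);
                int entryCount = ByteBuffer.wrap(scratch).order(order).getShort(0) & 0xFFFF;

                // Entries are 12 bytes each; the 4-byte offset of the next IFD follows them.
                file.seek(offset + 2 + 12L * entryCount);
                file.readFully(scratch, 0, 4);
                offset = ByteBuffer.wrap(scratch).order(order).getInt(0) & 0xFFFFFFFFL;
            }
            return offsets;
        }
    }
}
```

Calling TiffIfdWalker.ifdOffsets(tempFile.toPath()).size() would give the page count without touching pixel data. A real splitter additionally has to copy each page's strips or tiles and rewrite their offsets, which is the part the library handles.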
Files will be named output/0001.tif, output/0002.tif etc. If you need more control over the output name or have other requirements, you can easily modify the code. The code comes with a BSD-style license.
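If only the file names need to change, one alternative to modifying the library code is to rename the split files afterwards. The following is a rough sketch under that assumption; the class SplitPageRenamer, the method renamePages, and the "prefix-page-N.tif" naming scheme are all hypothetical, and it relies only on the output naming described above (0001.tif, 0002.tif, ...).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical post-processing step, for illustration only.
public final class SplitPageRenamer {

    private SplitPageRenamer() {
    }

    /** Renames NNNN.tif files in outputDir to "<prefix>-page-<n>.tif" and returns the new paths in page order. */
    public static List<Path> renamePages(Path outputDir, String prefix) throws IOException {
        List<Path> pages;
        try (Stream<Path> files = Files.list(outputDir)) {
            pages = files
                    // Only pick up files that follow the numeric naming of the split output.
                    .filter(p -> p.getFileName().toString().matches("\\d+\\.tif"))
                    // Zero-padded names sort into page order.
                    .sorted()
                    .collect(Collectors.toList());
        }

        List<Path> renamed = new ArrayList<>();
        for (Path page : pages) {
            String name = page.getFileName().toString();
            int pageNumber = Integer.parseInt(name.substring(0, name.length() - ".tif".length()));
            Path target = outputDir.resolve(prefix + "-page-" + pageNumber + ".tif");
            // Files.move fails if the target already exists, so nothing is silently overwritten.
            renamed.add(Files.move(page, target));
        }
        return renamed;
    }
}
```

For example, SplitPageRenamer.renamePages(new File("output").toPath(), "report") would turn output/0001.tif into output/report-page-1.tif.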