
pdfbox and itext extracting image with incorrect dpi

When I extract an image using pdfbox, I get an incorrect dpi for the image for some PDFs. When I extract the image using Photoshop or Acrobat Reader Pro, Windows Photo Viewer shows that its dpi is 200, but when I extract the image using pdfbox the dpi is 72.

To extract the image I am using the code from this question: Not able to extract images from PDFA1-a format document

When I check the logs I see an unusual entry: 2015-01-23-main--DEBUG-org.apache.pdfbox.util.TIFFUtil:

<?xml version="1.0" encoding="UTF-8"?>
<javax_imageio_jpeg_image_1.0>
  <JPEGvariety>
    <app0JFIF majorVersion="1" minorVersion="2" resUnits="0" Xdensity="1" Ydensity="1" thumbWidth="0" thumbHeight="0"/>
  </JPEGvariety>
  <markerSequence>
    <dqt>
      <dqtable elementPrecision="0" qtableId="0"/>
      <dqtable elementPrecision="0" qtableId="1"/>
    </dqt>
    <dht>
      <dhtable class="0" htableId="0"/>
      <dhtable class="0" htableId="1"/>
      <dhtable class="1" htableId="0"/>
      <dhtable class="1" htableId="1"/>
    </dht>
    <sof process="0" samplePrecision="8" numLines="0" samplesPerLine="0" numFrameComponents="3">
      <componentSpec componentId="1" HsamplingFactor="2" VsamplingFactor="2" QtableSelector="0"/>
      <componentSpec componentId="2" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/>
      <componentSpec componentId="3" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/>
    </sof>
    <sos numScanComponents="3" startSpectralSelection="0" endSpectralSelection="63" approxHigh="0" approxLow="0">
      <scanComponentSpec componentSelector="1" dcHuffTable="0" acHuffTable="0"/>
      <scanComponentSpec componentSelector="2" dcHuffTable="1" acHuffTable="1"/>
      <scanComponentSpec componentSelector="3" dcHuffTable="1" acHuffTable="1"/>
    </sos>
  </markerSequence>
</javax_imageio_jpeg_image_1.0>

I tried to google it, but I can't seem to find out what pdfbox means by this log entry. What does this mean?

You can download a sample pdf with this problem from this link: http://myslams.com/test/1.pdf

I have even tried itext, but it extracts the image with 96 dpi.

Am I doing something wrong? Or do pdfbox and itext have this limitation?

After some digging I found your 1.pdf. Thus,...

PDFBox

In the comments on this recent answer, @Tilman and you were discussing this older answer, in which @Tilman pointed to the PrintImageLocations PDFBox example. I ran it for your file and got:

Processing page: 0
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 888px
size = 613.44, 319.68
size = 8.52in, 4.44in
size = 216.408mm, 112.776mm

Processing page: 1
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 2800px
size = 613.44, 1008.0
size = 8.52in, 14.0in
size = 216.408mm, 355.6mm

Processing page: 2
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 2800px
size = 613.44, 1008.0
size = 8.52in, 14.0in
size = 216.408mm, 355.6mm

Processing page: 3
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 1464px
size = 613.44, 527.04
size = 8.52in, 7.3199997in
size = 216.408mm, 185.928mm

On all pages this amounts to 200 dpi in both the x and y directions (1704px / 8.52in = 888px / 4.44in = 2800px / 14.0in = 1464px / 7.32in = 200 dpi).
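The arithmetic above can be written as a small helper. This is only a sketch: the class name `ImageDpi` and the method `dpi` are made up for illustration, and it assumes the default PDF user space unit of 1/72 inch and an image that is neither rotated nor skewed.

```java
import java.util.Locale;

class ImageDpi {
    /** Default PDF user space units per inch (assumption: the page sets no custom UserUnit). */
    static final double UNITS_PER_INCH = 72.0;

    /**
     * Effective resolution along one axis.
     *
     * @param pixels    number of image samples along the axis (Width or Height of the image)
     * @param userUnits size of the drawn image along the same axis, in user space units
     */
    static double dpi(int pixels, double userUnits) {
        double inches = userUnits / UNITS_PER_INCH;
        return pixels / inches;
    }

    public static void main(String[] args) {
        // Values taken from the PrintImageLocations output for page 0 above.
        System.out.printf(Locale.ROOT, "x: %.1f dpi, y: %.1f dpi%n",
                dpi(1704, 613.44), dpi(888, 319.68));
    }
}
```

Passing `Locale.ROOT` to `printf` keeps the decimal separator a dot regardless of the machine's default locale.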

So PDFBox gives you the dpi values you are after.

(@Tilman: The current 2.0.0-SNAPSHOT version of that sample returns utter nonsense; you might want to fix this.)

iText

A simplified iText version of that PDFBox example would be this:

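// Imports for iText 5.x
import java.io.IOException;
import java.io.InputStream;

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.Matrix;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// Both members below belong inside a class of your own.
public void printImageLocations(InputStream stream) throws IOException
{
    PdfReader reader = new PdfReader(stream);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    ImageRenderListener listener = new ImageRenderListener();

    // iText page numbers are 1-based.
    for (int page = 1; page <= reader.getNumberOfPages(); page++)
    {
        System.out.printf("\nPage %s:\n", page);
        parser.processContent(page, listener);
    }
}

static class ImageRenderListener implements RenderListener
{
    // Text events are irrelevant here; only images are reported.
    public void beginTextBlock() { }
    public void renderText(TextRenderInfo renderInfo) { }
    public void endTextBlock() { }

    public void renderImage(ImageRenderInfo renderInfo)
    {
        try
        {
            PdfDictionary imageDict = renderInfo.getImage().getDictionary();

            // Size of the image resource in pixels.
            float widthPx = imageDict.getAsNumber(PdfName.WIDTH).floatValue();
            float heightPx = imageDict.getAsNumber(PdfName.HEIGHT).floatValue();
            // Size of the drawn image in user units (1/72 in), read from the scaling entries of the CTM.
            float widthUu = renderInfo.getImageCTM().get(Matrix.I11);
            float heightUu = renderInfo.getImageCTM().get(Matrix.I22);

            System.out.printf("Image %.0fpx*%.0fpx, %.0fuu*%.0fuu, %.2fin*%.2fin\n", widthPx, heightPx, widthUu, heightUu, widthUu/72, heightUu/72);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}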

(Beware: I assumed unrotated and unskewed images.)

The results for your file:

Page 1:
Image 1704px*888px, 613uu*320uu, 8,52in*4,44in

Page 2:
Image 1704px*2800px, 613uu*1008uu, 8,52in*14,00in

Page 3:
Image 1704px*2800px, 613uu*1008uu, 8,52in*14,00in

Page 4:
Image 1704px*1464px, 613uu*527uu, 8,52in*7,32in

Thus, this is also 200 dpi throughout. So iText, too, gives you the dpi values you are after.

Your code

Obviously the code you referenced had no chance to report a dpi value that is sensible in the context of the PDF, because it only extracts the images as found in the resources but ignores how the respective image resource is used on the page.

An image resource can be stretched, rotated, skewed, ... any way the author likes when they use it in the page content.

BTW, a dpi value only makes sense if the author did not skew the image and rotated it only by a multiple of 90°.
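To illustrate that last point, here is a hedged sketch that works on the four linear entries a, b, c, d of the image's transformation matrix (in iText 5 these should correspond to Matrix.I11, I12, I21 and I22). The class `CtmImageDpi` and its methods are hypothetical names, not part of PDFBox or iText, and the rotated matrix in `main` is a made-up variant of the page 1 image rather than something found in 1.pdf. The drawn width is taken as the length of the vector (a, b) and the drawn height as the length of (c, d), and a dpi value is only reported if the image is upright or turned by a multiple of 90°.

```java
import java.util.Locale;

class CtmImageDpi {
    /** Default PDF user space units per inch (assumption: no custom UserUnit). */
    static final double UNITS_PER_INCH = 72.0;

    /** True if the matrix only scales (and possibly flips) or additionally turns by 90 degrees. */
    static boolean isAxisAligned(double a, double b, double c, double d) {
        double eps = 1e-6 * Math.max(Math.hypot(a, b), Math.hypot(c, d));
        boolean upright = Math.abs(b) <= eps && Math.abs(c) <= eps;
        boolean quarterTurn = Math.abs(a) <= eps && Math.abs(d) <= eps;
        return upright || quarterTurn;
    }

    /** Returns {xDpi, yDpi} along the image's own axes, or null if no sensible dpi exists. */
    static double[] dpi(int widthPx, int heightPx, double a, double b, double c, double d) {
        if (!isAxisAligned(a, b, c, d)) {
            return null; // skewed or rotated by an odd angle
        }
        double widthUu = Math.hypot(a, b);   // drawn length of the image's width edge
        double heightUu = Math.hypot(c, d);  // drawn length of the image's height edge
        return new double[] {
            widthPx * UNITS_PER_INCH / widthUu,
            heightPx * UNITS_PER_INCH / heightUu
        };
    }

    public static void main(String[] args) {
        // Page 1 image as drawn in the file: pure scaling.
        double[] upright = dpi(1704, 888, 613.44, 0, 0, 319.68);
        // Hypothetical: the same image turned by 90 degrees.
        double[] turned = dpi(1704, 888, 0, 613.44, -319.68, 0);
        System.out.printf(Locale.ROOT, "%.1f x %.1f dpi%n", upright[0], upright[1]);
        System.out.printf(Locale.ROOT, "%.1f x %.1f dpi%n", turned[0], turned[1]);
    }
}
```

Returning null for skewed or oddly rotated images mirrors the caveat above: in those cases the pixel grid no longer lines up with the page axes, so a single dpi figure would be misleading.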
