
How can I get the resolution of an embedded image in a PDF using iTextSharp

I've built a method that tries to see if the resolution of all the embedded images in a given pdf is at least 300 PPI (print-worthy). What it does is cycle through each image on a page, and compare its width and height to the artbox width and height. It works successfully if there is only one image per page, but when there are multiple, the artbox size includes all of the images, throwing the numbers off.

I was hoping that somebody might have some idea of how to get the rectangle size the image is drawn in so I can compare correctly, or if there is an easier way to get the PPI of an image object (as it would be rendered in its rectangle, not in raw form).

This is the code for that method:

    private static bool AreImages300PPI(PdfDictionary pg)
    {
        var res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
        var xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
        if (xobj == null) return true;
        foreach (PdfName name in xobj.Keys)
        {
            PdfObject obj = xobj.Get(name);
            if (!obj.IsIndirect()) continue;
            var tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
            var type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
            var width = float.Parse(tg.Get(PdfName.WIDTH).ToString());
            var height = float.Parse(tg.Get(PdfName.HEIGHT).ToString());
            var artbox = (PdfArray) pg.Get(PdfName.ARTBOX);
            var pdfRect = new PdfRectangle(float.Parse(artbox[0].ToString()), float.Parse(artbox[1].ToString()),
                float.Parse(artbox[2].ToString()), float.Parse(artbox[3].ToString()));

            if (PdfName.IMAGE.Equals(type) && (width < pdfRect.Width*300/72 || height < pdfRect.Height*300/72)
                || ((PdfName.FORM.Equals(type) || PdfName.GROUP.Equals(type)) && !AreImages300PPI(tg)))
            {
                return false;
            }
        }
        return true;
    }

For reference, here is the method that calls it:

    internal static List<string> GetLowResWarnings(string MergedPDFPath)
    {
        var returnlist = new List<string>();
        using (PdfReader pdf = new PdfReader(MergedPDFPath))
        {
            for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
            {
                var pg = pdf.GetPageN(pageNumber);
                if (!AreImages300PPI(pg))
                    returnlist.Add(pageNumber.ToString());
            }
        }
        return returnlist;
    }

Thanks for any help you can provide.

Can I put you down a totally different path? You're looking at images that live in the global file but you're not seeing how they're used in a page.

iTextSharp has a class called iTextSharp.text.pdf.parser.PdfReaderContentParser that can walk a PdfReader and tell you things about it. You subscribe to that information by implementing the iTextSharp.text.pdf.parser.IRenderListener interface. For each image the parser encounters, it calls the RenderImage method of your class with an iTextSharp.text.pdf.parser.ImageRenderInfo object. From this object you can get both the actual image and the current transformation matrix, which tells you how the image is placed on the page.

Using this information you could create a class like this:

public class MyImageRenderListener : iTextSharp.text.pdf.parser.IRenderListener {
    //For each page keep a list of various image info
    public Dictionary<int, List<ImageScaleInfo>> Pages = new Dictionary<int, List<ImageScaleInfo>>();

    //Need to manually change the page when using this
    public int CurrentPage { get; set; }

    //Pass through the current page units
    public Single CurrentPageUnits { get; set; }

    //Not used, just interface contracts
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) { }

    //Called for each image
    public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo) {
        //Get the basic image info
        var img = renderInfo.GetImage().GetDrawingImage();
        var imgWidth = img.Width;
        var imgHeight = img.Height;
        img.Dispose();

        //Get the current transformation matrix.
        //I11 and I22 give the drawn width and height only when the image is not rotated or skewed.
        var ctm = renderInfo.GetImageCTM();
        var ctmWidth = ctm[iTextSharp.text.pdf.parser.Matrix.I11];
        var ctmHeight = ctm[iTextSharp.text.pdf.parser.Matrix.I22];

        //Create new key for our page number if it doesn't exist already
        if (!this.Pages.ContainsKey(CurrentPage)) {
            this.Pages.Add(CurrentPage, new List<ImageScaleInfo>());
        }

        //Add our image info to this page
        this.Pages[CurrentPage].Add(new ImageScaleInfo(imgWidth, imgHeight, ctmWidth, ctmHeight, this.CurrentPageUnits));
    }
}

It uses this helper class to store our information:

public class ImageScaleInfo {
    //The page's user-space units per inch, almost always 72
    public Single PageUnits { get; set; }

    //The image's actual dimensions
    public System.Drawing.SizeF ImgSize { get; set; }

    //How the image is placed into the page
    public System.Drawing.SizeF CtmSize { get; set; }

    //Automatically calculate how the image is scaled
    public Single ImgWidthScale { get { return ImgSize.Width / CtmSize.Width; } }
    public Single ImgHeightScale { get { return ImgSize.Height / CtmSize.Height; } }

    //Helper constructor
    public ImageScaleInfo(Single imgWidth, Single imgHeight, Single ctmWidth, Single ctmHeight, Single pageUnits) {
        this.ImgSize = new System.Drawing.SizeF(imgWidth, imgHeight);
        this.CtmSize = new System.Drawing.SizeF(ctmWidth, ctmHeight);
        this.PageUnits = pageUnits;
    }
}

Using it is really simple then:

//Create an instance of our helper class
var imgList = new MyImageRenderListener();

//Parse the PDF and inspect each image
using (var reader = new PdfReader(testFile)) {
    var proc = new iTextSharp.text.pdf.parser.PdfReaderContentParser(reader);
    for (var i = 1; i <= reader.NumberOfPages; i++) {
        //Get the page object itself
        var p = reader.GetPageN(i);

        //Get the page units. Per spec, /UserUnit gives the size of one user-space unit in multiples of 1/72 of an inch (default 1), so units per inch = 72 / UserUnit.
        var pageUnits = 72f / (p.Contains(PdfName.USERUNIT) ? p.GetAsNumber(PdfName.USERUNIT).FloatValue : 1f);

        //Set the page number so we can find it later
        imgList.CurrentPage = i;
        imgList.CurrentPageUnits = pageUnits;

        //Process the page
        proc.ProcessContent(i, imgList);
    }
}

//Dump out some information
foreach (var p in imgList.Pages) {
    foreach (var i in p.Value) {
        Console.WriteLine(String.Format("Image PPI is {0}x{1}", i.ImgWidthScale * i.PageUnits, i.ImgHeightScale * i.PageUnits));
    }
}
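
If the images in your PDFs may be rotated, the I11 and I22 entries alone will not give the drawn size. As a rough sketch of the arithmetic only (the class and method names below are hypothetical, not part of iTextSharp), you can take the length of each transformed axis instead and compare the result against the 300 PPI threshold from the question. I am assuming here that you pass in the matrix entries yourself, for example the I11, I12, I21 and I22 values of the CTM, and that userUnit is the page's /UserUnit value (1 when absent).

```csharp
using System;

public static class ImagePpi
{
    // Effective resolution along one image axis.
    // pixels:   raw pixel count of the image along that axis
    // m1, m2:   the CTM row that maps that axis into user space
    //           (assumed: I11/I12 for width, I21/I22 for height)
    // userUnit: size of one user-space unit in multiples of 1/72 inch
    public static double AxisPpi(int pixels, double m1, double m2, double userUnit = 1.0)
    {
        // Length of the transformed unit vector, so rotation does not matter
        double lengthInUnits = Math.Sqrt(m1 * m1 + m2 * m2);
        double lengthInInches = lengthInUnits * userUnit / 72.0;
        return pixels / lengthInInches;
    }

    // True when both axes reach the minimum resolution (300 PPI by default)
    public static bool IsPrintWorthy(int pxWidth, int pxHeight,
                                     double a, double b, double c, double d,
                                     double minPpi = 300.0, double userUnit = 1.0)
    {
        return AxisPpi(pxWidth, a, b, userUnit) >= minPpi
            && AxisPpi(pxHeight, c, d, userUnit) >= minPpi;
    }
}
```

For example, a 1200 pixel wide image drawn 288 units (4 inches) wide comes out at 300 PPI whether the matrix row is (288, 0) or, rotated by 90 degrees, (0, 288).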

EDIT

Following @BrunoLowagie's comments below, I've updated the code above to remove the "magic 72" and actually query the document to see whether the page units have been overridden. That is very unlikely to happen, but sooner or later someone will find an obscure PDF and complain that this code doesn't work, so better safe than sorry.
