
How can I extract 'clusters' of images from a PDF file?

I am aiming to extract 'clusters' of images from PDFs. If two or more images are touching each other, or are within 1-2 pixels of each other, I want them extracted as one image with a transparent background. Ideally the solution would run on Linux. Thanks!

IMHO, this is impossible for the general case...

Extracting images from a PDF page usually restores them in their original size. However, inside the PDF page they may have been embedded with a scaling or zooming factor applied. So each image may have a different resolution on the page.

Though it may appear as if the images are 'touching' each other to fill another rectangle (or whatever shape), once they are extracted they may not 'fit' together any more. Also, one image may partly cover others on the page, and after extraction each image will be visible in its full size again.

...unless you just do a cropped screenshot of the PDF page.
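To make the scaling point concrete, here is a small sketch of the arithmetic. It assumes the default PDF user space, where one unit is 1/72 inch; the class and method names are made up for illustration.

```csharp
using System;

public static class PaintedResolution
{
    // Default PDF user space: 1 unit (point) = 1/72 inch.
    private const double PointsPerInch = 72.0;

    // Effective resolution of an image along one axis, given its pixel count
    // and the length it occupies on the page in points.
    public static double EffectiveDpi(int pixelCount, double paintedLengthInPoints) =>
        pixelCount / (paintedLengthInPoints / PointsPerInch);
}
```

For example, a 600-pixel-wide image painted 144 points wide (2 inches) ends up at 300 dpi on the page, while the same image painted 288 points wide ends up at 150 dpi. Two images that sit edge to edge on the page can therefore have quite different pixel dimensions once extracted in their original size.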

I think the Docotic.Pdf library might be the right tool for your case.

The library can extract images "as painted", i.e. preserving scaling and rotation. The location and size of each painted image are retrieved as well.

So you could use the library to retrieve the images as painted on a page, then analyze the coordinates of each image to find and group clusters of images.

Below is sample code that shows how to extract images "as painted" together with their coordinates.

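using System;
using BitMiracle.Docotic.Pdf;

// Place this method inside a class of your choice.
public static void extractImagesAndCoordinates(string file)
{
    int imageIndex = 0;
    using (PdfDocument pdf = new PdfDocument(file))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            // Each entry describes one image as it is drawn on this page.
            PdfCollection<PdfPaintedImage> paintedImages = page.GetPaintedImages();
            foreach (PdfPaintedImage image in paintedImages)
            {
                // Position and size on the page; these are the values to compare when grouping.
                Console.Out.WriteLine("Position {0}, {1}. Size {2}x{3}",
                    image.Position.X, image.Position.Y, image.Size.Width, image.Size.Height);

                // Save the image with its on-page scaling and rotation applied.
                string outImagePath = string.Format("image{0}.png", imageIndex++);
                image.SaveAsPainted(outImagePath, PdfExtractedImageFormat.Png);
            }
        }
    }
}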
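The grouping step itself is not part of the library sample above. A minimal sketch of one way to do it follows, assuming each painted image can be reduced to an axis-aligned rectangle. `Box`, `ImageClusters` and the `tolerance` parameter are hypothetical names, not Docotic.Pdf API, and the code needs C# 10 or later because of `record struct`. Rectangles that overlap, touch, or lie within `tolerance` of each other on both axes are merged with a small union-find.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper type: an axis-aligned rectangle in page units.
public readonly record struct Box(double X, double Y, double Width, double Height)
{
    public double Right => X + Width;
    public double Bottom => Y + Height;

    // True when the boxes overlap, touch, or are at most `tolerance` apart on both axes.
    public bool IsNear(Box other, double tolerance) =>
        X - other.Right <= tolerance && other.X - Right <= tolerance &&
        Y - other.Bottom <= tolerance && other.Y - Bottom <= tolerance;

    // Smallest box that contains both a and b.
    public static Box Union(Box a, Box b)
    {
        double x = Math.Min(a.X, b.X);
        double y = Math.Min(a.Y, b.Y);
        return new Box(x, y,
            Math.Max(a.Right, b.Right) - x,
            Math.Max(a.Bottom, b.Bottom) - y);
    }
}

public static class ImageClusters
{
    // Returns clusters as lists of indexes into `boxes`.
    public static List<List<int>> Group(IReadOnlyList<Box> boxes, double tolerance)
    {
        // Union-find: every box starts as the root of its own set.
        int[] parent = Enumerable.Range(0, boxes.Count).ToArray();

        int Find(int i)
        {
            while (parent[i] != i)
            {
                parent[i] = parent[parent[i]]; // path halving
                i = parent[i];
            }
            return i;
        }

        // Compare every pair once and merge the sets of boxes that are near each other.
        for (int i = 0; i < boxes.Count; i++)
            for (int j = i + 1; j < boxes.Count; j++)
                if (boxes[i].IsNear(boxes[j], tolerance))
                    parent[Find(i)] = Find(j);

        return Enumerable.Range(0, boxes.Count)
            .GroupBy(i => Find(i))
            .Select(g => g.ToList())
            .ToList();
    }

    // Bounding box of one cluster, e.g. the region to crop or compose into.
    public static Box BoundingBox(IEnumerable<Box> boxes) =>
        boxes.Aggregate((a, b) => Box.Union(a, b));
}
```

To use it with the sample above, you would build one `Box` per painted image from `image.Position` and `image.Size`, per page, and call `ImageClusters.Group`. Check the library documentation for the units and the direction of the Y axis; if the values are in points rather than pixels, the "1-2 pixels" threshold has to be converted to page units first. Composing each cluster into a single transparent PNG is a separate step that this sketch does not cover.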

Disclaimer: I work for the vendor of the library.


 