How to extract FlateDecoded Images from PDF with PDFSharp

How can I extract images that are FlateDecoded (such as PNG) from a PDF document with PDFSharp?

I found this comment in a PDFSharp sample:

// TODO: You can put the code here that converts vom PDF internal image format to a
// Windows bitmap
// and use GDI+ to save it in PNG format.
// [...]
// Take a look at the file
// PdfSharp.Pdf.Advanced/PdfImage.cs to see how we create the PDF image formats.

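Does anyone have a solution for this problem?

Thanks for your replies.

EDIT: Because I can't answer my own question within 8 hours, I'm doing it this way:

Thanks for your very fast reply.

I added some code to the method "ExportAsPngImage", but I didn't get the desired results. It just extracts a few more images (PNG), and they don't have the right colors and are distorted.

Here's my actual code: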

PdfSharp.Pdf.Filters.FlateDecode flate = new PdfSharp.Pdf.Filters.FlateDecode();
        byte[] decodedBytes = flate.Decode(bytes);

        System.Drawing.Imaging.PixelFormat pixelFormat;

        switch (bitsPerComponent)
        {
            case 1:
                pixelFormat = PixelFormat.Format1bppIndexed;
                break;
            case 8:
                pixelFormat = PixelFormat.Format8bppIndexed;
                break;
            case 24:
                pixelFormat = PixelFormat.Format24bppRgb;
                break;
            default:
                throw new Exception("Unknown pixel format " + bitsPerComponent);
        }

        Bitmap bmp = new Bitmap(width, height, pixelFormat);
        var bmpData = bmp.LockBits(new Rectangle(0, 0, width, height), ImageLockMode.WriteOnly, pixelFormat);
        int length = (int)Math.Ceiling(width * bitsPerComponent / 8.0);
        for (int i = 0; i < height; i++)
        {
            int offset = i * length;
            int scanOffset = i * bmpData.Stride;
            Marshal.Copy(decodedBytes, offset, new IntPtr(bmpData.Scan0.ToInt32() + scanOffset), length);
        }
        bmp.UnlockBits(bmpData);
        using (FileStream fs = new FileStream(@"C:\Export\PdfSharp\" + String.Format("Image{0}.png", count), FileMode.Create, FileAccess.Write))
        {
            bmp.Save(fs, System.Drawing.Imaging.ImageFormat.Png);
        }

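Is that the right way? Or should I choose another way? Thanks a lot!

I know this answer might come a few years late, but maybe it will help others.

In my case the distortion occurs because image.Elements.GetInteger(PdfImage.Keys.BitsPerComponent) does not seem to return the correct value. As Vive la déraison pointed out under your question, you get BGR format when using Marshal.Copy. So reversing the bytes and rotating the Bitmap after the Marshal.Copy will do the job.

The resulting code looks like this: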

private static void ExportAsPngImage(PdfDictionary image, ref int count)
    {
        int width = image.Elements.GetInteger(PdfImage.Keys.Width);
        int height = image.Elements.GetInteger(PdfImage.Keys.Height);

        var canUnfilter = image.Stream.TryUnfilter();
        byte[] decodedBytes;

        if (canUnfilter)
        {
            decodedBytes = image.Stream.Value;
        }
        else
        {
            PdfSharp.Pdf.Filters.FlateDecode flate = new PdfSharp.Pdf.Filters.FlateDecode();
            decodedBytes = flate.Decode(image.Stream.Value);
        }

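        // Infer the bits per pixel from the decoded length. Despite its name, this
        // variable holds bits per pixel, not the PDF /BitsPerComponent value.
        // The upper bound stops the loop when no supported size matches.
        int bitsPerComponent = 0;
        while (bitsPerComponent <= 64 && decodedBytes.Length - ((width * height) * bitsPerComponent / 8) != 0)
        {
            bitsPerComponent++;
        }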

        System.Drawing.Imaging.PixelFormat pixelFormat;
        switch (bitsPerComponent)
        {
            case 1:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format1bppIndexed;
                break;
            case 8:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format8bppIndexed;
                break;
            case 16:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format16bppArgb1555;
                break;
            case 24:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
                break;
            case 32:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format32bppArgb;
                break;
            case 64:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format64bppArgb;
                break;
            default:
                throw new Exception("Unknown pixel format " + bitsPerComponent);
        }

        decodedBytes = decodedBytes.Reverse().ToArray();

        Bitmap bmp = new Bitmap(width, height, pixelFormat);
        BitmapData bmpData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.WriteOnly, bmp.PixelFormat);
        int length = (int)Math.Ceiling(width * (bitsPerComponent / 8.0));
        for (int i = 0; i < height; i++)
        {
            int offset = i * length;
            int scanOffset = i * bmpData.Stride;
            // IntPtr.Add avoids the overflow that Scan0.ToInt32() can cause in a 64-bit process.
            Marshal.Copy(decodedBytes, offset, IntPtr.Add(bmpData.Scan0, scanOffset), length);
        }
        bmp.UnlockBits(bmpData);
        bmp.RotateFlip(RotateFlipType.Rotate180FlipNone);
        bmp.Save(String.Format("exported_Images\\Image{0}.png", count++), System.Drawing.Imaging.ImageFormat.Png);
    }

The code may need some optimization, but in my case it did export FlateDecoded images correctly.
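The reverse-then-rotate approach works because reversing the whole buffer flips both the pixel order and the byte order inside each pixel, and the 180° rotation then restores the pixel order. A possible alternative, sketched here under the assumption of tightly packed 3-byte pixels at 8 bits per component, is to swap only the red and blue bytes. The class and method names are hypothetical and not part of PDFSharp.

```csharp
using System;

public static class ChannelSwap
{
    // Swaps the first and third byte of every 3-byte pixel (RGB <-> BGR).
    // Assumes tightly packed 24-bit pixels with no row padding.
    public static void SwapRedBlue(byte[] pixels)
    {
        if (pixels.Length % 3 != 0)
            throw new ArgumentException("Length must be a multiple of 3.", nameof(pixels));

        for (int i = 0; i < pixels.Length; i += 3)
        {
            byte tmp = pixels[i];
            pixels[i] = pixels[i + 2];
            pixels[i + 2] = tmp;
        }
    }
}
```

With this, the buffer could be swapped once before the row-by-row Marshal.Copy, and the RotateFlip call would no longer be needed.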

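To get a Windows BMP, you just have to create a bitmap header and then copy the image data into the bitmap. PDF images are byte-aligned (every new line starts on a byte boundary), while Windows BMPs are DWORD-aligned (every new line starts on a DWORD boundary; for historical reasons, a DWORD is 4 bytes). All the information you need for the bitmap header can be found in the filter parameters or can be calculated.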
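The alignment difference can be sketched as follows. This is a minimal illustration, assuming the decoded PDF data is already available as a byte array; the class and method names are hypothetical.

```csharp
using System;

public static class RowAlignment
{
    // Bytes per row in PDF image data: each row starts on a byte boundary.
    public static int PdfRowBytes(int width, int bitsPerPixel)
        => (width * bitsPerPixel + 7) / 8;

    // Bytes per row in a Windows bitmap: each row starts on a 4-byte (DWORD) boundary.
    public static int BmpStride(int width, int bitsPerPixel)
        => ((width * bitsPerPixel + 31) / 32) * 4;

    // Copies byte-aligned rows into a DWORD-aligned buffer; padding bytes stay zero.
    public static byte[] PadRows(byte[] pdfData, int width, int height, int bitsPerPixel)
    {
        int srcRow = PdfRowBytes(width, bitsPerPixel);
        int dstRow = BmpStride(width, bitsPerPixel);
        var result = new byte[dstRow * height];
        for (int y = 0; y < height; y++)
            Buffer.BlockCopy(pdfData, y * srcRow, result, y * dstRow, srcRow);
        return result;
    }
}
```

For a 3-pixel-wide 24 bpp image, a PDF row is 9 bytes while a bitmap row is 12, so copying the data in one block would shift every row after the first.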

The color palette is another FlateEncoded object in the PDF. You copy that into the BMP as well.

This must be done for several formats (1 bit per pixel, 8 bpp, 24 bpp, 32 bpp).
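When choosing among these formats, note that the PDF /BitsPerComponent entry counts bits per color component, not per pixel, so an RGB image with 8 bits per component is a 24 bpp image. The sketch below derives bits per pixel for the basic device color spaces. It assumes the color space name has already been read as a string with a leading slash, and the class and method names are hypothetical.

```csharp
using System;

public static class PixelLayout
{
    // Number of components per pixel for the basic PDF device color spaces.
    // An /Indexed image stores a single palette index per pixel.
    public static int ComponentsFor(string colorSpaceName)
    {
        switch (colorSpaceName)
        {
            case "/DeviceGray": return 1;
            case "/Indexed": return 1;
            case "/DeviceRGB": return 3;
            case "/DeviceCMYK": return 4;
            default:
                throw new NotSupportedException(colorSpaceName + " is not handled in this sketch");
        }
    }

    // Bits per pixel = components per pixel * bits per component.
    public static int BitsPerPixel(string colorSpaceName, int bitsPerComponent)
        => ComponentsFor(colorSpaceName) * bitsPerComponent;
}
```

Other color spaces (for example ICC-based ones) would need their own handling and are deliberately left out here.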

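Here is my full code for doing this.

I'm extracting a UPS shipping label from a PDF, so I know the format in advance. If your extracted image is of an unknown type, you'll need to check bitsPerComponent and handle it accordingly. I also only handle the first image on the first page.

Note: I'm using TryUnfilter to "deflate", which applies whatever filter was used and decodes the data in place for me. There is no need to call "Deflate" explicitly.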

    var file = @"c:\temp\PackageLabels.pdf";

    var doc = PdfReader.Open(file);
    var page = doc.Pages[0];

    {
        // Get resources dictionary
        PdfDictionary resources = page.Elements.GetDictionary("/Resources");
        if (resources != null)
        {
            // Get external objects dictionary
            PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
            if (xObjects != null)
            {
                ICollection<PdfItem> items = xObjects.Elements.Values;

                // Iterate references to external objects
                foreach (PdfItem item in items)
                {
                    PdfReference reference = item as PdfReference;
                    if (reference != null)
                    {
                        PdfDictionary xObject = reference.Value as PdfDictionary;
                        // Is external object an image?
                        if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                        {
                            // do something with your image here
                            // only the first image is handled here
                            var bitmap = ExportImage(xObject);
                            bitmap.Save(@"c:\temp\exported.png", System.Drawing.Imaging.ImageFormat.Png);
                        }
                    }
                }
            }
        }
    }

Using these helper functions:

    private static Bitmap ExportImage(PdfDictionary image)
    {
        string filter = image.Elements.GetName("/Filter");
        switch (filter)
        {
            case "/FlateDecode":
                return ExportAsPngImage(image);

            default:
                throw new ApplicationException(filter + " filter not implemented");
        }
    }

    private static Bitmap ExportAsPngImage(PdfDictionary image)
    {
        int width = image.Elements.GetInteger(PdfImage.Keys.Width);
        int height = image.Elements.GetInteger(PdfImage.Keys.Height);
        int bitsPerComponent = image.Elements.GetInteger(PdfImage.Keys.BitsPerComponent);   

        var canUnfilter = image.Stream.TryUnfilter();
        var decoded = image.Stream.Value;

        Bitmap bmp = new Bitmap(width, height, System.Drawing.Imaging.PixelFormat.Format8bppIndexed);
        BitmapData bmpData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.WriteOnly, bmp.PixelFormat);
        Marshal.Copy(decoded, 0, bmpData.Scan0, decoded.Length);
        bmp.UnlockBits(bmpData);

        return bmp;
    }

So far... my code... It works with many PNG files, but not with files from Adobe Photoshop that use an indexed color space:

    private bool ExportAsPngImage(PdfDictionary image, string SaveAsName)
        {
            int width = image.Elements.GetInteger(PdfSharp.Pdf.Advanced.PdfImage.Keys.Width);
            int height = image.Elements.GetInteger(PdfSharp.Pdf.Advanced.PdfImage.Keys.Height);
            int bitsPerComponent = image.Elements.GetInteger(PdfSharp.Pdf.Advanced.PdfImage.Keys.BitsPerComponent);
            var ColorSpace = image.Elements.GetArray(PdfImage.Keys.ColorSpace);
            System.Drawing.Imaging.PixelFormat pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb; // 24 just to initialize

            if (ColorSpace is null) //no colorspace.. bufferedimage?? is in BGR order instead of RGB so change the byte order. Right now it works
            {
                byte[] origineel_byte_boundary = image.Stream.UnfilteredValue;
                bitsPerComponent = (origineel_byte_boundary.Length) / (width * height);
                switch (bitsPerComponent)
                {
                    case 4:
                        pixelFormat = System.Drawing.Imaging.PixelFormat.Format32bppPArgb;
                        break;
                    case 3:
                        pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
                        break;
                    default:
                        {
                            MessageBox.Show("Unknown pixel format " + bitsPerComponent, "Error", MessageBoxButtons.OK, MessageBoxIcon.Warning);
                            return false;
                        }
                        break;
                }
                Bitmap bmp = new Bitmap(width, height, pixelFormat); //copy raw bytes to "master" bitmap so we are out of pdf format to work with 
                System.Drawing.Imaging.BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, width, height), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat);
                System.Runtime.InteropServices.Marshal.Copy(origineel_byte_boundary, 0, bmd.Scan0, origineel_byte_boundary.Length);
                bmp.UnlockBits(bmd);
                Bitmap bmp2 = new Bitmap(width, height, pixelFormat);
                for (int indicex = 0; indicex < bmp.Width; indicex++)
                {
                    for (int indicey = 0; indicey < bmp.Height; indicey++)
                    {
                        Color nuevocolor = bmp.GetPixel(indicex, indicey);
                        Color colorintercambiado = Color.FromArgb(nuevocolor.A, nuevocolor.B, nuevocolor.G, nuevocolor.R);
                        bmp2.SetPixel(indicex, indicey, colorintercambiado);
                    }
                }
                using (FileStream fs = new FileStream(SaveAsName, FileMode.Create, FileAccess.Write))
                {
                    bmp2.Save(fs, System.Drawing.Imaging.ImageFormat.Png);
                }
                bmp2.Dispose();
                bmp.Dispose();
            }
            else
            {
                // This is the Photoshop case... work needs to be done here. I'm able to get the color palette, but have no idea how to put it back or create the PNG file...
                switch (bitsPerComponent)
                {
                    case 4:
                        pixelFormat = System.Drawing.Imaging.PixelFormat.Format32bppArgb;
                        break;
                    default:
                        {
                            MessageBox.Show("Unknown pixel format " + bitsPerComponent, "Error", MessageBoxButtons.OK, MessageBoxIcon.Warning);
                            return false;
                        }
                        break;
                }
                if ((ColorSpace.Elements.GetName(0) == "/Indexed") && (ColorSpace.Elements.GetName(1) == "/DeviceRGB"))
                {
                    //we need to create the palette
                    int paletteColorCount = ColorSpace.Elements.GetInteger(2);
                    List<System.Drawing.Color> paletteList = new List<Color>();
                    //Color[] palette = new Color[paletteColorCount+1]; // no idea why but it seams that there´s always 1 color more. ¿transparency?
                    PdfObject paletteObj = ColorSpace.Elements.GetObject(3);
                    PdfDictionary paletteReference = (PdfDictionary)paletteObj;
                    byte[] palettevalues = paletteReference.Stream.Value;
                    for (int index = 0; index < (paletteColorCount + 1); index++)
                    {
                        //palette[index] = Color.FromArgb(1, palettevalues[(index*3)], palettevalues[(index*3)+1], palettevalues[(index*3)+2]); // RGB
                        // Alpha 255 makes the palette entry opaque; the original value 1 is almost fully transparent.
                        paletteList.Add(Color.FromArgb(255, palettevalues[(index * 3)], palettevalues[(index * 3) + 1], palettevalues[(index * 3) + 2])); // RGB
                    }                  
                }
            }
            return true;
        }
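One way to finish the indexed case might be to expand every palette index into an RGB triple and then treat the result as an ordinary 24 bpp image. In an /Indexed color space array, the third element is the highest valid index, so the palette holds that value plus one entries, which would explain the extra color noted above. The sketch below assumes the index data and the palette bytes have already been unfiltered; the class and method names are hypothetical.

```csharp
using System;

public static class IndexedImage
{
    // Expands packed palette indices into tightly packed 24-bit RGB.
    // indices: byte-aligned rows of packed indices (bitsPerIndex = 1, 2, 4 or 8).
    // palette: RGB triples from the lookup table of an /Indexed /DeviceRGB color space.
    public static byte[] ExpandToRgb(byte[] indices, byte[] palette, int width, int height, int bitsPerIndex)
    {
        if (bitsPerIndex != 1 && bitsPerIndex != 2 && bitsPerIndex != 4 && bitsPerIndex != 8)
            throw new ArgumentException("Unsupported index size.", nameof(bitsPerIndex));

        int rowBytes = (width * bitsPerIndex + 7) / 8;
        int mask = (1 << bitsPerIndex) - 1;
        var rgb = new byte[width * height * 3];

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int bitPos = x * bitsPerIndex;
                int packed = indices[y * rowBytes + bitPos / 8];
                // Samples are packed starting at the high-order bits of each byte.
                int shift = 8 - bitsPerIndex - (bitPos % 8);
                int index = (packed >> shift) & mask;

                int dst = (y * width + x) * 3;
                rgb[dst] = palette[index * 3];
                rgb[dst + 1] = palette[index * 3 + 1];
                rgb[dst + 2] = palette[index * 3 + 2];
            }
        }
        return rgb;
    }
}
```

The output is in R, G, B byte order with no row padding. A GDI+ Format24bppRgb bitmap stores its bytes as B, G, R with DWORD-aligned rows, so the channels would still need swapping and the rows padding before a Marshal.Copy.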

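A PDF may contain images with masks and with different color space options, which is why simply decoding an image object may not work properly in some cases.

So the code also needs to check for image masks (/ImageMask) and other properties of image objects (to see whether the image should also use inverted colors or indexed colors) in order to recreate the image as it is displayed in the PDF. See the image object, /ImageMask and /Decode dictionaries in the official PDF Reference.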
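As an illustration of the inverted-colors case: for a single-component image (grayscale or a 1-bit mask), a /Decode array of [1 0] reverses the meaning of the samples compared with the usual [0 1]. The following is a minimal sketch, assuming the /Decode values have already been read from the image dictionary into a double array. The class and method names are hypothetical, and multi-component or indexed images are not covered.

```csharp
using System;

public static class DecodeArray
{
    // Returns true when a two-element /Decode array maps samples in reverse, e.g. [1 0].
    public static bool IsInverted(double[] decode)
        => decode != null && decode.Length == 2 && decode[0] > decode[1];

    // Inverts every sample of a single-component image in place.
    // Flipping all bits inverts 1-, 2-, 4- and 8-bit samples alike.
    public static void InvertSamples(byte[] data)
    {
        for (int i = 0; i < data.Length; i++)
            data[i] = (byte)(data[i] ^ 0xFF);
    }
}
```

A caller could check IsInverted after unfiltering the stream and call InvertSamples before copying the data into a bitmap.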

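I'm not sure whether PDFSharp is able to find image mask objects inside a PDF, but iTextSharp is able to access image mask objects (see the PdfName.MASK object type).

Commercial tools like PDF Extractor SDK are able to extract images both in their original form and in "as rendered" form.

I work for ByteScout, maker of PDF Extractor SDK.

Maybe this doesn't answer the question directly, but another option for extracting images from a PDF is FreeSpire.PDF, which can extract images from a PDF easily. It is available as a NuGet package: https://www.nuget.org/packages/FreeSpire.PDF/. It handles all the image formats and can export them as PNG. Their sample code is: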

using System;
using System.Collections.Generic;
using System.Text;
using System.Drawing;
using Spire.Pdf;

namespace ExtractImagesFromPDF
{
    class Program
    {
        static void Main(string[] args)
        {
            //Instantiate an object of Spire.Pdf.PdfDocument
            PdfDocument doc = new PdfDocument();
            //Load a PDF file 
            doc.LoadFromFile("sample.pdf");
            List<Image> ListImage = new List<Image>();
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                // Get an object of Spire.Pdf.PdfPageBase
                PdfPageBase page = doc.Pages[i];
                // Extract images from Spire.Pdf.PdfPageBase
                Image[] images = page.ExtractImages();
                if (images != null && images.Length > 0)
                {
                    ListImage.AddRange(images);
                }

            }
            if (ListImage.Count > 0)
            {
                for (int i = 0; i < ListImage.Count; i++)
                {
                    Image image = ListImage[i];
                    image.Save("image" + (i + 1).ToString() + ".png", System.Drawing.Imaging.ImageFormat.Png);
                }
                System.Diagnostics.Process.Start("image1.png");
            }  
        }
    }
}

(Code taken from https://www.e-iceblue.com/Tutorials/Spire.PDF/Spire.PDF-Program-Guide/How-to-Extract-Image-From-PDF-in-C.html)
