Extracting images and their names from a PDF using iTextSharp

Question

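I am using iTextSharp (C#) to extract images and their names from a catalog PDF. I can extract the images, but I am struggling to extract the corresponding image name, as shown in the attached screenshot, and to save each file under that name. Please see the code below and let me know your suggestions. Sample PDF: https://docdro.id/PwBsNR9

Code: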

private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
{
    List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

    iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
    iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
    iTextSharp.text.pdf.PdfObject PDFObj = null;
    iTextSharp.text.pdf.PdfStream PDFStremObj = null;

    try
    {
        RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
        PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);

        for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
        {
            PDFObj = PDFReaderObj.GetPdfObject(i);

            if ((PDFObj != null) && PDFObj.IsStream())
            {
                PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);
                if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                {
                    try
                    {

                        iTextSharp.text.pdf.parser.PdfImageObject PdfImageObj =
                            new iTextSharp.text.pdf.parser.PdfImageObject((iTextSharp.text.pdf.PRStream)PDFStremObj);

                        System.Drawing.Image ImgPDF = PdfImageObj.GetDrawingImage();
                        ImgList.Add(ImgPDF);

                    }
                    catch (Exception)
                    {
                        // Skip images that cannot be converted to a System.Drawing.Image.
                    }
                }
            }
        }
    }
    finally
    {
        // Close the reader even when extraction fails; exceptions propagate with their original stack trace.
        if (PDFReaderObj != null) PDFReaderObj.Close();
    }
    return ImgList;
}

Answer 1

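Unfortunately, the sample PDF is not tagged. You therefore have to associate the title text with the images in some other way, either by analyzing their positions relative to each other or by exploiting a pattern in the content stream.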

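In the case at hand, analyzing the relative positions is feasible, because the title is always the text drawn (at least partially) on the matching image or directly below it. You could therefore extract the text with positions from the page in a first pass and extract the images in a second pass, looking up each title among the previously extracted text on or directly below the image area. Alternatively, you could first extract the images with their positions and sizes and then extract the text in those areas.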
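The matching step of the position-based approach can be sketched without any PDF library. This is only an illustration under stated assumptions: `ImageBox`, `TextChunk`, `TitleMatcher`, and the `maxGapBelow` tolerance are hypothetical names that are not part of iTextSharp, and the sketch simplifies "at least partially on the image" to "the text starts within the image's horizontal extent". With iTextSharp, the coordinates would have to be derived from `ImageRenderInfo.GetImageCTM()` and `TextRenderInfo.GetBaseline()`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical plain-data types (not iTextSharp classes), in PDF user space
// where y grows upward.
public sealed class ImageBox
{
    public float Left, Bottom, Width, Height;
    public float Right { get { return Left + Width; } }
    public float Top { get { return Bottom + Height; } }
}

public sealed class TextChunk
{
    public string Text;
    public float X, Y; // start point of the text baseline
}

public static class TitleMatcher
{
    // Returns the text of the chunk that starts within the image's horizontal
    // extent and lies on the image or at most maxGapBelow units under its
    // bottom edge. If several chunks qualify, the one closest to the bottom
    // edge wins. Returns null when no chunk qualifies.
    public static string FindTitle(ImageBox image, IEnumerable<TextChunk> chunks, float maxGapBelow)
    {
        TextChunk best = null;
        float bestDistance = float.MaxValue;
        foreach (TextChunk chunk in chunks)
        {
            bool horizontallyInside = chunk.X >= image.Left && chunk.X <= image.Right;
            bool verticallyNear = chunk.Y <= image.Top && chunk.Y >= image.Bottom - maxGapBelow;
            if (!horizontallyInside || !verticallyNear) continue;

            float distance = Math.Abs(chunk.Y - image.Bottom);
            if (distance < bestDistance)
            {
                best = chunk;
                bestDistance = distance;
            }
        }
        return best == null ? null : best.Text;
    }
}
```

A suitable `maxGapBelow` depends on the page layout and would need to be tuned against the actual document.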

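But there is also a pattern in the content stream: the title is always drawn in a single text-drawing instruction right after the corresponding image is drawn. So you can also extract the images in a single pass and take the next drawn text as the associated title.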

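Both approaches can be implemented with the iText parser API. For the latter approach, for example, as follows: first, implement a render listener that behaves as described, i.e. one that saves each image and the text that follows it: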

// Requires: using System; using System.IO; using iTextSharp.text.pdf.parser;
internal class ImageWithTitleRenderListener : IRenderListener
{
    int imageNumber = 0;
    String format;
    bool expectingTitle = false;

    public ImageWithTitleRenderListener(String format)
    {
        this.format = format;
    }

    public void BeginTextBlock()
    { }

    public void EndTextBlock()
    { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        if (expectingTitle)
        {
            expectingTitle = false;
            File.WriteAllText(string.Format(format, imageNumber, "txt"), renderInfo.GetText());
        }
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        imageNumber++;
        expectingTitle = true;

        PdfImageObject imageObject = renderInfo.GetImage();

        if (imageObject == null)
        {
            Console.WriteLine("Image {0} could not be read.", imageNumber);
        }
        else
        {
            File.WriteAllBytes(string.Format(format, imageNumber, imageObject.GetFileType()), imageObject.GetImageAsBytes());
        }
    }
}

Then parse the document pages using that render listener:

using (PdfReader reader = new PdfReader(@"EVERMOTION ARCHMODELS VOL.78.pdf"))
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    ImageWithTitleRenderListener listener = new ImageWithTitleRenderListener(@"EVERMOTION ARCHMODELS VOL.78-{0:D3}.{1}");
    for (var i = 1; i <= reader.NumberOfPages; i++)
    {
        parser.ProcessContent(i, listener);
    }
}
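The listener above writes each title to a separate text file. Since the question asks to save each image under its title, the title first has to be turned into a valid file name. The following is a minimal sketch of that step only; `TitleFileNames`, `FromTitle`, and the `fallbackName` parameter are hypothetical names introduced for illustration, and the choice of `_` as replacement character is an assumption.

```csharp
using System;
using System.IO;
using System.Text;

public static class TitleFileNames
{
    // Builds a file name from an extracted title: surrounding whitespace is
    // trimmed, characters that are invalid in file names on the current
    // platform are replaced with '_', and an empty title falls back to
    // fallbackName. The extension is appended after a dot.
    public static string FromTitle(string title, string extension, string fallbackName)
    {
        string trimmed = (title ?? string.Empty).Trim();
        if (trimmed.Length == 0) trimmed = fallbackName;

        char[] invalid = Path.GetInvalidFileNameChars();
        StringBuilder name = new StringBuilder(trimmed.Length);
        foreach (char c in trimmed)
        {
            name.Append(Array.IndexOf(invalid, c) >= 0 ? '_' : c);
        }
        return name.ToString() + "." + extension;
    }
}
```

One way to use it would be to keep the image bytes and the result of `GetFileType()` in fields inside `RenderImage`, and to write them in `RenderText` with `File.WriteAllBytes(TitleFileNames.FromTitle(renderInfo.GetText(), fileType, fallback), bytes)`. Note that two images with the same title would then overwrite each other unless the image number is also included in the name.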

Answer 2

I hope this helps. I am working on something similar, so this may be useful to you.

// existing pdf path
PdfReader reader = new PdfReader(path);
PRStream pst;
PdfImageObject pio;
PdfObject po;
// number of objects in pdf document
int n = reader.XrefSize;
//FileStream fs = null;
// set image file location
//String path = "E:/";
for (int i = 0; i < n; i++)
{
    // get the object at the index i in the objects collection
    po = reader.GetPdfObject(i);
    // object not found so continue
    if (po == null || !po.IsStream())
        continue;
    //cast object to stream
    pst = (PRStream)po;
    //get the object type
    PdfObject type = pst.Get(PdfName.SUBTYPE);
    //check if the object is the image type object
    if (type != null && type.ToString().Equals(PdfName.IMAGE.ToString()))
    {
        //get the image
        pio = new PdfImageObject(pst);
        //read bytes of image in to an array
        byte[] imgdata = pio.GetImageAsBytes();
        // wrap the bytes in a stream and pass it to SaveStreamToFile (defined below);
        // a MemoryStream is never a FileStream, so it carries no file name of its own
        using (Stream stream = new MemoryStream(imgdata))
        {
            SaveStreamToFile("image" + i + "." + pio.GetFileType(), stream);
        }
    }
}

Now you can save your stream.

public void SaveStreamToFile(string fileFullPath, Stream stream)
{
    if (stream.Length == 0) return;

    // Create the target file and copy the whole stream into it;
    // CopyTo keeps reading until the end of the stream, unlike a single Read call
    using (FileStream fileStream = System.IO.File.Create(fileFullPath))
    {
        stream.CopyTo(fileStream);
    }
}

Solution 1
2 2019-03-19 08:52:42

Solution 2
0 2019-03-16 20:11:52