How to extract images from a scanned PDF

Question

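I use Tesseract to extract text from scanned PDFs. Some of these files also contain images. Is there a way to get those images?

I prepare my scanned PDFs for Tesseract by converting them to TIFF files. But I can't find any command-line tool to extract images from them, the way pdfimages does for "text" PDFs.

Any ideas for a tool (or combination of tools) that could help me get the job done?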

Answer 1

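You won't be able to use Tesseract OCR to get the images, because that is not what it was designed for. It is better to extract the images with a separate tool beforehand and then use Tesseract to get the text.

You could use pdfimages from Xpdf.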

http://www.xpdfreader.com/pdfimages-man.html

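You need to download R, RStudio, XpdfReader and pdftools to do this. Make sure the program folders are on your PATH (under "Environment Variables" if you are on Windows) so that R can find these programs.

Then do something like this to convert the files. For help with pdfimages, see the options in the documentation. This is how the syntax goes (especially the part after paste0). Note where the options go: they must come before the input file name: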

  # PDF to PPM: `dest` is the folder that holds the PDFs
  files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE))
  lapply(files, function(i) {
    # Options first, then the input file, then the output root
    shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 300 ", i, ".pdf", " ", i)))
  })

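You can also just open a command prompt and type the following (the last argument is the root of the output file names, so it does not need an extension):

pdftoppm -f 1 -l 10 -r 300 stuff.pdf stuff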
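If you would rather not install R just for the loop, the same thing can be sketched in Python. This is only a rough equivalent of the snippet above, assuming pdftoppm is on your PATH; the function names pdftoppm_command and render_folder are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, first_page=1, last_page=10, dpi=300):
    """Build the pdftoppm argument list; options come before the input file."""
    pdf_path = Path(pdf_path)
    # Output root: the PDF path without its extension (scans/stuff.pdf -> scans/stuff)
    output_root = pdf_path.with_suffix("")
    return [
        "pdftoppm",
        "-f", str(first_page),
        "-l", str(last_page),
        "-r", str(dpi),
        str(pdf_path),
        str(output_root),
    ]


def render_folder(folder):
    """Run pdftoppm on every PDF in `folder` and return the commands that were run."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm was not found on PATH")
    commands = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        command = pdftoppm_command(pdf_path)
        # check=True raises CalledProcessError if pdftoppm exits with an error
        subprocess.run(command, check=True)
        commands.append(command)
    return commands
```

Calling render_folder("scans") would then render the first ten pages of each PDF in a folder named scans (a placeholder name) at 300 dpi.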

Answer 2

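In many cases, when someone has a PDF and wants to "get" the images, rendering the page itself as an image is good enough. However, if you really do want to extract the images, you need to be careful about which tool you use and look into its reputation and the quality of its output.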

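The first important thing to realize is that if a tool claims to "extract TIFF from PDF" or "extract JPG from PDF", it is misleading you, because PDFs do not contain JPEG or TIFF images. The confusion arises because PDF compresses image data with the same compression techniques that those two raster formats can use, but that is not the same thing as a JPG file simply "living" inside a PDF.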
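To see this for yourself, you can look at which stream filters a PDF declares. The sketch below is a deliberately naive byte scan, not a PDF parser: it counts the names that follow /Filter anywhere in the file, so it counts all streams (not only images) and will miss dictionaries stored inside compressed object streams. The function name count_filters is made up for this example.

```python
import re
from collections import Counter

# Matches "/Filter /Name" as well as "/Filter [/Name1 /Name2]"
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/\w+)")
NAME_RE = re.compile(rb"/(\w+)")


def count_filters(pdf_bytes):
    """Count the filter names declared after /Filter in the raw bytes of a PDF."""
    counts = Counter()
    for match in FILTER_RE.finditer(pdf_bytes):
        for name in NAME_RE.findall(match.group(1)):
            counts[name.decode("ascii")] += 1
    return counts


# Usage (the file name is a placeholder):
#   from pathlib import Path
#   print(count_filters(Path("mydoc.pdf").read_bytes()))
```

DCTDecode is the compression JPEG uses, CCITTFaxDecode is the fax compression that TIFF can use, and FlateDecode is zlib/deflate. What is stored is the compressed image data plus a PDF dictionary describing it, not a standalone image file.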

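There are a lot of tools out there, but you will find that quality varies widely. Some handle simple PDFs well but have size limits, or crash or hang on complex PDFs. Some handle RGB data well but skip or mishandle other color spaces. Some don't give you fine control over the data and just extract everything and recompress it as JPEG. On top of that, image data is often corrupted in one way or another, and the technique you use must handle those cases gracefully.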
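Whatever tool you end up with, one way to cope with corrupted data in a batch is to isolate failures per file, so a single bad image does not abort the whole run. This is only a minimal sketch of that pattern; process_all and the handler argument are hypothetical names, and the handler stands in for whatever extraction step you actually use.

```python
def process_all(paths, handler):
    """Call handler(path) for each path and collect failures instead of stopping."""
    succeeded = []
    failed = []
    for path in paths:
        try:
            handler(path)
        except Exception as error:  # deliberately broad: any failure is recorded
            failed.append((path, str(error)))
        else:
            succeeded.append(path)
    return succeeded, failed
```

Afterwards you can log or retry the entries in the failed list separately.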

If you plan to deploy this as part of an enterprise solution, you need a tool that can handle almost any PDF found in the wild.

Answer 3

1. Extract the images with pdfimages (the second argument is the prefix for the output file names):

pdfimages mydoc.pdf images

2. Use the following extraction script:

./extractImages.py images*

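Find your cropped images in the new images folder. Check the tracing folder to see what was done and to make sure no images were missed.

How it works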

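It processes all the images and looks for shapes in them. If it finds a shape that is larger than a configurable size, it takes the shape's bounding box, crops the image to it and saves the result as a new image. It also creates a folder named tracing with copies of the images that show all the bounding boxes.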

If you want to find smaller images, just decrease minimumWidth and minimumHeight. If you set them too low, though, it will find every single character.
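To make the effect of the two thresholds concrete, here is the size test in isolation, applied to a few made-up bounding boxes. The box values are invented for illustration and filter_boxes is a hypothetical helper; the comparison is the same one the script uses.

```python
def filter_boxes(boxes, minimum_width=100, minimum_height=100):
    """Keep only the (x, y, w, h) boxes that meet both minimum dimensions."""
    return [
        (x, y, w, h)
        for (x, y, w, h) in boxes
        if w >= minimum_width and h >= minimum_height
    ]


# Hypothetical bounding boxes as (x, y, width, height)
boxes = [
    (40, 60, 620, 410),   # a photo-sized region
    (50, 500, 12, 18),    # roughly one character
    (70, 500, 11, 18),    # another character
    (100, 700, 300, 90),  # wide, but shorter than the default minimum height
]

print(filter_boxes(boxes))          # only the photo-sized region passes
print(filter_boxes(boxes, 10, 10))  # the character-sized boxes pass as well
```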

In my tests it worked very well; it just found a few too many images.

extractImages.py

#!/usr/bin/env python3

import cv2
import numpy as np
import os
from pathlib import Path

def extractImagesFromFile(inputFilename, outputDirectory, tracing=False, tracingDirectory=""):
    
    # Settings:
    minimumWidth = 100
    minimumHeight = 100
    greenColor = (36, 255, 12)
    traceWidth = 2
    
    # Load image, grayscale, Otsu's threshold
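    image = cv2.imread(inputFilename)
    if image is None:
        # cv2.imread returns None when the file cannot be read or decoded
        print("Could not read image: {}".format(inputFilename))
        return
    original = image.copy()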
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # Find contours, obtain bounding box, extract and save ROI
    ROI_number = 1
    cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    for c in cnts:
        x, y, w, h = cv2.boundingRect(c)
        if w >= minimumWidth and h >= minimumHeight:
            cv2.rectangle(image, (x, y), (x + w, y + h), greenColor, traceWidth)
            ROI = original[y:y+h, x:x+w]
            outImage = os.path.join(outputDirectory, '{}_{}.png'.format(Path(inputFilename).stem, ROI_number))
            cv2.imwrite(outImage, ROI)
            ROI_number += 1
    if tracing:
        outImage = os.path.join(tracingDirectory, Path(inputFilename).stem + '_trace.png')
        cv2.imwrite(outImage, image)

def main(files):

    tracingEnabled = True
    outputDirectory = 'images'
    tracingDirectory = 'tracing'

    # Create the output directory if it does not exist
    outputPath = Path.cwd() / outputDirectory
    outputPath.mkdir(exist_ok=True)

    if tracingEnabled:
        tracingPath = Path.cwd() / tracingDirectory
        tracingPath.mkdir(exist_ok=True)

    for f in files:
        print("Processing {}".format(f))
        if Path(f).is_file():
            extractImagesFromFile(f, outputDirectory, tracingEnabled, tracingDirectory)
        else:
            print("Invalid file: {}".format(f))

if __name__ == "__main__":
    import argparse
    from glob import glob
    parser = argparse.ArgumentParser()  
    parser.add_argument("fileNames", nargs='*') 
    args = parser.parse_args()  
    fileNames = list()  
    for arg in args.fileNames:  
        fileNames += glob(arg)  
    main(fileNames)

Credits

nathancy provided the basic algorithm as an answer to this question:

Extract all bounding boxes using OpenCV Python

3 answers

Answer 1: 3
Answer 2: 1, 2017-11-16 23:36:16
Answer 3: 0, 2020-10-11 23:38:58
