How to extract images from a scanned PDF

I use Tesseract to extract text from scanned PDFs. Some of these files also contain images. Is there a way to get those images?

I prepare my scanned PDFs for Tesseract by converting them to TIFF files. But I can't find any command-line tool to extract images from them, as pdfimages would do for a "text" PDF.

Any idea of a tool (or a combination of tools) that would help me do the job?

You won't be able to use Tesseract OCR to extract the images, as that's not what it was designed to do. It is best to use a tool to extract the images beforehand, and then get the text later using Tesseract.

You may get some use out of pdfimages, from Xpdf.

http://www.xpdfreader.com/pdfimages-man.html

You will need to download R, RStudio, xPDFreader, and PDFtools to accomplish this. If you are on Windows, make sure the directories containing the programs are listed in the PATH entry under "Environment Variables" so that R can find them.
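One quick way to confirm that the command-line tools are reachable is to look them up on the search path. The sketch below does this in Python rather than R, purely as an illustration. The helper name `find_missing_tools` is hypothetical, and the tool names are the ones mentioned in this answer.

```python
import shutil

def find_missing_tools(tool_names):
    """Return the names from tool_names that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found
    return [name for name in tool_names if shutil.which(name) is None]

if __name__ == "__main__":
    missing = find_missing_tools(["pdfimages", "pdftoppm"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All tools found.")
```

If a tool is reported as missing, add its installation directory to PATH and open a new session before trying again.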

Then do something like this to convert the files. See the pdfimages documentation for help on its options; the example below calls pdftoppm, which renders whole pages, and shows how the command is put together (specifically the part after paste0). Note the placement of the options: they have to come before the input file name:

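    # PDF to PPM: render pages 1-10 of every PDF in `dest` at 300 dpi.
    # `dest` must already hold the path of the directory containing the PDFs.
    files <- tools::file_path_sans_ext(
      list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE))
    lapply(files, function(i) {
      # Options come first, then the input file, then the output root name.
      # shell() is Windows-only; each path is quoted so that spaces survive.
      shell(paste0("pdftoppm -f 1 -l 10 -r 300 ", shQuote(paste0(i, ".pdf")), " ", shQuote(i)))
    })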

You could also just open a CMD prompt and type

pdftoppm -f 1 -l 10 -r 300 stuff.pdf stuff
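The same call can be scripted outside R. Here is a rough sketch in Python, assuming pdftoppm is on the PATH. The function names `build_pdftoppm_command` and `convert_pdf` are made up for illustration. Passing the arguments as a list keeps the options ahead of the input file and avoids shell quoting problems with paths that contain spaces.

```python
import subprocess
from pathlib import Path

def build_pdftoppm_command(pdf_path, first_page=1, last_page=10, dpi=300):
    """Build the argument list: options first, then the PDF, then the output root."""
    pdf = Path(pdf_path)
    output_root = pdf.with_suffix("")  # e.g. stuff.pdf -> stuff
    return ["pdftoppm", "-f", str(first_page), "-l", str(last_page),
            "-r", str(dpi), str(pdf), str(output_root)]

def convert_pdf(pdf_path):
    """Run pdftoppm; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_pdftoppm_command(pdf_path), check=True)
```

The last argument is an output root, not a file name: pdftoppm appends the page number and the extension itself.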

In many cases, when someone has a PDF and wants to 'get' the images out, rendering the page itself to an image is satisfactory. However, if you do want to extract the embedded images, you need to be careful which tool you use and investigate its reputation and the quality of its output.

The first important thing to realize is that if a tool claims to "extract the TIFF out of the PDF" or "extract the JPG out of the PDF", it is misleading you, because a PDF doesn't contain JPEG or TIFF files per se. The confusion arises because the compression technologies used by those two raster image formats are also employed in PDF for compressing image data, but that is not the same thing as a JPG file simply 'living' inside a PDF.
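To make the distinction concrete: an image inside a PDF is stored as a stream whose /Filter entry names the compression applied to the data, not as an embedded file. The sketch below lists the standard filter names from the PDF specification with a rough description of each. The function name `describe_filter` and the wording of the descriptions are illustrative only.

```python
# Standard PDF stream filters that commonly appear on image data.
PDF_IMAGE_FILTERS = {
    "DCTDecode": "JPEG (DCT) compression",
    "JPXDecode": "JPEG 2000 compression",
    "CCITTFaxDecode": "CCITT Group 3/4 fax compression, also used in TIFF",
    "JBIG2Decode": "JBIG2 bi-level compression",
    "FlateDecode": "zlib/deflate compression of raw samples",
    "LZWDecode": "LZW compression of raw samples",
    "RunLengthDecode": "run-length encoding of raw samples",
}

def describe_filter(filter_name):
    """Describe a PDF filter name, with or without its leading slash."""
    name = filter_name.lstrip("/")
    return PDF_IMAGE_FILTERS.get(name, "unknown filter")

print(describe_filter("/DCTDecode"))
```

For most of these filters the stream holds only compressed samples, so an extraction tool has to combine them with the image's width, height and color space and write a new file in whatever format it chooses.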

There are many tools out there, but you will find that their quality varies widely. Some handle simple PDFs well but have size limitations, or crash or hang on complex PDFs. Some handle RGB data well but skip or mishandle other color spaces. Some won't give you granular control over the data and will simply extract everything and recompress it as JPEG. To top all of that off, the image data can often be corrupt in some way, and the technology you're using has to handle those cases gracefully.

If you plan on deploying this as part of an enterprise solution, you need a tool capable of handling almost any PDF you can find out there in the wild.

1. Extract the images using pdfimages

pdfimages mydoc.pdf images

2. Use the following extraction script:

./extractImages.py images*

Find the cut-out images in a new images folder. Look at the output in the tracing folder to make sure no images were missed.

Operation

The script processes each input image and looks for shapes inside it. If a shape is found and is larger than a configurable size, the script works out its bounding box, cuts that region out of the image, and saves it in the images folder. In addition, it creates a folder named tracing containing a copy of each input with all the accepted bounding boxes drawn on it.

If you want to find smaller images, decrease minimumWidth and minimumHeight. However, if you set them too low, the script will pick up every individual character.
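The effect of the two thresholds can be seen without OpenCV. The following is a minimal sketch of the size test the script applies to each bounding box; the function name `filter_boxes` and the sample boxes are hypothetical.

```python
def filter_boxes(boxes, minimum_width=100, minimum_height=100):
    """Keep only (x, y, w, h) boxes at least as large as both thresholds."""
    return [(x, y, w, h) for (x, y, w, h) in boxes
            if w >= minimum_width and h >= minimum_height]

# Hypothetical bounding boxes: a figure, a single character, a thin rule
boxes = [(50, 80, 640, 480), (10, 10, 12, 18), (0, 600, 900, 3)]

print(filter_boxes(boxes))
print(filter_boxes(boxes, minimum_width=10, minimum_height=10))
```

With the default thresholds only the figure survives. Lowering both to 10 lets the character-sized box through as well, which is the failure mode described above.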

In my tests it works extremely well; it just finds a few too many images.

extractImages.py

#!/usr/bin/env python3

import cv2
import numpy as np
import os
from pathlib import Path

def extractImagesFromFile(inputFilename, outputDirectory, tracing=False, tracingDirectory=""):
    
    # Settings:
    minimumWidth = 100
    minimumHeight = 100
    greenColor = (36, 255, 12)
    traceWidth = 2
    
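    # Load image, grayscale, Otsu's threshold
    image = cv2.imread(inputFilename)
    if image is None:
        # cv2.imread returns None for unreadable or unsupported files
        print("Could not read image: {}".format(inputFilename))
        return
    original = image.copy()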
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # Find contours, obtain bounding box, extract and save ROI
    ROI_number = 1
    cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    for c in cnts:
        x, y, w, h = cv2.boundingRect(c)
        if w >= minimumWidth and h >= minimumHeight:
            cv2.rectangle(image, (x, y), (x + w, y + h), greenColor, traceWidth)
            ROI = original[y:y+h, x:x+w]
            outImage = os.path.join(outputDirectory, '{}_{}.png'.format(Path(inputFilename).stem, ROI_number))
            cv2.imwrite(outImage, ROI)
            ROI_number += 1
    if tracing:
        outImage = os.path.join(tracingDirectory, Path(inputFilename).stem + '_trace.png')
        cv2.imwrite(outImage, image)

def main(files):

    tracingEnabled = True
    outputDirectory = 'images'
    tracingDirectory = 'tracing'

    # Create the output directory if it does not exist
    outputPath = Path.cwd() / outputDirectory
    outputPath.mkdir(exist_ok=True)

    if tracingEnabled:
        tracingPath = Path.cwd() / tracingDirectory
        tracingPath.mkdir(exist_ok=True)

    for f in files:
        print("Processing {}".format(f))
        if Path(f).is_file():
            extractImagesFromFile(f, outputDirectory, tracingEnabled, tracingDirectory)
        else:
            print("Invalid file: {}".format(f))

if __name__ == "__main__":
    import argparse
    from glob import glob
    parser = argparse.ArgumentParser()  
    parser.add_argument("fileNames", nargs='*') 
    args = parser.parse_args()  
    fileNames = list()  
    for arg in args.fileNames:  
        fileNames += glob(arg)  
    main(fileNames)

Credit

The basic algorithm was provided by nathancy as an answer to this question:

Extract all bounding boxes using OpenCV Python
