How to extract text from this compressed PDF/A?

For machine learning purposes (scikit-learn), I need to extract the raw text from lots of PDF files. First off, I was using xpdf's pdftotext to do this task:

import os
import subprocess

exe = os.path.join(xpdf_path, "pdftotext.exe")
# Pass the arguments as a list so paths with spaces need no manual quoting
subprocess.check_output([exe, pdf, pdf + ".txt"])
with open(pdf + ".txt") as f:
    texto_converted = f.read()

But unfortunately, for a few of them I was unable to get the text, because they use "stream" objects in their PDF source, like this one.

The result is something like this:

59!"#$%&'()*+,-.#/#01"21"" 345667.0*(879:4$;<;4=<6>4?$@"12!/ 21#$@A$3A$>@>BCDCEFGCHIJKIJLMNIJILOCNPQRDS QPFTRPUCTCVQWBCTTQXFPYTO"21 "#/!"#(Z[12\&A+],$3^_3;9`Z &a# .2"#.b#"(#c#A(87*95d$d4?$d3e#Z"f#\"#2b?2"#`Z 2"!eb2"#H1TBRgF JhiO
jFK# 2"k#`Z !#212##"elf/e21m#*c!n2!!#/bZ!#2#`Z "eo ]$5<$@;A533> "/\ko/f\#e#e#p

I even tried using zlib + regex:

import re
import zlib

with open("pdfa.pdf", "rb") as f:
    pdf = f.read()
stream = re.compile(b'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s).decode('UTF-8'))
        print("")
    except (zlib.error, UnicodeDecodeError):
        # skip streams that are not valid zlib data or not UTF-8 text
        pass

The result was something like this:

1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm

I even tried pdftopng (xpdf) so that I could run tesseract on the images afterwards, without success. So, is there any way to extract pure text from a PDF like that, using Python or a third-party app?

If you want to decompress the streams in a PDF file, I can recommend using qpdf, but on this file

 qpdf --decrypt --stream-data=uncompress document.pdf out.pdf

doesn't help either.

I am not sure, though, why your efforts with xpdf and tesseract did not work out. Using ImageMagick's convert to create PNG files in a temporary directory, followed by tesseract, you can do:

import os
from pathlib import Path
from tempfile import TemporaryDirectory
import subprocess

DPI=600

def call(*args):
    cmd = [str(x) for x in args]
    # print(' '.join(cmd))  # uncomment to follow progress
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')

def ocr(docpath, lang):
    result = []
    abs_path = Path(docpath).expanduser().resolve()
    old_dir = os.getcwd()
    out = Path('out.txt')
    with TemporaryDirectory() as tmpdir:
        os.chdir(tmpdir)
        try:
            call('convert', '-density', DPI, abs_path, 'out.png')
            index = -1
            while True:
                # names have no leading zeros on the digits, would be difficult to sort glob() output
                # so just count them
                index += 1
                png = Path(f'out-{index}.png')
                if not png.exists():
                    break
                call('tesseract', '--dpi', DPI, png, out.stem, '-l', lang)
                # tesseract writes its text output as UTF-8
                result.append(out.read_text(encoding='utf-8'))
        finally:
            # leave the temporary directory before it is removed, even on errors
            os.chdir(old_dir)
    return result

pages = ocr('~/Downloads/document.pdf', 'por')
print('\n'.join(pages[1].splitlines()[21:24]))

which gives:

DA NÃO REALIZAÇÃO DE AUDIÊNCIA DE AUTOCOMPOSIÇÃO NO CASO EM CONCRETO

Com vista a obter maior celeridade processual, assim como da impossibilidade de conciliação entre

If you are on Windows, make sure your PDF file is not open in a different process (like a PDF viewer), as Windows doesn't seem to like that.

The final print is limited, as the full output is quite large.

The conversion and OCR take a while, so you might want to uncomment the print in call() to get some sense of progress.
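
The comment in ocr() mentions that the out-N.png names have no leading zeros, which makes glob() output awkward to sort. As an alternative to counting, the page images could be sorted by their numeric suffix. A minimal sketch; page_number and sorted_pages are hypothetical helper names, not part of the code above:

```python
import re
from pathlib import Path

def page_number(path):
    """Return the integer N from a name like 'out-N.png', or -1 if absent."""
    match = re.search(r'-(\d+)\.png$', Path(path).name)
    return int(match.group(1)) if match else -1

def sorted_pages(paths):
    """Sort page images numerically rather than lexicographically."""
    return sorted(paths, key=page_number)

names = ['out-10.png', 'out-2.png', 'out-0.png', 'out-1.png']
print(sorted_pages(names))
# ['out-0.png', 'out-1.png', 'out-2.png', 'out-10.png']
```

Inside ocr(), the while loop could then become a plain for loop over sorted_pages(Path('.').glob('out-*.png')).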

There are two fairly simple techniques you can use.

1) Google's "Tesseract" open source OCR (optical character recognition). You could apply this uniformly to all PDFs, though converting all that data into pixels and then working magic upon them is going to be more computationally expensive. Which is more important, engineer time or CPU time? There's a pytesseract module. Note that this tool works on image formats, so you'd have to use something like Ghostscript (another open source project) to convert all of a PDF's pages to images, then run [py]tesseract on those images.

2) pyPDF can get each page and programmatically extract any text draw operations in the order they were drawn onto the page. This may be nothing like the logical reading order of the page... While a PDF could draw all the 'a's and then all the 'b's (and so forth), it's actually more efficient to draw everything in "font a", then everything in "font b". It's important to note that "font b" might just be the italic version of "font a". This produces a shorter/more efficient stream of drawing commands, though probably not by enough to make doing so a good business decision.
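
To illustrate the drawing-order point: suppose an extractor hands back text fragments in the order they were drawn, together with their page coordinates. Sorting by position recovers something closer to reading order. This is a toy sketch with made-up fragments and coordinates (PDF's origin is at the bottom left, so a larger y is higher on the page); reading_order is a hypothetical helper, and real pages with columns or rotated text need more care.

```python
# Hypothetical fragments as (x, y, text), listed in drawing order:
# everything in the regular font first, then the italic words.
drawn = [
    (72, 700, "The"),
    (140, 700, "brown fox"),
    (72, 680, "jumps over the"),
    (100, 700, "quick"),      # italic, drawn later
    (160, 680, "lazy dog"),   # italic, drawn later
]

def reading_order(fragments, line_tolerance=2):
    """Group fragments into lines by y, then order each line left to right."""
    lines = []
    # Visit fragments from the top of the page downwards.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))      # same line as the previous one
        else:
            lines.append((y, [(x, text)]))      # start a new line
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]

print(" ".join(t for _, _, t in drawn))
# The brown fox jumps over the quick lazy dog
print(reading_order(drawn))
# ['The quick brown fox', 'jumps over the lazy dog']
```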

The kicker here is that a random pile of PDF files might require you to do some OCR. A poorly assembled PDF (one with a font subset that has no "to unicode" data) can't be properly mined for text even though it has nothing but text drawing operations. "Draw glyphs one through five from font C" doesn't mean much if you don't know that those first five glyphs are "glyph", because that's the order they were used in.

On the other hand, if you've got home-grown PDFs or all your PDFs are from some known source (Word's PDF converter for example), you'll know what to expect in advance.
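
One way to act on this for a mixed pile of PDFs is to try plain text extraction first and fall back to OCR only when the result looks garbled. A rough heuristic sketch: the function name, the 0.6 threshold and the "share of word-like tokens" measure are all assumptions to tune on your own documents, not an established rule.

```python
import re

def looks_like_text(text, min_ratio=0.6):
    """Guess whether extracted text is readable prose.

    Counts whitespace-separated tokens made up only of letters
    (optionally followed by one punctuation mark) and compares their
    share to min_ratio. The threshold is an assumption; tune it.
    """
    tokens = text.split()
    if not tokens:
        return False
    wordlike = [t for t in tokens if re.fullmatch(r"[^\W\d_]+[.,;:!?]?", t)]
    return len(wordlike) / len(tokens) >= min_ratio

# Samples taken from the outputs quoted in the question and the first answer.
garbled = '59!"#$%&\'()*+,-.#/#01"21"" 345667.0*(879:4$;<;4=<6>4?$@"12!/'
clean = "Com vista a obter maior celeridade processual"

print(looks_like_text(garbled))  # False
print(looks_like_text(clean))    # True
```

Files for which looks_like_text() returns False would then go through the slower OCR path.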

Note that the only thing mentioned above that I've actually used is Ghostscript. I remember it having a solid command line interface, which we used to generate images for some online PDF viewer Many Years Ago.
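
For reference, a minimal sketch of the Ghostscript-then-Tesseract route from option 1. It assumes the gs executable is on the PATH (on Windows it is typically gswin64c) and that the pytesseract and Pillow packages are installed. The function names and the 300 DPI default are illustrative choices, and this is untested against the file in question.

```python
import subprocess
from pathlib import Path

def ghostscript_cmd(pdf, out_dir, dpi=300, gs="gs"):
    """Build the Ghostscript command that renders each page to a PNG."""
    pattern = Path(out_dir) / "page-%04d.png"   # zero-padded so names sort
    return [gs, "-dNOPAUSE", "-dBATCH", "-dSAFER", "-sDEVICE=png16m",
            f"-r{dpi}", f"-sOutputFile={pattern}", str(pdf)]

def ocr_pdf(pdf, out_dir, lang="por"):
    """Render the PDF to images, then OCR each page with pytesseract."""
    # Imported here so the command builder works without these packages.
    import pytesseract
    from PIL import Image

    subprocess.run(ghostscript_cmd(pdf, out_dir), check=True)
    pages = sorted(Path(out_dir).glob("page-*.png"))
    return [pytesseract.image_to_string(Image.open(p), lang=lang) for p in pages]

# Show the command without running anything.
print(ghostscript_cmd("document.pdf", "pages"))
```

The output directory has to exist before ocr_pdf() is called, and the lang value needs the matching Tesseract language data installed.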
