Extract text from pdf file with Python

I would like to extract text, including tables, from a pdf file.

I tried camelot, but it can only get table data, not text.

I also tried PDF2; however, it can't read Chinese characters.

Here is the pdf sample to read.

Are there any recommended text-extraction Python packages?

By far the simplest way is to extract the text with a single OS shell command, using the poppler pdf utility tools (often included in Python libraries), then modify that output in your own Python script as required.

>pdftotext -layout -f 1 -l 1 -enc UTF-8 sample.pdf
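
If you would rather drive that command from Python than from the shell, a minimal sketch could look like the following. It assumes the poppler `pdftotext` binary is on your PATH; the helper names (`build_pdftotext_cmd`, `extract_text`, `clean_lines`) are made up for illustration, and `sample.pdf` is just the file name from the command above. Passing `-` as the output file asks pdftotext to write to stdout instead of `sample.txt`.

```python
import os
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=1, last_page=1):
    """Build the same pdftotext call as the shell command; '-' sends the text to stdout."""
    return [
        "pdftotext", "-layout",
        "-f", str(first_page), "-l", str(last_page),
        "-enc", "UTF-8",
        pdf_path, "-",
    ]


def extract_text(pdf_path, first_page=1, last_page=1):
    """Run pdftotext and return the extracted text as a str."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler) was not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,
    )
    # The command requested UTF-8, so decode the raw bytes accordingly.
    return result.stdout.decode("utf-8")


def clean_lines(text):
    """Example post-processing: drop blank lines and trailing whitespace."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Only attempt the extraction when both the tool and the sample file exist.
    if shutil.which("pdftotext") and os.path.exists("sample.pdf"):
        for line in clean_lines(extract_text("sample.pdf")):
            print(line)
```

`clean_lines` keeps leading whitespace on purpose, because `-layout` uses it to preserve column positions in tables.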

NOTE: some of the text is embedded in an image to the right of the logo. That can be extracted separately using pdftoppm -png or pdfimages, then passed to OCR tools for those smaller areas, although OCR output is of inferior quality.
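
As a rough sketch of the image step in the note above, the snippet below renders one cropped region of a page to a PNG with `pdftoppm`, ready to hand to whichever OCR tool you choose. The function name `build_pdftoppm_cmd`, the output name `logo_text`, and the crop rectangle are all hypothetical; the real pixel coordinates of the area beside the logo have to be measured on your own file.

```python
import os
import shutil
import subprocess


def build_pdftoppm_cmd(pdf_path, out_root, page=1, dpi=300, crop=None):
    """Build a pdftoppm call that writes <out_root>.png for a single page.

    crop is an optional (x, y, width, height) tuple in pixels at the given dpi.
    """
    cmd = [
        "pdftoppm", "-png", "-singlefile",
        "-r", str(dpi),
        "-f", str(page), "-l", str(page),
    ]
    if crop is not None:
        x, y, width, height = crop
        cmd += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    return cmd + [pdf_path, out_root]


if __name__ == "__main__":
    # Hypothetical crop rectangle; adjust it to the region you need.
    cmd = build_pdftoppm_cmd("sample.pdf", "logo_text", crop=(100, 50, 800, 200))
    if shutil.which("pdftoppm") and os.path.exists("sample.pdf"):
        subprocess.run(cmd, check=True)  # writes logo_text.png
```

A higher `dpi` generally gives OCR more pixels to work with, at the cost of a larger image.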

