
Extract text from a PDF file with Python

Original post: 2019-02-26 03:48:52 · Tags: python, pdf, text-extraction

I would like to extract text, including tables, from a PDF file.

I tried camelot, but it can only get table data, not text.

I also tried PDF2; however, it can't read Chinese characters.

Here is the PDF sample to read.

Are there any recommended text-extraction Python packages?

1 Answer

By far the simplest way is to extract the text with a single OS shell command using the poppler PDF utilities (often bundled with Python libraries), then modify that output in a Python script as required.

>pdftotext -layout -f 1 -l 1 -enc UTF-8 sample.pdf
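The same call can be driven from Python so that the output is available for further processing. The following is a minimal sketch, assuming the poppler `pdftotext` binary is on `PATH`; the function names `build_pdftotext_cmd`, `extract_text`, and `nonblank_lines` are hypothetical helpers introduced here for illustration, and passing `-` as the output file sends the text to stdout instead of writing `sample.txt`.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=1, last_page=1):
    """Build the argument list for the pdftotext call (hypothetical helper)."""
    return [
        "pdftotext",
        "-layout",               # keep the physical layout, useful for tables
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        "-enc", "UTF-8",         # output encoding, needed for Chinese text
        pdf_path,
        "-",                     # "-" writes the text to stdout
    ]


def extract_text(pdf_path, first_page=1, last_page=1):
    """Run pdftotext and return its output as a Python string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler utilities) was not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8")


def nonblank_lines(text):
    """Drop empty lines and trailing spaces; keep the leading layout padding."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]
```

Calling `nonblank_lines(extract_text("sample.pdf"))` would then give a list of lines that can be split into columns or searched as needed. Leading whitespace is kept on purpose, because `-layout` uses it to align table columns.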

NOTE: some of the text is embedded to the right of the logo image. That area can be extracted separately as an image using pdftoppm -png or pdfimages and then passed to OCR tools, which give lower-quality output, for those smaller areas.
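The image route for those smaller areas could be sketched as follows. This is an illustration under several assumptions, not the answerer's exact method: the OCR tool is assumed to be the `tesseract` command-line program with Chinese language data (`chi_sim`) installed, the crop coordinates are placeholders that would have to be measured on the actual page, and the helper names are hypothetical.

```python
import subprocess


def build_crop_cmd(pdf_path, out_prefix, x, y, width, height, page=1, dpi=300):
    """Build a pdftoppm call that renders one cropped region of a page to PNG."""
    return [
        "pdftoppm", "-png", "-singlefile",   # writes <out_prefix>.png
        "-f", str(page), "-l", str(page),    # render a single page
        "-r", str(dpi),                      # resolution in DPI
        "-x", str(x), "-y", str(y),          # top-left corner of the crop, in pixels
        "-W", str(width), "-H", str(height), # crop size, in pixels
        pdf_path, out_prefix,
    ]


def build_ocr_cmd(image_path, lang="chi_sim+eng"):
    """Build a tesseract call that prints recognized text to stdout (assumed OCR tool)."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_region(pdf_path, x, y, width, height, page=1, out_prefix="region"):
    """Crop a page region with pdftoppm, then OCR the resulting PNG."""
    subprocess.run(
        build_crop_cmd(pdf_path, out_prefix, x, y, width, height, page),
        check=True,
    )
    result = subprocess.run(
        build_ocr_cmd(out_prefix + ".png"),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Because the crop is given in pixels at the chosen resolution, changing `dpi` also changes the coordinates that cover the same area of the page.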



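Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you republish this content, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.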
