Extracting text from a PDF using Python in a repl

Question

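I am trying to read data from a PDF in Python, and I am using a repl.it file because it makes it easier to test different libraries. I have tried PyPDF2 and PyPDF4; both work but do not produce any whitespace. tika gave me a server startup error, pdfminer does not work, and pdfminer3 works but without whitespace. pdftotext would not download properly. I would like to know whether there is clearer documentation on how to get pdfminer3 to produce whitespace, or whether there are other libraries I could try.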

Answer 1

Have you tried tika again? From other posts, I gather it is a very good solution.

I was able to install tika by following the instructions here:

https://github.com/chrismatmann/tika-python

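and successfully parsed a test PDF file.

I followed these steps to use tika with Python:

1) Install (with pip):

pip install tika

2) Create and run a test Python script (replacing myfile.pdf with the path to your own PDF file, of course):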

#!/usr/bin/env python
import tika
from tika import parser

tika.initVM()

# Parse the PDF; the result is a dict with "metadata" and "content" entries
parsedPDF = parser.from_file('myfile.pdf')
print(parsedPDF["metadata"])
print(parsedPDF["content"])

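Note that, given the error you got about the tika server failing to start, you may also want to look at this post:

Using tika with Python, runtime error: unable to start tika server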

The currently most upvoted answer on that post basically says to make sure that you have Java installed, and that your installation is at Java 8, as all new versions of the tika-server.jar will require Java 8.
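As a rough way to check the Java requirement described above before calling tika, the sketch below looks for a `java` executable on the PATH and reads its major version. The helper names `parse_java_major` and `installed_java_major` are made up for this illustration and are not part of tika; it also assumes that `java -version` prints a line containing `version "..."`, which is common but not guaranteed for every JVM.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as "1.8", "1.7", and so on.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Return the major version of the `java` on PATH, or None if unavailable."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` usually writes to stderr, so inspect both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)
```

If `installed_java_major()` returns `None`, or a number below 8, that would be consistent with the server startup error mentioned in the question, and installing or upgrading Java is the first thing to try.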

Hope this helps, and good luck!

Answer 2

# Import the PyPDF2 library
# Note: these are the legacy names; PyPDF2 3.x renames them to
# PdfReader, len(reader.pages), reader.pages[0] and extract_text()
import PyPDF2
# Open the PDF file in binary read mode
pdf_file = open('example.pdf', 'rb')
# Create a PDF reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
# Print the number of pages in the PDF
print(pdf_reader.numPages)
# Get the page object for the first page
page_obj = pdf_reader.getPage(0)
# Extract and print the text of that page
print(page_obj.extractText())
# Close the PDF file object
pdf_file.close()
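
The snippet above reads only the first page. A minimal sketch of looping over every page could look like the following, assuming the same legacy PyPDF2 interface (`numPages`, `getPage`, `extractText`). The function names `extract_all_text` and `extract_text_from_path` are invented for this example. Note that this only gathers the text of all pages; it does not by itself fix the missing whitespace described in the question.

```python
def extract_all_text(reader):
    """Return the text of every page in `reader`, separated by blank lines.

    `reader` is assumed to expose the legacy PyPDF2 interface:
    numPages, getPage(index), and page.extractText().
    """
    pages = []
    for index in range(reader.numPages):
        page = reader.getPage(index)
        pages.append(page.extractText())
    return "\n\n".join(pages)


def extract_text_from_path(path):
    """Open the PDF at `path` and return the text of all its pages."""
    # Imported lazily so extract_all_text stays usable without PyPDF2 installed.
    import PyPDF2

    # The with-block closes the file even if extraction raises an error.
    with open(path, "rb") as pdf_file:
        return extract_all_text(PyPDF2.PdfFileReader(pdf_file))
```

Calling `extract_text_from_path('example.pdf')` would then return one string for the whole document.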

Solution 1
0 2019-10-12 03:53:16

Solution 2
0 2019-10-12 03:55:03
