Extracting text from a PDF file URL with Python

Question

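I want to extract text from a PDF file on a website. The site contains a link to the PDF document, but when I click the link the file is downloaded automatically. Is it possible to extract the text from the file without downloading it?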

import fitz  # this is pymupdf lib for text extraction
from bs4 import BeautifulSoup
import requests
from io import StringIO

url = "https://www.blv.admin.ch/blv/de/home/lebensmittel-und-ernaehrung/publikationen-und-forschung/statistik-und-berichte-lebensmittelsicherheit.html"

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}


response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

all_news = soup.select("div.mod.mod-download a")[0]
pdf = "https://www.blv.admin.ch"+all_news["href"]

#https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf

This is the code that extracts the text from the PDF. It works well once the file has been downloaded:

my_pdf_doc = fitz.open(pdf)
text = ""
for page in my_pdf_doc:
    text += page.get_text()  # named getText() in older PyMuPDF releases

print(text)

The same question applies when the link does not download the PDF automatically, for example this link:

"https://amsoldingen.ch/images/files/Bekanntgabe-Stimmausschuss-13.12.2020.pdf"

How can I extract the text from that file?

I also tried this:

pdf_content = requests.get(pdf)
print(type(pdf_content.content))

file = StringIO() 
print(file.write(pdf_content.content.decode("utf-32")))

But I get this error:

Traceback (most recent call last):
  File "/Users/aleksandardevedzic/Desktop/pdf extraction scrapping.py", line 25, in <module>
    print(file.write(pdf_content.content.decode("utf-32")))
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

Answer 1

PyMuPDF lets us open a BytesIO stream directly, as described in the documentation.

import requests
import fitz
import io

url = "your-url.pdf"
request = requests.get(url)
filestream = io.BytesIO(request.content)
pdf = fitz.open(stream=filestream, filetype="pdf")

pdf can then be parsed like any regular PyMuPDF document.
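For context on why the StringIO attempt in the question fails while BytesIO works: a PDF is binary data, not encoded text, so decoding it with a text codec raises an error. The sketch below uses only the standard library; pdf_bytes is a hypothetical stand-in for response.content, not data from the real file.

```python
from io import BytesIO

# Hypothetical stand-in for response.content: the first bytes of a typical PDF.
pdf_bytes = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

# Treating the bytes as UTF-32 text fails, as in the question's traceback.
try:
    pdf_bytes.decode("utf-32")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# BytesIO keeps the data as raw bytes, which is what a PDF parser expects.
stream = BytesIO(pdf_bytes)
header = stream.read(5)
stream.seek(0)  # rewind so a parser would start at the beginning

print(decoded_ok, header)  # False b'%PDF-'
```

The rewound stream is the kind of object that is passed as stream= in the code above.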

PS: This is my first answer on Stack Overflow, so any improvements or suggestions are welcome.

Answer 2

Here is an example that uses PyPDF2.

Installation

pip install PyPDF2

import requests, PyPDF2
from io import BytesIO

url = 'https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf'
response = requests.get(url)
my_raw_data = response.content

with BytesIO(my_raw_data) as data:
    # PyPDF2 3.x names; older releases used PdfFileReader, getNumPages, getPage and extractText
    read_pdf = PyPDF2.PdfReader(data)

    for page in read_pdf.pages:
        print(page.extract_text())

Output:

' 1/21  Fad \nŒ 24.08.2020\n      Bericht 2017\n Œ 2019: Öffentliche Warnungen, \nRückrufe und Schnellwarnsystem RASFF\n      '
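The extracted text above contains stray newlines and runs of spaces. If that matters for later processing, one possible cleanup step is to collapse all whitespace, sketched below with the standard library only. The helper name collapse_whitespace is an illustration and not part of PyPDF2; raw simply repeats the output shown above.

```python
# The sample output from above, written as a Python string.
raw = (
    " 1/21  Fad \nŒ 24.08.2020\n      Bericht 2017\n Œ 2019: Öffentliche Warnungen, \n"
    "Rückrufe und Schnellwarnsystem RASFF\n      "
)


def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs and newlines with a single space."""
    return " ".join(text.split())


print(collapse_whitespace(raw))
```

This flattens line breaks as well, so it only suits cases where the page layout is not needed.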

Answer 3

I completed @Vihaan Thora's solution, and this version works for me:

!pip install PyMuPDF

import requests
import fitz
import io

url = "https://www.livelaw.in/pdf_upload/vsa02052022matfc1162021145829-416435.pdf"
request = requests.get(url)
filestream = io.BytesIO(request.content)
with fitz.open(stream=filestream, filetype="pdf") as doc:
    detail_judgement = ""
    for page in doc:
        detail_judgement += page.get_text()
print(detail_judgement)
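One case the code above does not cover: a failed or redirected request can return an HTML error page instead of a PDF, and the parser then fails with a less obvious message. A minimal sketch of a guard, assuming a check on the PDF header bytes is good enough for your sources, could look like this. The name pdf_stream_or_error and the 1024-byte window are choices made for this illustration.

```python
from io import BytesIO
from typing import Optional

PDF_MAGIC = b"%PDF-"


def pdf_stream_or_error(body: bytes, content_type: Optional[str] = None) -> BytesIO:
    """Return a BytesIO over body if it looks like a PDF, else raise ValueError."""
    # Lenient heuristic: look for the header in the first 1024 bytes
    # rather than insisting that it is at offset 0.
    if PDF_MAGIC not in body[:1024]:
        raise ValueError(
            f"response does not look like a PDF (Content-Type: {content_type!r})"
        )
    return BytesIO(body)
```

With the code above it would be called as pdf_stream_or_error(request.content, request.headers.get("Content-Type")), and the returned stream passed to fitz.open(stream=..., filetype="pdf").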

Answer 4

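It is not possible to read a web application/pdf file that sits at a remote location, such as a server, without a "download". The browser, reader, or text extractor runs locally, and HTTPS security requires that the file be transferred to, and worked on, the local machine as a hypertext transfer (unless the server has been specifically configured to let clients administratively edit the files it serves, which is unlikely).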

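Both of your example links download immediately in my browser, because my browser's user settings allow secure downloads only and will not run an exploitable viewer inside the browser.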

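So to extract the text you end up with a temporary copy in the local device's file-system memory (which usually means the hard-disk cache). As others have suggested, this can be done with Python file-stream I/O, but it is not much different from how a download works.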
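To make the point about a temporary local copy concrete, here is a small standard-library sketch. tempfile.SpooledTemporaryFile keeps the data in RAM up to a size limit and spills to a temporary file on disk beyond it. The helper name spool_bytes, the 10 MiB default, and the sample payload are assumptions made for this illustration.

```python
import tempfile


def spool_bytes(data: bytes, max_in_memory: int = 10 * 1024 * 1024):
    """Hold data in RAM up to max_in_memory bytes, spilling to a temp file beyond that."""
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    spool.write(data)
    spool.seek(0)  # rewind so the next reader starts at the first byte
    return spool


# Hypothetical payload standing in for the downloaded bytes.
with spool_bytes(b"%PDF-1.4 hypothetical payload") as f:
    first = f.read(5)

print(first)  # b'%PDF-'
```

Either way the bytes exist locally for as long as the file object is open, which is the sense in which this still counts as a download.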

The file can be transferred through memory to temporary I/O and then used as valid file bytes, for example with

curl -O https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf

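Then run the relevant operating-system commands, from Python if needed. The "-" sends the text to standard output so that it can be piped; find is the Windows filter, and on Linux or macOS grep does the same job.

pdftotext Jahresbericht_2017-2019_DE.pdf - | find "whatever you need"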
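The same pipeline can be driven from Python with the subprocess module. This is a sketch only: it assumes the pdftotext command-line tool is installed and on the PATH, and the function names pdftotext_stdout and matching_lines are made up for this example.

```python
import subprocess
from typing import List


def pdftotext_stdout(pdf_path: str) -> str:
    """Run pdftotext on a local file and return the extracted text."""
    # "-" as the output argument makes pdftotext write to stdout instead of a .txt file.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def matching_lines(text: str, needle: str) -> List[str]:
    """Return the lines of text that contain needle, like find or grep would."""
    return [line for line in text.splitlines() if needle in line]
```

For the file fetched above, matching_lines(pdftotext_stdout("Jahresbericht_2017-2019_DE.pdf"), "RASFF") would list the lines that mention RASFF.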


Answer 1
2 2022-03-06 18:27:50

Answer 2
1 2020-11-25 01:15:12

Answer 3
0 2022-05-02 11:49:49

Answer 4
0 2022-05-03 00:47:24
