Extracting information from a PDF embedded in a web page using Python and requests

Question

I am trying to use Python and requests to extract some information from a PDF embedded in a web page. The sentence I want to reach is « Sciences de la vie et de l'environnement ».

Here is the code I wrote:

import time
import requests  
from bs4 import BeautifulSoup

# website to scrape
url = "https://fs.uit.ac.ma/avis-de-soutenance-dune-these-de-doctorat-mme-achachi-hind/"

with requests.session() as s:
    # get the url from requests get method
    html_content = s.get(url, verify=False)
    # Parse the html content
    soup = BeautifulSoup(html_content.content, "html.parser")
    url2 = soup.iframe["src"]
    html_doc = s.get(url2, verify=False).text
    print(html_doc)

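Here is part of the output of print(html_doc):

[Screenshot: printed output]

Comparing the two screenshots, the printed output does not show what is inside this element: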

 <div id="viewer" class="pdfViewer"></div>

The text I want is inside this line:

[Screenshot: the line I want to reach]

Answer 1

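You can access the PDF directly ( https://fs.uit.ac.ma/wp-content/uploads/2022/02/AVIS-DE-SOUTENANCE-ACHACHI-HIND.pdf ). Its URL appears in the iframe and in the network requests the page makes. The viewer div is empty in your output because requests only downloads the HTML and the viewer fills that div with JavaScript in the browser. If you cannot get the URL from the source code, you have to capture the requests instead (for example with BrowserMob).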
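A minimal sketch of that approach, under some assumptions: the iframe `src` either points at the PDF itself or is a viewer URL that carries the PDF location in a `file` query parameter (I have not verified which one this site uses), and the PDF contains real text rather than a scanned image. The helper names `pdf_url_from_viewer`, `extract_pdf_text` and `contains_phrase` are made up for this example, and `pypdf` is a third-party package that the question does not use.

```python
import io
import re
from urllib.parse import parse_qs, urljoin, urlparse


def pdf_url_from_viewer(viewer_url, page_url):
    """Return the PDF URL referenced by an iframe src, or None if not found."""
    absolute = urljoin(page_url, viewer_url)
    parsed = urlparse(absolute)
    # Case 1: the iframe already points at the PDF file.
    if parsed.path.lower().endswith(".pdf"):
        return absolute
    # Case 2 (assumption): the viewer gets the PDF in a `file` query parameter.
    candidates = parse_qs(parsed.query).get("file", [])
    if not candidates:
        return None
    # parse_qs has already percent-decoded the value; resolve relative paths.
    return urljoin(absolute, candidates[0])


def extract_pdf_text(pdf_bytes):
    """Extract the text of every page with pypdf (pip install pypdf)."""
    from pypdf import PdfReader  # imported here so the helpers above stay stdlib-only

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def contains_phrase(text, phrase):
    """Case-insensitive search that ignores line breaks and curly apostrophes."""
    def normalize(s):
        s = s.replace("\u2019", "'")
        return re.sub(r"\s+", " ", s).strip().lower()

    return normalize(phrase) in normalize(text)


def main():
    import requests
    from bs4 import BeautifulSoup

    page_url = "https://fs.uit.ac.ma/avis-de-soutenance-dune-these-de-doctorat-mme-achachi-hind/"
    with requests.Session() as s:
        # verify=False is copied from the question; it disables TLS verification.
        html = s.get(page_url, verify=False).content
        soup = BeautifulSoup(html, "html.parser")
        pdf_url = pdf_url_from_viewer(soup.iframe["src"], page_url)
        if pdf_url is None:
            raise SystemExit("No PDF URL found in the iframe src")
        pdf_bytes = s.get(pdf_url, verify=False).content

    text = extract_pdf_text(pdf_bytes)
    print(contains_phrase(text, "Sciences de la vie et de l'environnement"))

# Call main() to run this against the live page (needs network access).
```

If `extract_pdf_text` returns an empty string, the PDF is probably a scanned image and would need OCR instead. The phrase search normalizes whitespace and apostrophes because text extracted from a PDF often breaks lines mid-sentence or uses typographic quotes.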

Solution 1
1 accepted 2022-03-04 14:58:59
