
Extract some information in a pdf embedded in a web page using python and requests

I am trying to extract some information from a PDF embedded in a web page using Python and requests. This is exactly the text I want to reach: « Sciences de la vie et de l'environnement ».

image

Here is the code I wrote:

import requests
from bs4 import BeautifulSoup

# website to scrape
url = "https://fs.uit.ac.ma/avis-de-soutenance-dune-these-de-doctorat-mme-achachi-hind/"

with requests.Session() as s:
    # fetch the page (certificate verification is disabled here)
    html_content = s.get(url, verify=False)
    # parse the HTML content
    soup = BeautifulSoup(html_content.content, "html.parser")
    # take the src of the first iframe, which embeds the PDF viewer
    url2 = soup.iframe["src"]
    # fetch the viewer page that the iframe points to
    html_doc = s.get(url2, verify=False).text
    print(html_doc)

Here is part of what print(html_doc) outputs:

Print result

Comparing the two pictures, I can't see the contents of this element from the last picture in my output:

 <div id="viewer" class="pdfViewer"></div>

The text I want is inside this element:

The line I want to reach

You can access the PDF directly ( https://fs.uit.ac.ma/wp-content/uploads/2022/02/AVIS-DE-SOUTENANCE-ACHACHI-HIND.pdf ). Its URL appears in the iframe and in the request. If there is no way to get the URL from the source code, you have to capture the network requests (e.g. with BrowserMob).
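As a rough sketch of that approach: the helper below pulls a PDF URL out of an iframe src, and a second function downloads the PDF and extracts its text. Two things here are assumptions, not facts from the page: that the viewer receives the PDF location in a `file` query parameter (the function returns None if it does not), and that the PDF contains a text layer that the third-party pypdf package can read. The function names are made up for illustration.

```python
import io
from urllib.parse import parse_qs, urljoin, urlparse


def pdf_url_from_iframe_src(src, page_url):
    """Return the PDF URL carried by an iframe src, or None if none is found."""
    absolute = urljoin(page_url, src)  # resolve relative src values
    parsed = urlparse(absolute)
    # Case 1: the iframe points straight at the PDF.
    if parsed.path.lower().endswith(".pdf"):
        return absolute
    # Case 2 (assumption): the viewer takes the PDF location in a
    # `file` query parameter, e.g. viewer.php?file=<encoded url>.
    values = parse_qs(parsed.query).get("file")
    if values:
        return urljoin(absolute, values[0])
    return None


def pdf_text(pdf_url):
    """Download a PDF and return the text of all its pages."""
    # Third-party packages, imported here so the URL helper above
    # works without them: pip install requests pypdf
    import requests
    from pypdf import PdfReader

    # verify=False mirrors the question's code; drop it if the
    # site's certificate validates.
    response = requests.get(pdf_url, verify=False, timeout=30)
    response.raise_for_status()
    reader = PdfReader(io.BytesIO(response.content))
    # extract_text() can yield nothing for scanned, image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

With the question's variables, this would be used as pdf_url = pdf_url_from_iframe_src(url2, url), followed by a check such as "Sciences de la vie et de l'environnement" in pdf_text(pdf_url). If the PDF turns out to be a scanned image, text extraction will come back empty and OCR would be needed instead.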

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.

 