Downloading multiple PDFs from a website using web scraping

Hi everyone, I need some help with my web scraper. I want to download hundreds of PDF files from https://jbiomedsci.biomedcentral.com/, as many biomedical PDFs as I can get from the site. I built the scraper using some answers from this website, but I can't get it to work properly.

My aim is to download the PDFs and store them in a specific folder. I would be grateful for any help with this.

import os
import re
from urllib import request

from bs4 import BeautifulSoup

url = "https://jbiomedsci.biomedcentral.com/articles"
response = request.urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(.pdf)'))

# Build absolute URLs from the matched links
url_list = []
for el in links:
    if el['href'].startswith('http'):
        url_list.append(el['href'])
    else:
        url_list.append("https://jbiomedsci.biomedcentral.com" + el['href'])

print(url_list)

# Download each PDF into the target folder
for url in url_list:
    print(url)
    pathname = "C:/Users/SciencePDF/"
    fullfilename = os.path.join(pathname, url.replace("https://jbiomedsci.biomedcentral.com/articles", ""))
    print(fullfilename)
    request.urlretrieve(url, fullfilename)

I've modified your script to make it work. The following script creates a folder in the directory you run it from and stores the downloaded PDF files in that folder.

import os
import requests
from bs4 import BeautifulSoup

base = 'https://jbiomedsci.biomedcentral.com{}'
url = 'https://jbiomedsci.biomedcentral.com/articles'

res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# Name the output folder after the last URL segment ("articles");
# exist_ok avoids an error when the script is run a second time
foldername = url.split("/")[-1]
os.makedirs(foldername, exist_ok=True)

# Each "Download PDF" link holds a site-relative path to one article
for pdf in soup.select("a[data-track-action='Download PDF']"):
    filename = pdf['href'].split("/")[-1]
    pdf_link = base.format(pdf['href']) + ".pdf"
    with open(f"{foldername}/{filename}.pdf", 'wb') as f:
        f.write(requests.get(pdf_link).content)
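
If it helps, here is a slightly more defensive sketch of the same idea. It assumes, as the script above does, that each "Download PDF" link's href is a path whose last segment is the article ID and that appending ".pdf" gives the file URL. The helper names (pdf_url, local_path, download_all) are my own, not part of any library. Using only the last path segment as the file name is probably also what the original attempt was missing: there, most of the URL ended up inside the file path.

```python
import os
from urllib.parse import urljoin

SITE = "https://jbiomedsci.biomedcentral.com"
LISTING_URL = SITE + "/articles"


def pdf_url(href):
    """Turn a link target from the listing page into an absolute PDF URL."""
    absolute = urljoin(SITE, href)  # leaves already-absolute links unchanged
    # The script above appends ".pdf"; skip that if it is already present.
    return absolute if absolute.endswith(".pdf") else absolute + ".pdf"


def local_path(folder, href):
    """Build the target file path from the last path segment of the link only."""
    name = href.rstrip("/").split("/")[-1]
    if not name.endswith(".pdf"):
        name += ".pdf"
    return os.path.join(folder, name)


def download_all(folder="articles"):
    """Fetch every PDF linked from the listing page into `folder`."""
    # Third-party imports live here so the helpers above need only the stdlib.
    import requests
    from bs4 import BeautifulSoup

    os.makedirs(folder, exist_ok=True)
    page = requests.get(LISTING_URL, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    for link in soup.select("a[data-track-action='Download PDF']"):
        target = local_path(folder, link["href"])
        if os.path.exists(target):
            continue  # already saved on an earlier run
        pdf = requests.get(pdf_url(link["href"]), timeout=60)
        pdf.raise_for_status()  # stop rather than save an error page as a PDF
        with open(target, "wb") as f:
            f.write(pdf.content)
```

Calling download_all() with no arguments saves into an "articles" folder, like the script above; pass a different folder name to store the files elsewhere. Note that this only covers the links on the first listing page.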


 