
Python/Java script to download all .pdf files from a website

I was wondering if it is possible to write a script that programmatically goes through a webpage and downloads all linked .pdf files automatically. Before I attempt this on my own, I want to know whether it is possible.

Regards

Yes, it's possible. For downloading PDF files you don't even need to use Beautiful Soup or Scrapy.

Downloading from Python is very straightforward: build a list of all the PDF links and download them.

Reference to how to build a list of links: http://www.pythonforbeginners.com/code/regular-expression-re-findall
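As a rough illustration of the re.findall approach, here is a minimal sketch. The function name find_pdf_links, the pattern, and the sample HTML are made up for this example, and the pattern only catches quoted href values that end in .pdf, so links with query strings or unquoted attributes would be missed.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or href='...' where the target ends in .pdf (case-insensitive).
PDF_HREF = re.compile(r'href\s*=\s*["\']([^"\']+\.pdf)["\']', re.IGNORECASE)

def find_pdf_links(html, base_url):
    """Return absolute URLs for every quoted .pdf href found in the HTML text."""
    # urljoin resolves relative links such as "docs/a.pdf" against the page URL.
    return [urljoin(base_url, href) for href in PDF_HREF.findall(html)]

# Hypothetical page content, used only to show the call.
sample = '<a href="docs/a.pdf">A</a> <a href="/b.PDF">B</a> <a href="c.html">C</a>'
print(find_pdf_links(sample, "http://example.com/files/"))
```

A regular expression is fine for simple, well-formed pages; for messier HTML a real parser is usually more reliable.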

If you need to crawl through several linked pages, then one of the frameworks might help. If you are willing to build your own crawler, here is a great tutorial, which by the way is also a good intro to Python: https://www.udacity.com/course/viewer#!/c-cs101

Yes, it's possible.

In Python it is simple; urllib will help you download files from the net. For example:

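import urllib.request
# Python 3: the function is urllib.request.urlretrieve (urllib.urlretrieve in Python 2).
# The second argument is the local file path to write, not a directory.
urllib.request.urlretrieve("http://example.com/helo.pdf", "c:/home/helo.pdf")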

Now you need to write a script that finds the links ending with .pdf.

Example HTML page: Here's a link

You need to download the HTML page and then use an HTML parser or a regular expression.
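The HTML parser route could look something like the sketch below, which uses html.parser from the standard library. The class name PdfLinkParser and the sample markup are assumptions for illustration; in a real script the HTML string would come from the downloaded page (for instance via urllib.request.urlopen) rather than a literal.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a href="..."> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Check only the path part so "file.pdf?v=1" still counts as a PDF link.
        if href and urlparse(href).path.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))

# Hypothetical markup, used only to show the call.
parser = PdfLinkParser("http://example.com/")
parser.feed('<a href="helo.pdf">Here\'s a link</a> <a href="about.html">About</a>')
print(parser.links)
```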

Yes, this is possible. This is called web scraping. For Python, there are various packages to help with this, including scrapy, beautifulsoup, and mechanize, as well as many others.

Yes, it's possible in Python. You can obtain the HTML source code, parse it using BeautifulSoup, and then find all the a (anchor) tags. Next, you can check which links end with the .pdf extension. Once you have a list of all the PDF links, you can download them using

wget.download(link)

or requests

A detailed explanation and full source code can be found here:

https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48

Use urllib to download files. For example:

import urllib.request

# Python 3 spelling; in Python 2 this was urllib.urlretrieve.
urllib.request.urlretrieve("http://...", "file_name.pdf")

Sample script to find links ending with .pdf : https://github.com/laxmanverma/Scripts/blob/master/samplePaperParser/DownloadSamplePapers.py
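Once a list of PDF URLs is available, the download step can be sketched roughly as follows. This is not the linked script; the function name download_all, the default folder "pdfs", and the fallback file name are assumptions made for this example.

```python
import os
import urllib.request
from urllib.parse import urlparse, unquote

def download_all(urls, dest_dir="pdfs"):
    """Download each URL into dest_dir and return the list of saved file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last path segment of the URL as the local file name.
        name = os.path.basename(unquote(urlparse(url).path)) or "download.pdf"
        target = os.path.join(dest_dir, name)
        try:
            urllib.request.urlretrieve(url, target)
        except OSError as exc:  # URLError and HTTPError are subclasses of OSError
            print(f"skipped {url}: {exc}")
            continue
        saved.append(target)
    return saved
```

Calling download_all(links) with the list produced by a link-finding step would save each file under pdfs/. Two links that share a file name would overwrite each other, so a real script may want to make the names unique.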
