I was trying to download all the slides from the following webpage
https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
The command I was using was
wget --no-check-certificate --no-proxy -r -l 3 'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'
I could only download HTML and some PNG files. Those slides are hosted on Amazon S3, but I could not crawl them using the command above. The message shown in the terminal is:
I could, however, download those slides directly using a command below
wget http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf
Does anybody know why? How can I download all the slides on that page with a single command?
What you need to do is called "HTML scraping": you take an HTML page and parse the links it contains. Once the links are extracted, you can download, catalog, etc. the documents they point to. Note that wget's recursive mode (`-r`) does not follow links to other hosts by default, which is why the PDFs hosted on Amazon S3 were skipped even though the HTML and same-host PNGs came through.
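As a minimal sketch of the scraping idea, assuming the course page has already been saved locally (the file name `page.html` and the embedded links here are stand-ins, not the real page contents), the PDF links can be pulled out with standard shell tools:

```shell
# Hypothetical stand-in for the downloaded course page.
cat > page.html <<'EOF'
<a href="http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf">Intro</a>
<a href="logo.png">Logo</a>
EOF

# Extract every href value that ends in .pdf.
grep -o 'href="[^"]*\.pdf"' page.html | sed 's/href="//; s/"$//'
```

Piping that output into `wget -i -` would then fetch each slide; for less regular markup, a proper HTML parser is safer than grep.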
This StackOverflow article is very popular for this topic: