
wget does not download files hosted on Amazon S3

I was trying to download all the slides from the following webpage

https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html

The command I was using was

wget --no-check-certificate --no-proxy -r -l 3 'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'

I could only download the HTML and some PNG files. The slides are hosted on Amazon S3, but I could not crawl them with the command above. The message shown on the terminal is

I could, however, download those slides directly with the command below:

wget http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf

Does anybody know why? How can I download all the slides on that page with a single command?

What you need to do is called "HTML scraping": you take an HTML page and parse the links inside it. After parsing, you can download, catalog, or otherwise process the links found in the document (the web page).
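As a minimal sketch of that idea (not part of the original answer; it assumes the PDF links appear as absolute URLs in the page source, and it uses grep as a crude stand-in for a real HTML parser), you could pull the PDF URLs out of the page and hand them back to wget:

# Fetch the page, extract every absolute URL ending in .pdf,
# de-duplicate, and download each one with wget.
wget -qO- 'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html' \
  | grep -oE 'https?://[^"]+\.pdf' \
  | sort -u \
  | xargs -n 1 wget

For anything beyond a simple page like this one, use a proper HTML parser, as the question linked below discusses.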

This Stack Overflow question is a popular reference on the topic:

Options for HTML scraping?
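As for the "why" in the question: the recursive wget most likely fails because the PDFs are hosted on a different host (spark-public.s3.amazonaws.com) than the page itself (web.stanford.edu), and wget's -r does not follow links onto other hosts unless you pass -H (--span-hosts). A hedged variant of the original command (untested here), restricted to the two relevant domains and to PDF files:

wget -r -l 1 -H -D web.stanford.edu,s3.amazonaws.com -A pdf \
     --no-check-certificate \
     'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'

Note that with -A pdf, wget still fetches the HTML page in order to find the links, then deletes downloaded files that do not match the accept list.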
