
wget does not download files hosted on Amazon S3

I was trying to download all the slides from the following webpage

https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html

The command I was using was

wget --no-check-certificate --no-proxy -r -l 3 'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'

I could only download the HTML and some PNG files. The slides are hosted on Amazon S3, but I could not crawl them with the command above. The message shown on the terminal is

I could, however, download those slides directly with the command below:

wget http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf

Does anybody know why? How can I download all the slides on that page with a single command?

What you need to do is called "HTML scraping": you take an HTML page and parse the links inside it. After parsing, you can download, catalog, or otherwise process the links found in the document (the web page).
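As a minimal sketch of that idea (not part of the original answer; it assumes the PDF links appear as absolute URLs in the page source, and it uses grep as a crude stand-in for a real HTML parser), you could pull the PDF URLs out of the page and hand them back to wget:

# Fetch the page, extract every absolute URL ending in .pdf,
# de-duplicate, and download each one with wget.
wget -qO- 'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html' \
  | grep -oE 'https?://[^"]+\.pdf' \
  | sort -u \
  | xargs -n 1 wget

For anything beyond a simple page like this one, use a proper HTML parser, as the question linked below discusses.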

This Stack Overflow question is a popular reference on the topic:

Options for HTML scraping?
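As for the "why" in the question: the recursive wget most likely fails because the PDFs are hosted on a different host (spark-public.s3.amazonaws.com) than the page itself (web.stanford.edu), and wget's -r does not follow links onto other hosts unless you pass -H (--span-hosts). A hedged variant of the original command (untested here), restricted to the two relevant domains and to PDF files:

wget -r -l 1 -H -D web.stanford.edu,s3.amazonaws.com -A pdf \
     --no-check-certificate \
     'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'

Note that with -A pdf, wget still fetches the HTML page in order to find the links, then deletes downloaded files that do not match the accept list.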
