
wget does not download files on Amazon AWS S3

I was trying to download all the slides from the following webpage:

https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html

The command I was using was:

wget --no-check-certificate --no-proxy -r -l 3 'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'

I could only download the HTML page and some PNG files. Those slides are hosted on Amazon S3, but I could not crawl them using the command above. The message showing on the terminal is:

I could, however, download those slides directly using a command like the one below:

wget http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf

Does anybody know why? How do I download all the slides on that page using a single command?

What you need to do is called "HTML scraping". This means that you take an HTML page and parse the links inside it. After parsing, you can download, catalog, etc. the links found in the document (web page).

This StackOverflow question is a popular reference on the topic:

Options for HTML scraping?
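As a minimal sketch of the scraping approach described above, the snippet below uses only Python's standard library to pull the `.pdf` links out of a page so they can then be downloaded. The function names (`extract_pdf_links`, `LinkExtractor`) are my own, not from any library, and I am assuming the slide links on the page end in `.pdf`:

```python
# Minimal HTML-scraping sketch using only the Python standard library.
# extract_pdf_links() parses an HTML document and returns every <a href>
# target ending in ".pdf", with relative links resolved against the page URL.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    # urljoin leaves absolute URLs (e.g. the S3 ones) untouched
                    # and resolves relative ones against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_pdf_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

To actually fetch the slides you could feed the page source through this and download each result, e.g. with `urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])` in a loop, or pass the list to `wget -i -` on stdin.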
