
optimize web scraping using wget

I am using wget to download a huge list of web pages (around 70,000). I am forced to sleep about 2 seconds between consecutive wget calls. This takes an enormous amount of time, something like 70 days. What I want to do is use proxies so that I can significantly speed up the process. I am using a simple bash script for this. Any suggestions or comments are appreciated.

My first suggestion is to not use Bash or wget. I would use Python and Beautiful Soup. Wget is not really designed for screen scraping.
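A minimal fetch-and-parse sketch of that idea in Python. To keep it self-contained it uses only the standard library; in practice Beautiful Soup (`bs4.BeautifulSoup`) gives a much more convenient API for pulling data out of the HTML. The `fetch` helper and User-Agent header are illustrative assumptions, not from the original post:

```python
import urllib.request
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html: str) -> str:
    """Return the page title, or an empty string if none is found."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page as text; some servers reject urllib's default UA."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With Beautiful Soup installed, `extract_title` collapses to `BeautifulSoup(html, "html.parser").title.get_text()`, and you get the same convenience for any other element.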

Second, look into spreading the load across multiple machines by running a portion of your list on each machine.
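Splitting the list can be sketched like this. The round-robin partitioning and the output file names are assumptions for illustration; any even split would do:

```python
def split_for_machines(urls, n_machines):
    """Partition a URL list round-robin into n roughly equal chunks."""
    chunks = [[] for _ in range(n_machines)]
    for i, url in enumerate(urls):
        chunks[i % n_machines].append(url)
    return chunks

def write_chunks(urls, n_machines):
    """Write one list file per machine (file names are hypothetical)."""
    for i, chunk in enumerate(split_for_machines(urls, n_machines)):
        with open(f"urls_machine_{i}.txt", "w") as f:
            f.write("\n".join(chunk))
```

Ship each `urls_machine_N.txt` to its own machine and run the scraper there; with the 2-second delay kept per machine, 70,000 pages across 10 machines finishes roughly 10 times sooner.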

Since it sounds like bandwidth is your issue, you can easily spin up some cloud instances and run your script on them.
