
optimize web scraping using wget

I am using wget to download a huge list of web pages (around 70,000). I am forced to sleep about 2 seconds between consecutive wget calls. This takes an enormous amount of time, something like 70 days. What I want to do is use proxies so that I can significantly speed up the process. I am using a simple bash script for this. Any suggestions or comments are appreciated.

My first suggestion is to not use Bash or wget. I would use Python and Beautiful Soup. Wget is not really designed for screen scraping.
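A minimal fetch-and-parse sketch of that idea in Python. To keep it self-contained it uses only the standard library; in practice Beautiful Soup (`bs4.BeautifulSoup`) gives a much more convenient API for pulling data out of the HTML. The `fetch` helper and User-Agent header are illustrative assumptions, not from the original post:

```python
import urllib.request
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html: str) -> str:
    """Return the page title, or an empty string if none is found."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page as text; some servers reject urllib's default UA."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With Beautiful Soup installed, `extract_title` collapses to `BeautifulSoup(html, "html.parser").title.get_text()`, and you get the same convenience for any other element.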

Second, look into spreading the load across multiple machines by running a portion of your list on each machine.
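Splitting the list can be sketched like this. The round-robin partitioning and the output file names are assumptions for illustration; any even split would do:

```python
def split_for_machines(urls, n_machines):
    """Partition a URL list round-robin into n roughly equal chunks."""
    chunks = [[] for _ in range(n_machines)]
    for i, url in enumerate(urls):
        chunks[i % n_machines].append(url)
    return chunks

def write_chunks(urls, n_machines):
    """Write one list file per machine (file names are hypothetical)."""
    for i, chunk in enumerate(split_for_machines(urls, n_machines)):
        with open(f"urls_machine_{i}.txt", "w") as f:
            f.write("\n".join(chunk))
```

Ship each `urls_machine_N.txt` to its own machine and run the scraper there; with the 2-second delay kept per machine, 70,000 pages across 10 machines finishes roughly 10 times sooner.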

Since it sounds like bandwidth is your issue, you can easily spin up some cloud instances and run your script on them.
