
Proxy IP for Scrapy framework

Original post: 2013-10-18 09:46:27 5 3 python/ proxy/ scrapy/ tor

I am developing a web crawling project using Python and the Scrapy framework. It crawls approximately 10k web pages from e-commerce shopping websites. The whole project is working fine, but before moving the code from the testing server to the production server I want to choose a better proxy IP provider service, so that I don't have to worry about my IP being blocked or websites denying access to my spiders.

Until now I have been using a middleware in Scrapy to manually rotate IPs taken from the free proxy IP lists available on various websites.
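The question's own middleware is not shown, so the following is only a minimal sketch of what such a rotating-proxy downloader middleware might look like. The class name `RandomProxyMiddleware` and the `PROXY_LIST` setting are hypothetical names chosen for illustration; the one thing it relies on from Scrapy is that a request is routed through whatever proxy URL is stored in `request.meta["proxy"]`.

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("PROXY_LIST must contain at least one proxy URL")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, not a built-in Scrapy one.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy sends the request through the proxy URL stored in meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # let the request continue through the middleware chain
```

To try it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in the project settings (with whatever module path it lives at in your project) and `PROXY_LIST` filled with URLs such as `http://host:port`. Picking a proxy at random is the simplest policy; it does not detect or retire dead proxies, which is the main weakness of free lists.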

Now I am confused about which option I should choose:

Buy a premium proxy list from http://www.ninjasproxy.com/ or http://hidemyass.com/
Use Tor
Use a VPN service like http://www.hotspotshield.com/
Any option better than the three above

3 answers

Here are the options I'm currently using (depending on my needs):

proxymesh.com - reasonably priced for smaller projects. I have never had any issues with the service, as it works out of the box with Scrapy (I'm not affiliated with them).
a self-built script that starts several EC2 micro instances on Amazon. I then SSH into the machines and create a SOCKS proxy connection; those connections are then piped through delegated to create normal HTTP proxies that are usable with Scrapy. The HTTP proxies can either be load-balanced with something like haproxy, or you can build yourself a custom middleware that rotates proxies.

The latter solution is what currently works best for me and pushes around 20-30GB per day of traffic without any problems.
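The script itself is not included in the answer, so this is only a rough sketch of the SOCKS step, assuming the instances are already running and reachable over SSH. The host addresses, the `ec2-user` login, and the starting port 1080 are placeholders. The helper only builds the `ssh -N -D <port>` command lines (OpenSSH's dynamic port forwarding, which opens a local SOCKS proxy); launching them and the later conversion to HTTP proxies are left out.

```python
import shlex


def socks_tunnel_command(host, local_port, user="ec2-user"):
    """Return the argv for an SSH dynamic (SOCKS) port forward to `host`."""
    return [
        "ssh",
        "-N",                    # do not run a remote command, only forward
        "-D", str(local_port),   # open a local SOCKS proxy on this port
        f"{user}@{host}",
    ]


def tunnel_commands(hosts, first_port=1080, user="ec2-user"):
    """Assign one consecutive local port per host, starting at first_port."""
    return [
        socks_tunnel_command(host, first_port + offset, user)
        for offset, host in enumerate(hosts)
    ]


if __name__ == "__main__":
    # Placeholder addresses from the documentation range, not real instances.
    for argv in tunnel_commands(["203.0.113.10", "203.0.113.11"]):
        print(shlex.join(argv))
```

Each resulting local port is one independent exit IP. Scrapy's `meta["proxy"]` expects HTTP proxies, which is why the answer adds a conversion layer in front of the SOCKS tunnels.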

Crawlera is built specifically for web crawling projects. For example, it implements smart algorithms to avoid getting banned and it is used to crawl very large and high profile websites.

Disclaimer: I work for the parent company, Scrapinghub, who are also core developers of Scrapy.

If you don't want to use a paid service please consider just using a scrapy library that will automate rotating proxies for you: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
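As a rough illustration, enabling that library comes down to a few lines in the project's `settings.py`. The setting and middleware names below are as I recall them from the project's README and should be checked against the version you install; the proxy addresses are placeholders.

```python
# settings.py fragment (sketch; verify names against the library's README)

# Placeholder proxies; replace with your own host:port entries.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Rotates requests across the proxies listed above.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    # Marks proxies as dead or alive based on the responses they return.
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```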

You can find a full tutorial on how to automate it here: https://tinyendian.com/articles/how-to-scrape-the-web-and-not-get-caught

Keep in mind that connecting through a proxy always imposes a performance penalty, but the 10K web pages you mentioned are still well within reach.





solution1
8 2013-10-19 09:32:33

solution2
7 2013-10-19 01:07:54

solution3
0 2018-04-24 08:35:30
