I have been given a proxy pool link http://10.10.5.17:5009/proxy_pool that outputs the following:
{
"msg": "success",
"list": [
"111.72.193.250:34621",
"114.99.28.7:25995",
"121.234.245.76:35513",
"220.186.155.66:49366",
"117.90.252.72:45037"
],
"data": "114.99.28.7:25995"
}
These IPs change every few minutes. I'd like to know how to set this up in Scrapy.
I have seen tutorials showing how to add every single IP to settings.py and then call it from middlewares.py, but I cannot do it that way, since I need to read the IPs from the link (and they change rapidly).
import json
import random

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # Fetch the proxy pool first, then schedule the real request
        yield scrapy.Request(url='http://10.10.5.17:5009/proxy_pool',
                             callback=self.prepare_request)

    def prepare_request(self, response):
        target_url = 'XXX'
        proxy_response = json.loads(response.text)  # body_as_unicode() is deprecated
        proxy_list = proxy_response['list']
        # note: this request must be yielded, or Scrapy will never schedule it
        yield scrapy.Request(url=target_url,
                             meta={'proxy': 'http://' + random.choice(proxy_list)},
                             callback=self.scrape)

    def scrape(self, response):
        ...
You'll have to write your own downloader middleware that downloads the proxy list initially, refreshes it every now and then, and assigns a random proxy from the current list to each request.

You should start by reading the documentation on downloader middlewares. Then I recommend you find existing middlewares that deal with proxies (e.g. scrapy-rotating-proxies) and learn from them.
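As a rough starting point, such a middleware could look like the sketch below. It caches the pool and re-downloads it once the cache is older than a refresh interval, then attaches a random proxy to each outgoing request. `PROXY_POOL_URL` and `REFRESH_SECONDS` are assumed names, not Scrapy settings, and the synchronous `urllib` fetch blocks the Twisted reactor briefly; for production you'd fetch the pool asynchronously or just use a maintained package like scrapy-rotating-proxies.

```python
import json
import random
import time
import urllib.request

# Assumed values for illustration; adjust to your setup.
PROXY_POOL_URL = 'http://10.10.5.17:5009/proxy_pool'
REFRESH_SECONDS = 120  # the pool says proxies change every few minutes


class RandomProxyMiddleware:
    """Downloader middleware: keep a cached proxy list, pick one per request."""

    def __init__(self):
        self.proxies = []
        self.last_fetch = 0.0

    def _refresh(self):
        # Re-download the pool only when the cached list is stale or empty.
        if self.proxies and time.time() - self.last_fetch < REFRESH_SECONDS:
            return
        with urllib.request.urlopen(PROXY_POOL_URL) as resp:
            payload = json.loads(resp.read().decode('utf-8'))
        self.proxies = payload['list']
        self.last_fetch = time.time()

    def process_request(self, request, spider):
        self._refresh()
        if self.proxies:
            # Scrapy expects a full URL in request.meta['proxy']
            request.meta['proxy'] = 'http://' + random.choice(self.proxies)
```

You would then enable it in settings.py via `DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomProxyMiddleware': 350}` (path and priority are examples).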