
python - Concurrent HTTP requests over multiple Tor proxies

I have multiple Tor clients running on my computer, each accessible on its own port (currently ports 9050-9054). I want to request a large list of URLs concurrently via these Tor clients, with rate limits so that only one request per N seconds is made on any given Tor port, and so that each port uses a unique exit node (i.e. no two Tor ports are ever simultaneously making requests from the same IP).

The goal is to be able to scrape from websites/APIs that rate-limit by IP, by making the requests appear to be coming from multiple different IPs, each consuming at the maximum rate. Anonymity is not really a goal - the use of Tor is just to create a large pool of IPs to trick the rate-limiters into thinking it's a bunch of different users requesting the data. So for instance if the API limits to one request per second, and I have ten tor clients running that are each requesting at this rate, then I can make requests at 10 times the maximum rate...

Here is the code I have so far, which concurrently gets all of the URLs through my pool of Tor clients:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep
from datetime import datetime 
from threading import Lock
import logging

logging.basicConfig(filename='log.txt', level=logging.DEBUG)

tor_ports = ['9050', '9051', '9052', '9053', '9054']
port_locks = {port: Lock() for port in tor_ports}

delay = 1   # wait N seconds between requests on same port

ua_string = 'Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'

# Create a new HTTP Requests Session routed through one of our Tor proxies
def newTorSession(port):
    assert port in tor_ports
    session = requests.session()
    session.proxies = {'http':  'socks5://127.0.0.1:' + port,
                       'https': 'socks5://127.0.0.1:' + port}
    return session

# Go through the list of all Tor proxies and return one that isn't locked
def getFreeTorPort(hangtime):
    start = datetime.now()
    while (datetime.now() - start).total_seconds() < hangtime:
        for port in tor_ports:
            # Non-blocking acquire checks and takes the lock in one atomic step,
            # so two threads can never end up with the same port
            if port_locks[port].acquire(blocking=False):
                return port
        sleep(0.05) # all ports in use ... wait briefly instead of spinning

    return None # this is when we exceed hangtime ... should be an exception we catch


# URL to fetch, and how long to sleep() after request
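def torGet(url, delay):

    port = getFreeTorPort(60)
    if port is None:
        raise RuntimeError("No free Tor port became available within 60 seconds")
    session = newTorSession(port)

    response = None # returned as-is if the request fails
    try:
        response = session.get(url, headers = {'User-Agent': ua_string})
    except requests.exceptions.RequestException as e:
        logging.warning("Request of URL %s failed with exception: %s", url, e)
    finally:
        sleep(delay) # Pause for `delay` seconds after request
        port_locks[port].release() # always free the port, even on failure

    return response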


# given a list of URLs, use multiple threads with own tor client to GET items
def torGetConcurrent(urls):
    responses = []

    with ThreadPoolExecutor(max_workers=len(tor_ports)) as executor:
        futures = [executor.submit(torGet, url, delay) for url in urls]
        responses = [f.result() for f in as_completed(futures)]

    return responses

My question is: how can I ensure that each Tor client is always using a different exit IP? That is, if I am running 10 Tor clients, I want to always be using 10 unique exit nodes. The way I currently have it set up, I sometimes have multiple clients using the same exit, which causes me to exceed my per-IP rate limits. I know that I can explicitly specify a list of exit nodes in the ExitNodes field in the torrc file for each client, but I am wondering if there is a way to check for this from the Python script, since I don't want to manually update the configuration files with lists of exit nodes, and I don't really care which exit nodes they are using as long as they're all unique.

Thanks!

I wouldn't necessarily say Tor has a "large pool of IPs". There are about 850 exits available these days, some of which may be overloaded and not usable.

In any case, try building a list of fingerprints for all the exits (several websites publish these lists), or the ones in countries you want to use, and set the ExitNodes config for each tor client to a specific fingerprint such that none of them use the same one at the same time. This will be more successful than sending NEWNYM signals to individual clients hoping none of them overlap at the same time and having to run slow checks to see which exit any given client is using.
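As a rough sketch of the selection step: the code below assumes the exit list has already been downloaded as text and that fingerprints appear on lines of the form `ExitNode <fingerprint>`. The helper names `parse_exit_fingerprints` and `assign_unique_exits` are hypothetical, and the sample fingerprints and addresses are made up.

```python
import random

def parse_exit_fingerprints(text):
    """Collect fingerprints from lines of the form 'ExitNode <fingerprint>'."""
    fingerprints = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == 'ExitNode':
            fingerprints.append(parts[1])
    return fingerprints

def assign_unique_exits(ports, fingerprints, rng=random):
    """Map each port to a different fingerprint, chosen at random."""
    unique = sorted(set(fingerprints))
    if len(unique) < len(ports):
        raise ValueError('not enough distinct exits for the number of ports')
    return dict(zip(ports, rng.sample(unique, len(ports))))

# Made-up sample data, for illustration only
sample = """ExitNode 0123456789ABCDEF0123456789ABCDEF01234567
Published 2018-04-01 00:00:00
ExitAddress 192.0.2.1 2018-04-01 00:10:00
ExitNode 1123456789ABCDEF0123456789ABCDEF01234567
ExitAddress 192.0.2.2 2018-04-01 00:10:00
ExitNode 2123456789ABCDEF0123456789ABCDEF01234567
ExitAddress 192.0.2.3 2018-04-01 00:10:00
"""

fingerprints = parse_exit_fingerprints(sample)
assignment = assign_unique_exits(['9050', '9051'], fingerprints)
```

Because `random.sample` draws without replacement, no two ports in `assignment` can receive the same fingerprint.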

EDIT:

To do what I describe, grab a list of exits (e.g. https://check.torproject.org/exit-addresses or https://torstatus.blutmagie.de/ ) and load them into a data structure of your choosing so you can pick a unique set of fingerprints at random, then use stem to connect to the control port of each instance. Once connected, set the config value ExitNodes for each instance to one of the fingerprints. ExitNodes can be a country, a list of nodes, or a single node. When it is set to a single node, you're basically forcing the client to use that relay as an exit. This ensures no two clients will use the same exit relay at the same time. Once you're ready to cycle them, set ExitNodes to a new fingerprint and send SIGNAL NEWNYM to build new circuits.
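A minimal sketch of that step with stem might look like the following. It assumes each Tor instance has a ControlPort enabled; the control port numbers in the usage comment are made up (the question only lists SOCKS ports), and `pin_exit` / `pin_all` are hypothetical helper names. stem is imported inside the function so the script still loads on a machine without it.

```python
def pin_exit(control_port, fingerprint, password=None):
    """Pin one Tor instance to a single exit relay and rebuild its circuits."""
    # Imported lazily so the rest of the script loads without stem installed
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate(password=password)
        controller.set_conf('ExitNodes', fingerprint)
        controller.signal(Signal.NEWNYM)

def pin_all(exit_by_control_port, pin=pin_exit):
    """Apply a {control_port: fingerprint} mapping, refusing duplicates."""
    fingerprints = list(exit_by_control_port.values())
    if len(set(fingerprints)) != len(fingerprints):
        raise ValueError('each instance needs its own exit fingerprint')
    for control_port, fingerprint in exit_by_control_port.items():
        pin(control_port, fingerprint)

# Example call (control ports and fingerprints here are assumptions):
# pin_all({9061: '0123456789ABCDEF0123456789ABCDEF01234567',
#          9062: '1123456789ABCDEF0123456789ABCDEF01234567'})
```

To cycle exits later, call `pin_all` again with a fresh mapping; each call sets the new ExitNodes value and then sends NEWNYM.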

Doing this is likely faster than having to check each instance and force a new IP if any of them are the same. And then there's no chance of one of the instances building a new circuit between sessions and using a duplicate IP without knowing.
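For comparison, the slower check-each-instance approach could be sketched roughly as below. It assumes an external service that echoes the caller's IP as plain text; `https://api.ipify.org` is used here only as an assumed example, and `fetch_exit_ip` / `find_shared_exits` are hypothetical helper names.

```python
from collections import defaultdict

def fetch_exit_ip(socks_port, echo_url='https://api.ipify.org', timeout=30):
    """Ask an external IP-echo service which address a Tor port exits from."""
    import requests  # SOCKS support needs: pip install requests[socks]
    proxy = 'socks5://127.0.0.1:' + str(socks_port)
    response = requests.get(echo_url,
                            proxies={'http': proxy, 'https': proxy},
                            timeout=timeout)
    response.raise_for_status()
    return response.text.strip()

def find_shared_exits(ip_by_port):
    """Return {ip: [ports]} for every exit IP used by more than one port."""
    ports_by_ip = defaultdict(list)
    for port, ip in ip_by_port.items():
        ports_by_ip[ip].append(port)
    return {ip: ports for ip, ports in ports_by_ip.items() if len(ports) > 1}

# Usage sketch (performs real network requests through each Tor port):
# ip_by_port = {port: fetch_exit_ip(port) for port in ['9050', '9051']}
# clashes = find_shared_exits(ip_by_port)
```

Every port listed in `clashes` would then need a NEWNYM and a re-check, which is the extra round trip the pinned-fingerprint approach avoids.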

Currently, there is no way to use Python or any other language to get the exit IP or fingerprint without checking the IP on an external site. The closest you can usually come is using the control port to see a list of active circuits, extracting the exit fingerprint from that, and finding its IP from a directory status request. Since Tor can have multiple circuits open at once, you can't tell which one your script might be using either.
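The control-port lookup described above could be sketched with stem along these lines. It assumes a reachable ControlPort, and `exit_fingerprints` / `current_exit_addresses` are hypothetical helper names. As noted, the result lists every built circuit, not the one a particular request will use.

```python
def exit_fingerprints(circuits):
    """Return the last-hop fingerprint of every fully built circuit."""
    exits = []
    for circ in circuits:
        # Each path entry is a (fingerprint, nickname) pair; the last hop is
        # the exit only for circuits that actually leave the Tor network
        if circ.status == 'BUILT' and circ.path:
            fingerprint, _nickname = circ.path[-1]
            exits.append(fingerprint)
    return exits

def current_exit_addresses(control_port):
    """Map last-hop fingerprints of built circuits to relay IP addresses."""
    from stem.control import Controller  # imported lazily; requires stem

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate()
        addresses = {}
        for fingerprint in exit_fingerprints(controller.get_circuits()):
            # Directory status lookup for the relay behind this fingerprint
            status = controller.get_network_status(fingerprint)
            addresses[fingerprint] = status.address
        return addresses
```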
