Rotating IP with selenium and Tor

I have a Selenium setup for capturing a specific HTTP request. The request is sent only when I click a specific React element on a website, which is why I'm using Selenium; I couldn't find another way.

I must renew my IP each time I want to capture this HTTP request. To achieve this I use Tor. When I start my Python script it works very well: Tor sets a new IP and the script scrapes what I want. I have added a try/except to the script so that, if it doesn't work the first time, it retries up to 10 times.

The problem is that when my script retries, the IP no longer rotates.

How can I achieve this?



import time
from random import randint
from time import sleep
import os
import subprocess
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from seleniumwire import webdriver
from selenium.webdriver.firefox.options import Options
from fake_useragent import UserAgent



options_wire = {
    'proxy': {
        'http': 'http://localhost:8088',
        'https': 'https://localhost:8088',
        'no_proxy': ''
    }
}

def firefox_init():
    os.system("killall tor")
    time.sleep(1)
    ua = UserAgent()
    user_agent = ua.random
    subprocess.Popen(("tor --HTTPTunnelPort 8088"),shell=True)
    time.sleep(2)
    return user_agent


def profile_firefox():
    profile = FirefoxProfile()
    profile.set_preference('permissions.default.image', 2)
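    # Boolean preferences take a Python bool, not the string 'false'.
    profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)
    profile.set_preference("general.useragent.override", firefox_init())
    # The Firefox preference is browser.privatebrowsing.autostart.
    profile.set_preference("browser.privatebrowsing.autostart", True)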
    profile.update_preferences()
    return profile



def options_firefox():
    options = Options()
    options.headless = False
    return options


def firefox_closing(driver):
    driver.quit()
    time.sleep(3)
    os.system('killall tor')
      


def headless(url):
    # Defaults, so the final return cannot fail if all 10 attempts fail.
    header, payload = None, None
    for x in range(0, 10):
        profile = profile_firefox()
        options = options_firefox()
        driver = webdriver.Firefox(seleniumwire_options=options_wire,firefox_profile=profile, options=options, executable_path='******/headless_browser/geckodriver')
        driver.set_window_position(0, 0)
        driver.set_window_size(randint(1024, 2060), randint(1024, 4100))
        # time.sleep(randint(3,10))
        driver.get(url)
        time.sleep(randint(3,8))
        try:
            if driver.find_element_by_xpath("//*[@id=\"*******\"]/main/div/div/div[1]/div[2]/form/div/div[2]/div[1]/button"):
                # driver.find_element_by_xpath("//*[@id=\"*******\"]/main/div/div/div[1]/div[2]/form/div/div[2]/div[1]/button").click()
                # time.sleep(randint(8,10))
                driver.find_element_by_xpath("//*[@id=\"*******\"]/main/div/div/div[1]/div[2]/form/div/div[2]/div[1]/button").click()
                time.sleep(randint(3,6))
                for request in driver.requests:
                    if request.path == "https://api.*********.***/*******/*********":
                        # request.body is bytes: decode it rather than slicing its repr.
                        payload = request.body.decode('utf-8', errors='replace')
                        if payload:
                            header = request.headers
                            print(header, payload)
                            break
                else:
                    continue
                break
    
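        except Exception:
            # Something failed on this attempt: wait, then retry with a fresh Tor process.
            time.sleep(5)
        finally:
            # Runs on success, failure and break, so the browser and Tor are closed exactly once.
            firefox_closing(driver)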

            
    return header, payload


url="https://check.torproject.org/?lang=fr"
headless(url)

Thank you

I can't tell why the IP address isn't being renewed, since you kill the tor process on every attempt. Even if you ran Tor as a systemd service, restarting the service would certainly renew it. But I can give you some directions:

  • In the fake_useragent module, try disabling the cache, so that it neither caches in the /tmp directory nor uses the hosted cache server:

    ua = UserAgent(cache=False, use_cache_server=False)

  • Run Tor as a systemd service and avoid os.system(): it is not secure, and passing system commands straight from your script opens it to many flaws. With a service file, you can simply restart the service to renew your IP address. You might want to use the Arch Linux Wiki reference to configure your own Tor environment.
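
The second suggestion could be sketched roughly as follows. This is only an illustration: it assumes Tor is installed as a systemd unit named tor and that the script is allowed to restart it (usually root privileges are needed). The helper names, the runner parameter and the settle delay are hypothetical, not part of any library.

```python
import subprocess
import time


def tor_restart_command(service="tor"):
    """Build the argument list; no shell is involved, so nothing gets interpolated."""
    return ["systemctl", "restart", service]


def restart_tor(service="tor", settle_seconds=5, runner=subprocess.run):
    """Restart the Tor service, then wait so it has time to build new circuits.

    The unit name and the delay are assumptions; `runner` can be swapped
    for a stub to do a dry run without touching systemd.
    """
    runner(tor_restart_command(service), check=True)  # raises if systemctl fails
    time.sleep(settle_seconds)
```

Calling restart_tor() at the start of each retry would take the place of the killall/Popen pair in firefox_init().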

To achieve this, I switched to another proxy. selenium-wire is very good, but it needs to be fixed.

I used Browsermob Proxy and set an upstream proxy for it to work with. The result is that you can capture and parse every HTTP request or response, the IP rotates every time, and it still uses Tor's HTTPTunnelPort configuration.

    proxy_params = {'httpProxy': 'localhost:8088', 'httpsProxy': 'localhost:8088'}
    proxy_b = server.create_proxy(params=proxy_params)
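
For context, the two lines above might sit inside something like the sketch below. It is not the exact code used here: it assumes the browsermob-proxy Python package and a local Browsermob Proxy binary, and server_path, url_prefix, drive_browser and the helper names are placeholders introduced for illustration.

```python
def upstream_params(host="localhost", port=8088):
    """Proxy parameters pointing Browsermob at Tor's HTTPTunnelPort."""
    address = f"{host}:{port}"
    return {"httpProxy": address, "httpsProxy": address}


def matching_requests(har, url_prefix):
    """Return the HAR request entries whose URL starts with url_prefix."""
    entries = har.get("log", {}).get("entries", [])
    return [e["request"] for e in entries
            if e["request"]["url"].startswith(url_prefix)]


def capture(server_path, url_prefix, drive_browser):
    """Record a HAR while drive_browser(proxy) runs and return the matching requests.

    All three arguments are placeholders: the path to the Browsermob binary,
    the API URL to look for, and a callable that opens the browser through
    the proxy and clicks the button.
    """
    from browsermobproxy import Server  # third-party package: browsermob-proxy

    server = Server(server_path)
    server.start()
    try:
        proxy = server.create_proxy(params=upstream_params())
        proxy.new_har("capture", options={"captureHeaders": True,
                                          "captureContent": True})
        drive_browser(proxy)
        return matching_requests(proxy.har, url_prefix)
    finally:
        server.stop()  # always shut the proxy server down
```

Each returned entry is a plain HAR request dictionary, so its headers and any posted data can be read from it directly.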

Thanks
