
How to make urllib2 requests through Tor in Python?

I'm trying to crawl websites using a crawler written in Python. I want to integrate Tor with Python, meaning I want to crawl sites anonymously through Tor.

I tried doing this, but it doesn't seem to work: when I check my IP (via Python), it is still the same as before I used Tor.

import urllib2
proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)

You are trying to connect to a SOCKS port, and Tor rejects any non-SOCKS traffic. You can connect through a middleman instead: Privoxy, listening on port 8118.

Example:

proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support) 
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open('http://www.google.com').read()

Also note the value passed to ProxyHandler: there is no http:// prefixing the ip:port.
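As an aside, the same pattern carries over to Python 3's urllib.request, where ProxyHandler lives today; a minimal sketch, assuming Privoxy is listening on 8118 as above:

```python
import urllib.request

# The dict key is the URL scheme ('http'), not 'tcp'; the value is the
# proxy address itself, with no scheme prefix required.
proxy_handler = urllib.request.ProxyHandler({'http': '127.0.0.1:8118'})
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

# opener.open('http://www.google.com').read() would now route via Privoxy
```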

pip install PySocks

Then:

import socket
import socks
import urllib2

ipcheck_url = 'http://checkip.amazonaws.com/'

# Actual IP.
print(urllib2.urlopen(ipcheck_url).read())

# Tor IP.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket
print(urllib2.urlopen(ipcheck_url).read())

Using just urllib2.ProxyHandler as in https://stackoverflow.com/a/2015649/895245 fails with:

Tor is not an HTTP Proxy

Mentioned at: How can I use a SOCKS 4/5 proxy with urllib2?

Tested on Ubuntu 15.10, Tor 0.2.6.10, Python 2.7.10.

The following code works on Python 3.4.

(You need to keep the Tor Browser open while using this code.)

This script connects to Tor through SOCKS5, gets the IP from checkip.dyn.com, changes identity, and resends the request to get the new IP (loops 10 times).

You need to install the appropriate libraries to get this working. (Enjoy, and don't abuse.)

import socks
import socket
import time
from stem.control import Controller
from stem import Signal
import requests
from bs4 import BeautifulSoup
err = 0
counter = 0
url = "http://checkip.dyn.com"
with Controller.from_port(port = 9151) as controller:
    try:
        controller.authenticate()
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
        socket.socket = socks.socksocket
        while counter < 10:
            r = requests.get(url)
            soup = BeautifulSoup(r.content, "html.parser")
            print(soup.find("body").text)
            counter = counter + 1
            #wait till next identity will be available
            controller.signal(Signal.NEWNYM)
            time.sleep(controller.get_newnym_wait())
    except requests.HTTPError:
        print("Could not reach URL")
        err = err + 1
print("Used " + str(counter) + " IPs and got " + str(err) + " errors")

Using Privoxy as an HTTP proxy in front of Tor works for me; here's a crawler template:


import urllib2
import httplib

from BeautifulSoup import BeautifulSoup
from time import sleep

class Scraper(object):
    def __init__(self, proxy="http://localhost:8118/"):
        self._open = self._get_opener(proxy)

    def _get_opener(self, proxy):
        proxy_handler = urllib2.ProxyHandler({'http': proxy})
        opener = urllib2.build_opener(proxy_handler)
        return opener.open

    def get_soup(self, url):
        soup = None
        while soup is None:
            try:
                request = urllib2.Request(url)
                request.add_header('User-Agent', 'foo bar useragent')
                soup = BeautifulSoup(self._open(request))
            except (httplib.IncompleteRead, httplib.BadStatusLine,
                    urllib2.HTTPError, ValueError, urllib2.URLError), err:
                sleep(1)
        return soup

class PageType(Scraper):
    _URL_TEMPL = "http://foobar.com/baz/%s"

    def items_from_page(self, url):
        nextpage = None
        soup = self.get_soup(url)

        items = []
        for item in soup.findAll("foo"):
            items.append(item["bar"])
            nextpage = item["href"]

        return nextpage, items

    def get_items(self):
        nextpage, items = self.items_from_page(self._URL_TEMPL % "start.html")
        while nextpage is not None:
            nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
            items.extend(newitems)
        return items

pt = PageType()
print pt.get_items()

Update: recent versions (v2.10.0 and up) of the requests library support SOCKS proxies, with the additional requirement of requests[socks].

Installation:

pip install requests requests[socks]

Basic usage:

import requests
session = requests.session()
# Tor uses the 9050 port as the default socks port
session.proxies = {'http':  'socks5://127.0.0.1:9050',
                   'https': 'socks5://127.0.0.1:9050'}

# Make a request through the Tor connection
# IP visible through Tor
print session.get("http://httpbin.org/ip").text
# Above should print an IP different than your public IP

# Following prints your normal public IP
print requests.get("http://httpbin.org/ip").text
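One caveat worth noting: with a socks5:// proxy URL, DNS lookups still happen locally, which can leak the hostnames you visit. requests[socks] also accepts the socks5h:// scheme, which pushes name resolution through the proxy (Tor) as well; a sketch under that assumption:

```python
import requests

session = requests.session()
# socks5h:// asks the proxy (Tor) to resolve hostnames too, avoiding
# local DNS leaks; plain socks5:// resolves them on your machine
session.proxies = {'http':  'socks5h://127.0.0.1:9050',
                   'https': 'socks5h://127.0.0.1:9050'}

# session.get("http://httpbin.org/ip").text  # requires Tor to be running
```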

Old answer: even though this is an old post, answering because no one seems to have mentioned the requesocks library.

It is basically a port of the requests library. Please note that it is an old fork (last updated 2013-03-25) and may not have the same functionality as the latest requests library.

Installation:

pip install requesocks

Basic usage:

# Assuming that Tor is up & running
import requesocks
session = requesocks.session()
# Tor uses the 9050 port as the default socks port
session.proxies = {'http':  'socks5://127.0.0.1:9050',
                   'https': 'socks5://127.0.0.1:9050'}
# Make a request through the Tor connection
# IP visible through Tor
print session.get("http://httpbin.org/ip").text
# Above should print an IP different than your public IP
# Following prints your normal public IP
import requests
print requests.get("http://httpbin.org/ip").text

The following solutions work for me in Python 3. Adapted from CiroSantilli's answer:

With urllib (the name of urllib2 in Python 3):

import socks
import socket
from urllib.request import urlopen

url = 'http://icanhazip.com/'

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket

response = urlopen(url)
print(response.read())

With requests:

import socks
import socket
import requests

url = 'http://icanhazip.com/'

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket

response = requests.get(url)
print(response.text)

With Selenium + PhantomJS:

from selenium import webdriver

url = 'http://icanhazip.com/'

service_args = [ '--proxy=localhost:9150', '--proxy-type=socks5', ]
phantomjs_path = '/your/path/to/phantomjs'

driver = webdriver.PhantomJS(
    executable_path=phantomjs_path, 
    service_args=service_args)

driver.get(url)
print(driver.page_source)
driver.close()

Note: if you are planning to use Tor often, consider making a donation to support their awesome work!

Here is code for downloading files through a Tor proxy (via Privoxy on port 8118) in Python (update the URL as needed):

import urllib2

url = "http://www.disneypicture.net/data/media/17/Donald_Duck2.gif"

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break

    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()
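For reference, a rough Python 3 port of the same loop (urllib2 became urllib.request, and the proxy installation works the same way via urllib.request.install_opener); the progress format is kept from the original:

```python
import urllib.request

def download(url, file_name=None, block_sz=8192):
    """Stream url to disk in block_sz chunks, printing progress
    when the server reports a Content-Length."""
    if file_name is None:
        file_name = url.split('/')[-1]
    with urllib.request.urlopen(url) as u, open(file_name, 'wb') as f:
        length = u.headers.get('Content-Length')
        file_size = int(length) if length else None
        file_size_dl = 0
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            if file_size:
                print("\r%10d  [%3.2f%%]" % (file_size_dl,
                                             file_size_dl * 100.0 / file_size),
                      end="", flush=True)
    print()
    return file_name
```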

Perhaps you're having some network connectivity issues? The above script worked for me (I substituted a different URL, http://stackoverflow.com/, and I got the page as expected):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" >
 <html> <head>

<title>Stack Overflow</title>        
<link rel="stylesheet" href="/content/all.css?v=3856">

(etc.)

Tor is a SOCKS proxy. Connecting to it directly with the example you cite fails with "urlopen error Tunnel connection failed: 501 Tor is not an HTTP Proxy". As others have mentioned, you can get around this with Privoxy.

Alternatively, you can use PycURL or SocksiPy. For examples of using both with Tor, see:

https://stem.torproject.org/tutorials/to_russia_with_love.html

You can use torify.

Run your program with:

torify python your_program.py

Thought I would just share a solution that worked for me (Python 3, Windows 10):

Step 1: Enable your Tor ControlPort at 9151.

The Tor service runs on default port 9150, with the ControlPort on 9151. You should be able to see the local addresses 127.0.0.1:9150 and 127.0.0.1:9151 when you run netstat -an.

[go to windows terminal]
cd ...\Tor Browser\Browser\TorBrowser\Tor
tor --service remove
tor --service install -options ControlPort 9151
netstat -an 

Step 2: Python script as follows.

# library to launch and kill Tor process
import os
import subprocess

# library for Tor connection
import socket
import socks
import http.client
import time
import requests
from stem import Signal
from stem.control import Controller

# library for scraping
import csv
import urllib
from bs4 import BeautifulSoup

def launchTor():
    # start Tor (wait 30 sec for Tor to load)
    sproc = subprocess.Popen(r'.../Tor Browser/Browser/firefox.exe')
    time.sleep(30)
    return sproc

def killTor(sproc):
    sproc.kill()

def connectTor():
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
    socket.socket = socks.socksocket
    print("Connected to Tor")

def set_new_ip():
    # disable socks server and enabling again
    socks.setdefaultproxy()
    """Change IP using TOR"""
    with Controller.from_port(port=9151) as controller:
        controller.authenticate()
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
        socket.socket = socks.socksocket
        controller.signal(Signal.NEWNYM)

def checkIP():
    conn = http.client.HTTPConnection("icanhazip.com")
    conn.request("GET", "/")
    time.sleep(3)
    response = conn.getresponse()
    print('current ip address :', response.read())

# Launch Tor and connect to Tor network
sproc = launchTor()
connectTor()

# list of url to scrape
url_list = [...]  # list of all the URLs you want to scrape

for url in url_list:
    # set new ip and check ip before scraping for each new url
    set_new_ip()
    # allow some time for IP address to refresh
    time.sleep(5)
    checkIP()

    '''
    [insert your scraping code here: bs4, urllib, your usual thingy]
    '''

# remember to kill process 
killTor(sproc)

The script above will renew the IP address for every URL that you want to scrape. Just make sure to sleep long enough for the IP to change. Last tested yesterday. Hope this helps!

To expand on the above comment about using torify and the Tor browser (this doesn't need Privoxy):

pip install PySocks
pip install pyTorify

(install the Tor browser and start it up)

Command line usage:

python -mtorify -p 127.0.0.1:9150 your_script.py

Or built into a script:

import torify
torify.set_tor_proxy("127.0.0.1", 9150)
torify.disable_tor_check()
torify.use_tor_proxy()

# use urllib as normal
import urllib.request
req = urllib.request.Request("http://....")
req.add_header("Referer", "http://...") # etc
res = urllib.request.urlopen(req)
html = res.read().decode("utf-8")

Note: the Tor browser uses port 9150, not 9050.
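A quick way to check which of the two is actually listening on your machine; a small stdlib sketch (ports assumed: 9050 for a system tor daemon, 9150 for the Tor Browser's bundled tor):

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in (9050, 9150):
    state = "open" if port_open("127.0.0.1", port) else "closed"
    print("127.0.0.1:%d is %s" % (port, state))
```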
