
Python urllib2 with Tor proxy throws `HTTP Error 403`

I am trying to parse a web page using this solution, like the following:

from bs4 import BeautifulSoup as bs
import re
import time
import random

# ---------------------- Tor / SOCKS setup ----------------------
import socks
import socket

# Route all sockets through the local Tor SOCKS proxy (SOCKS4 also works)
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket

# Return the hostname unresolved so that DNS lookups also go through Tor
def getaddrinfo(*args):
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, '', (args[0], args[1]))]

socket.getaddrinfo = getaddrinfo
# ----------------------------------------------------------------

import urllib2


# define urls
start_url = 'http://www.exmple.com'

# get web page (request_header() is defined further down)
hdr = request_header()
req = urllib2.Request(start_url)
for key, value in hdr.items():
    req.add_header(key, value)

page = urllib2.urlopen(req)
soup = bs(page.read(), 'lxml')

But I am getting this error:

Traceback (most recent call last):
  File "soupParse.py", line 159, in <module>
    all_r = main()
  File "soupParse.py", line 35, in main
    page = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

Here is the header function:

# create random request header
def request_header():
    # change default User-Agent of the request
    user_agent = ['Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
    'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
    'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0',
    'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/29.0',
    'Mozilla/5.0 (X11; OpenBSD amd64; rv:28.0) Gecko/20100101 Firefox/28.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101  Firefox/28.0',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',]

    ua = random.choice(user_agent)
    hdr = {'User-Agent': ua,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'}

    return hdr

I am not very familiar with this topic, so it is difficult for me to understand the problem. Please help. Thank you.

UPDATE

I was able to determine that this error only occurs with urllib2. If I use Requests, for example, there is no error. I did not post this as an answer, since I do not know why the problem exists. If somebody knows, I would be glad to hear it.
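
For reference, here is a rough sketch of what the Requests route can look like; it is only an illustration, assuming Tor is listening on 127.0.0.1:9050, PySocks is installed (pip install requests[socks], requests 2.10+), and start_url, request_header() and bs come from the code above. The socks5h scheme makes DNS resolution happen through Tor as well:

import requests

# Route both schemes through the local Tor SOCKS proxy;
# socks5h (rather than socks5) resolves hostnames on the proxy side
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}

resp = requests.get(start_url, headers=request_header(), proxies=proxies, timeout=30)
resp.raise_for_status()
soup = bs(resp.text, 'lxml')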

Good luck and happy scraping!

I'd highly recommend firing up Wireshark and making sure that your requests are being proxied the way you think they are.
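
If you want a lighter-weight sanity check than Wireshark, one option is to ask the Tor Project's check service what it sees. This is just a sketch; the check.torproject.org endpoint and its JSON fields are my assumptions, not something from the question:

import json
import urllib2   # import after the socks/socket patch so it uses the proxied socket

# check.torproject.org reports whether the request arrived through a Tor exit node
response = urllib2.urlopen('https://check.torproject.org/api/ip')
info = json.loads(response.read())
print('Via Tor: %s, exit IP: %s' % (info['IsTor'], info['IP']))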

BeautifulSoup may be the culprit here, as it would logically be importing the socket module first, so try making your imports the following:

import socks    # Import this first no matter what
import socket    
import re
import time
import random
from bs4 import BeautifulSoup as bs 
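
Putting that import order together with the SOCKS patch from the question, a minimal end-to-end sketch could look like this; Tor on 127.0.0.1:9050, the lxml parser, and the request_header() helper are all carried over from the question as assumptions:

import socks    # Import this first no matter what
import socket

# Patch sockets and DNS resolution before anything else touches the network
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket

def getaddrinfo(*args):
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, '', (args[0], args[1]))]

socket.getaddrinfo = getaddrinfo

import urllib2
from bs4 import BeautifulSoup as bs

# request_header() is the helper defined in the question
req = urllib2.Request('http://www.exmple.com', headers=request_header())
soup = bs(urllib2.urlopen(req).read(), 'lxml')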
