How can I make sure a BS4 request is being made with a socket on a list?

I have a list of proxies like this that I want to use while crawling with Python:

proxies_ls = [  '149.56.89.166:3128',
            '194.44.176.116:8080',
            '14.203.99.67:8080',
            '185.87.65.204:63909',
            '103.206.161.234:63909',
            '110.78.177.100:65103']

and I made a function, called crawlSite(url), to crawl a URL using bs4 and the requests module. Here is the code:

# Libraries for crawling and regex
from bs4 import BeautifulSoup
import requests
from fake_useragent import UserAgent
import re

#Libraries for dates
import datetime
from time import gmtime, strftime

#Libraries for writing logs
import os
import errno

#Libraries for random delays
import time
import random

print('BOT started: '+ datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S'))

proxies_ls = [  '149.56.89.166:3128',
            '194.44.176.116:8080',
            '14.203.99.67:8080',
            '185.87.65.204:63909',
            '103.206.161.234:63909',
            '110.78.177.100:65103']

def crawlSite(url):
    #Chrome emulation
    ua=UserAgent()
    header={'user-agent':ua.chrome}
    random.shuffle(proxies_ls)

    #Random delay
    print('before the delay: '+ datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S'))
    tempoRandom=random.randint(1,5)
    time.sleep(tempoRandom)

    try:
        randProxy=random.choice(proxies_ls)
        # Getting the webpage, creating a Response object emulated with chrome with a 30sec timeout.
        response = requests.get(url,proxies = {'https':randProxy},headers=header,timeout=30)
        print(response)
        print('Response received: '+ datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S'))

        #Avoid HTTP request errors
        if response.status_code == 404:
            raise ConnectionError("HTTP Response [404] - The requested resource could not be found")
        elif response.status_code == 409:            
            raise ConnectionError("HTTP Response [409] - Possible Cloudflare DNS resolution error")
        elif response.status_code == 403:
            raise ConnectionError("HTTP Response [403] - Permission denied error")
        elif response.status_code == 503:
            raise ConnectionError("HTTP Response [503] - Service unavailable error")
        print('RR Status {}'.format(response.status_code))
        # Extracting the source code of the page.
        data = response.text

    except ConnectionError:
        try:
            proxies_ls.remove(randProxy)
        except ValueError:
            pass
        randProxy=random.choice(proxies_ls)

    return BeautifulSoup(data, 'lxml')

What I want to do is make sure that only proxies from that list are used. The random part

 randProxy=random.choice(proxies_ls)

works fine, but the part that checks whether the proxy is valid does not. Mainly because I still get 200 responses with "fake proxies".

If I shorten the list to:

proxies_ls = ['149.56.89.166:3128']

with a proxy that does not work, I still get 200 as the response! (I tested the proxy with a proxy checker like https://pt.infobyip.com/proxychecker.php and it does not work...)

So my questions are (I'll enumerate them to make it easier): a) Why am I getting a 200 response instead of a 4xx one? b) How can I force the request to use a proxy from the list, as intended?

Thanks,

Eunito.

Reading through the documentation carefully, you have to specify the following in the dictionary:

http://docs.python-requests.org/en/master/user/advanced/#proxies

  • which protocol the proxy should be used for
  • which protocol the proxy itself uses
  • the address and port of the proxy

A "working" dictionary should therefore look like this:

proxies = {
    'https': 'socks5://localhost:9050'
}

This will proxy all https requests, and only https requests. That means it will not proxy http traffic.

So, to proxy all web traffic, you should configure the dict like this:

proxies = {
    'https': 'socks5://localhost:9050',
    'http':  'socks5://localhost:9050'
}
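Note that the socks5:// scheme in these examples requires the SOCKS extra of requests (installed with pip install requests[socks]); for plain HTTP proxies like the ones in the question's list, an http:// prefix is used instead.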

And, of course, substitute the IP address and port where necessary. As for what happens otherwise, see the following example:

$ python
>>> import requests
>>> proxies = {'https':'http://149.58.89.166:3128'}
>>> # Get a HTTP page (this goes around the proxy)
>>> response = requests.get("http://www.example.com/",proxies=proxies)
>>> response.status_code
200
>>> # Get a HTTPS page (so it goes through the proxy)
>>> response = requests.get("https://www.example.com/", proxies=proxies)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 485, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.example.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d1f448c10>: Failed to establish a new connection: [Errno 110] Connection timed out',)))
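As a quick positive check (a minimal sketch of my own, not part of the original answer), you can also point the proxied request at an IP-echo endpoint such as https://httpbin.org/ip; if the proxy is really in the path, the echoed origin IP is the proxy's address rather than your own:

import requests

# A hypothetical working proxy; substitute one from proxies_ls.
proxies = {'https': 'http://149.56.89.166:3128'}

# https URL, so the 'https' entry of the dict applies.
response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=30)
print(response.json())  # expected: {'origin': '<the proxy IP>'}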

So basically, if I understand your question correctly, you just want to check whether a proxy is valid. requests has an exception for exactly that, and you can do something like this:

from requests.exceptions import ProxyError

try:
    response = requests.get(url, proxies={'https': randProxy}, headers=header, timeout=30)
except ProxyError:
    pass  # the proxy is invalid; handle it here (e.g. remove it from the list)
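Putting the pieces together, here is a minimal sketch (my own, assuming the proxies_ls list from the question and https://httpbin.org/ip as a test URL) that filters the list down to proxies that actually complete an HTTPS request:

import requests
from requests.exceptions import RequestException

proxies_ls = ['149.56.89.166:3128',
              '194.44.176.116:8080']  # sample entries from the question

def working_proxies(proxy_list, test_url='https://httpbin.org/ip'):
    """Return the subset of proxy_list that completes an HTTPS request."""
    good = []
    for proxy in proxy_list:
        try:
            requests.get(test_url,
                         proxies={'https': 'http://' + proxy},
                         timeout=10)
            good.append(proxy)
        except RequestException:  # ProxyError, timeouts, connection errors
            pass  # dead proxy: leave it out of the result
    return good

print(working_proxies(proxies_ls))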
