Python urllib3 and proxy

I am trying to figure out how to use a proxy together with multithreading.

This code works:

requester = urllib3.PoolManager(maxsize=10, headers=self.headers)
thread_pool = workerpool.WorkerPool()

thread_pool.map(grab_wrapper, [item['link'] for item in products])

thread_pool.shutdown()
thread_pool.wait()

Then, in grab_wrapper:

requested_page = requester.request('GET', url, assert_same_host=False, headers=self.headers)

The headers consist of: Accept, Accept-Charset, Accept-Encoding, Accept-Language, and User-Agent.
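
For illustration, such a headers mapping might look like the sketch below; the values are placeholders, not the ones actually used in the question:

# Illustrative only: placeholder header values
headers = {
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Charset': 'utf-8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (compatible; example-crawler)',
}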

But this does not work in production, since requests have to go through a proxy (no authorization is required).

I tried different things (passing proxies to the request, in the headers, etc.). The only thing that works is this:

requester = urllib3.proxy_from_url(self._PROXY_URL, maxsize=7, headers=self.headers)
thread_pool = workerpool.WorkerPool(size=10)

thread_pool.map(grab_wrapper, [item['link'] for item in products])

thread_pool.shutdown()
thread_pool.wait()

Now, when I run the program, it will make 10 requests (10 threads) and then... stop. No error, no warning whatsoever. This is the only way I have found to go through the proxy, but it seems it is not possible to use proxy_from_url and WorkerPool together.

Any ideas how to combine those two into working code? I would rather avoid rewriting it in Scrapy, etc., due to time constraints.

Regards

First of all, I would suggest avoiding urllib like the plague and instead using requests, which has really easy support for proxies: http://docs.python-requests.org/en/latest/user/advanced/#proxies
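
For reference, a minimal sketch of passing a proxies mapping to requests (the proxy URL below is a placeholder):

import requests

# Placeholder proxy URL; no authentication, as in the question
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}
response = requests.get('http://example.org/', proxies=proxies)
print response.status_code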
Next to that, I haven't used it with multi-threading but with multi-processing, and that worked really well. The only thing you have to figure out is whether you have a dynamic queue or a fairly fixed list that you can spread over workers. Here is an example of the latter, which spreads a list of URLs evenly over x processes:

import math
import multiprocessing

# total_nr_urls, url_loop and job_id_input come from the surrounding code
# *** prepare multi processing
nr_processes = 4
chunksize = int(math.ceil(total_nr_urls / float(nr_processes)))
procs = []
# *** start up processes
for i in range(nr_processes):
    start_row = chunksize * i
    end_row = min(chunksize * (i + 1), total_nr_urls)
    p = multiprocessing.Process(
            target=url_loop,
            args=(start_row, end_row, str(i), job_id_input))
    procs.append(p)
    p.start()
# *** Wait for all worker processes to finish
for p in procs:
    p.join()

Every url_loop process writes its own set of data to tables in a database, so I don't have to worry about joining the results together in Python.

Edit: On sharing data between processes, see: http://docs.python.org/2/library/multiprocessing.html?highlight=multiprocessing#multiprocessing

from multiprocessing import Process, Value, Array

def f(n, a):
    # mutate the shared value and array in the child process
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)        # shared double
    arr = Array('i', range(10))  # shared array of ints

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    # changes made in the child are visible in the parent
    print num.value
    print arr[:]

But as you see, these special types (Value and Array) enable sharing of data between the processes. If you are instead looking for a queue to do round-robin-like processing, you can use JoinableQueue. Hope this helps!
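
For example, a rough sketch of such a round-robin setup with JoinableQueue (the worker body is a placeholder):

from multiprocessing import Process, JoinableQueue

def worker(queue):
    while True:
        url = queue.get()
        try:
            pass  # fetch/process the url here (placeholder)
        finally:
            queue.task_done()

if __name__ == '__main__':
    queue = JoinableQueue()
    for i in range(4):
        p = Process(target=worker, args=(queue,))
        p.daemon = True   # workers exit when the main process does
        p.start()
    for url in ['http://example.com/a', 'http://example.com/b']:
        queue.put(url)
    queue.join()  # block until every queued item has been marked done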

It seems you are discarding the result of the call to thread_pool.map(). Try assigning it to a variable:

import urllib3
import workerpool

# PROXY and LINKS are assumed to be defined elsewhere
requester = urllib3.proxy_from_url(PROXY, maxsize=7)
thread_pool = workerpool.WorkerPool(size=10)


def grab_wrapper(url):
    return requester.request('GET', url)


results = thread_pool.map(grab_wrapper, LINKS)

thread_pool.shutdown()
thread_pool.wait()

Note: If you are using Python 3.2 or greater, you can use concurrent.futures.ThreadPoolExecutor. It is quite similar to workerpool but is included in the standard library.
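
For comparison, a rough equivalent of the snippet above using ThreadPoolExecutor; PROXY and LINKS are assumed to be defined as before:

from concurrent.futures import ThreadPoolExecutor

import urllib3

requester = urllib3.proxy_from_url(PROXY, maxsize=7)

def grab_wrapper(url):
    return requester.request('GET', url)

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(grab_wrapper, LINKS))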
