
Fastest way to read and process 100,000 URLs in Python

I have a file with 100,000 URLs that I need to request and then process. The processing takes a non-negligible amount of time compared to the request, so simply using multithreading seems to only give me a partial speed-up. From what I have read, I think using the multiprocessing module, or something similar, would offer a more substantial speed-up because I could use multiple cores. I'm guessing I want to use multiple processes, each with multiple threads, but I'm not sure how to do that.

Here is my current code, using threading (based on What is the fastest way to send 100,000 HTTP requests in Python?):

from threading import Thread
from queue import Queue
import requests
from bs4 import BeautifulSoup
import sys

concurrent = 100

def worker():
    while True:
        url = q.get()
        html = get_html(url)
        process_html(html)
        q.task_done()

def get_html(url):
    try:
        html = requests.get(url, timeout=5, headers={'Connection':'close'}).text
        return html
    except requests.RequestException:
        print("error", url)
        return None

def process_html(html):
    if html is None:
        return
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # do some more processing
    # write the text to a file

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=worker)
    t.daemon = True
    t.start()
try:
    for url in open('text.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
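
For what it's worth, here is one way the "multiple processes, each with multiple threads" idea could be arranged with concurrent.futures: a ProcessPoolExecutor owns the CPU-bound parsing, and each worker process downloads its chunk of URLs with its own thread pool. This is only a sketch, not a tested solution; the chunk size, the worker counts, and the fetch/handle_chunk helper names are assumptions.

import concurrent.futures

import requests
from bs4 import BeautifulSoup

THREADS_PER_PROCESS = 20   # assumption: tune for your bandwidth
PROCESSES = 4              # assumption: roughly the number of CPU cores

def fetch(url):
    try:
        return requests.get(url, timeout=5, headers={'Connection': 'close'}).text
    except requests.RequestException:
        return None

def handle_chunk(urls):
    # Runs inside a worker process: download the chunk with threads (I/O-bound),
    # then parse each page in this process (CPU-bound).
    results = []
    with concurrent.futures.ThreadPoolExecutor(THREADS_PER_PROCESS) as pool:
        for html in pool.map(fetch, urls):
            if html is None:
                continue
            text = BeautifulSoup(html, 'html.parser').get_text()
            results.append(text)   # or write the text to a per-process file
    return len(results)

def chunks(seq, size):
    # Yield consecutive slices of the URL list
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == '__main__':
    urls = [line.strip() for line in open('text.txt') if line.strip()]
    with concurrent.futures.ProcessPoolExecutor(PROCESSES) as procs:
        done = sum(procs.map(handle_chunk, chunks(urls, 1000)))
    print(done, "pages processed")

Splitting the URL list into chunks keeps the amount of data pickled between processes small, and each process overlaps its own downloads with its parsing.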

If the file isn't bigger than your available memory, instead of opening it with the "open" method, use mmap (https://docs.python.org/3/library/mmap.html). It will give you roughly the same speed as working with memory rather than a file.

with open("test.txt") as f:
    mmap_file = mmap.mmap(f.fileno(), 0)
    # code that does what you need
    mmap_file.close()
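
Note that an mmap object yields bytes, so each line needs decoding before being put on the queue. A minimal sketch of feeding the question's queue q from the mapped file (assuming the same 'text.txt') might look like this:

import mmap

with open('text.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # mmap.readline() returns b'' at end of file, so iter() stops there
    for line in iter(mm.readline, b''):
        url = line.decode().strip()
        if url:
            q.put(url)
    mm.close()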
