Fastest way to read and process 100,000 URLs in Python
I have a file with 100,000 URLs that I need to request and then process. The processing takes a non-negligible amount of time compared to the request, so simply using multithreading seems to give me only a partial speed-up. From what I have read, I think the multiprocessing module, or something similar, would offer a more substantial speed-up because I could use multiple cores. I'm guessing I want to use multiple processes, each with multiple threads, but I'm not sure how to do that.
Here is my current code, using threading (based on "What is the fastest way to send 100,000 HTTP requests in Python?"):
from threading import Thread
from queue import Queue
import sys

import requests
from bs4 import BeautifulSoup

concurrent = 100


def worker():
    # Each thread pulls URLs off the shared queue until the program exits
    while True:
        url = q.get()
        html = get_html(url)
        process_html(html)
        q.task_done()


def get_html(url):
    try:
        html = requests.get(url, timeout=5, headers={'Connection': 'close'}).text
        return html
    except Exception:
        print("error", url)
        return None


def process_html(html):
    if html is None:
        return
    soup = BeautifulSoup(html)
    text = soup.get_text()
    # do some more processing
    # write the text to a file


# Bounded queue so the reader does not run far ahead of the workers
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

try:
    for url in open('text.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
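The "multiple processes, each with multiple threads" idea from the question could look roughly like the sketch below. It is not a tested solution for this workload. It swaps requests for the standard library's urllib.request so that it runs without extra packages, and `handle_chunk`, `chunked`, `run`, `THREADS_PER_PROCESS`, and the chunk size of 500 are all hypothetical names and values chosen for illustration. `process_html` here is only a placeholder for the BeautifulSoup step.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import urllib.request

THREADS_PER_PROCESS = 20  # assumed value; tune for your workload


def get_html(url):
    """Fetch one page; return None on any error (stdlib stand-in for requests.get)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return None


def process_html(html):
    """Placeholder for the CPU-heavy step (BeautifulSoup parsing in the question)."""
    return None if html is None else len(html)


def handle_chunk(urls, fetch=get_html, process=process_html):
    """Run inside one process: threads download while this thread processes pages in order."""
    with ThreadPoolExecutor(max_workers=THREADS_PER_PROCESS) as threads:
        return [process(html) for html in threads.map(fetch, urls)]


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run(path, processes=None, chunk_size=500):
    """Read URLs from `path` and spread the chunks over a pool of processes."""
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    results = []
    # max_workers=None lets the pool pick a default based on the CPU count
    with ProcessPoolExecutor(max_workers=processes) as pool:
        for part in pool.map(handle_chunk, chunked(urls, chunk_size)):
            results.extend(part)
    return results
```

To try it, call `run("text.txt")` under an `if __name__ == "__main__":` guard, which process pools need on platforms that start workers by re-importing the script. Each process handles the processing step in its own main thread while its threads keep downloading, so the CPU-bound work is spread across cores.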
If the file isn't bigger than your available memory, memory-map it with mmap (https://docs.python.org/3/library/mmap.html) instead of reading it through the object returned by "open". Reads should then run at about the same speed as working with memory rather than a file.
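import mmap

with open("test.txt", "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ matches the read-only open mode
    mmap_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # code that does what you need
    mmap_file.close()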
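As a rough illustration of how the mapped file could then feed the URL loop, the sketch below reads it line by line. `iter_urls` is a hypothetical helper name, and the file is assumed to be non-empty UTF-8 text with one URL per line.

```python
import mmap


def iter_urls(path):
    """Yield stripped, non-empty lines from a memory-mapped file."""
    with open(path, "rb") as f:
        # mmap raises ValueError for an empty file, so the file must have content
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # readline returns b"" at end of file, which stops the iterator
            for line in iter(mm.readline, b""):
                url = line.strip().decode("utf-8")
                if url:
                    yield url
```

Something like `for url in iter_urls("test.txt"): q.put(url)` would then replace the plain loop over the file object.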