python parallel send 1000+ url requests and get content info
I have a function that fetches the title and content of a news article and appends them to lists.

I want to call this function over 1000+ URLs.

Goal: run it once and get 1000+ titles and contents back in the lists, without looping over each URL and calling the function sequentially.

My code so far:

Setup:
```python
import requests
from newspaper import Article
from langdetect import detect  # detect() is used below; this import was missing
import threading
```
1) This is the "worker" function:
```python
def get_url_info(url):
    title = text = test_url = None  # defaults so the return never raises NameError
    try:
        r = requests.head(url)
        if r.status_code < 400:  # URL loads
            article = Article(url)
            article.download()
            article.parse()
            if detect(article.title) == 'en':  # English only
                if len(article.text) >= 50:  # filter out permission-request pages
                    title = article.title.encode('ascii', errors='ignore')
                    text = article.text.encode('ascii', errors='ignore')
                    test_url = url
    except Exception as e:
        print(e, url)  # log problem URLs
    return title, text, test_url
```
2) This is the function that actually appends to the lists:
```python
def get_text_list():
    text_list = []   # article contents
    test_urls = []   # URLs that work
    title_list = []  # article titles
    url_list = get_tier_3()[:8000]  # get first 8000 english texts for testing
    threads = [threading.Thread(target=get_url_info, args=(url,)) for url in url_list]
    for thread in threads:
        # originally this was: for url in url_list
        thread.start()
        """
        title, text, test_url = call do work here
        title_list.append(title)
        text_list.append(text)
        test_urls.append(test_url)
        """
    return text_list, test_urls, title_list
```
Problem: after setting up the threads, I don't know how to proceed to collect the results back from each thread.
I think the multiprocessing module may be better suited to this task. Because of how CPython is implemented (the global interpreter lock), the threading module cannot achieve true parallelism for CPU-bound work; the HTTP requests themselves are I/O-bound, but the article downloading and parsing you do on each page is CPU-heavy.

I would also recommend against spawning a separate thread or process for every URL you want to process. With all of those threads or processes competing for system resources, you are very unlikely to see any performance gain that way.

A better solution is usually to spawn a smaller number of threads/processes and delegate a batch of URLs to each one. A simple way to achieve this with a multiprocessing pool looks like this:
```python
from multiprocessing import Pool

NUM_PROCS = 4  # example number of processes to be used in multiprocessing

def get_url_info(url):
    ...

def get_text_list():
    # Get your list of URLs
    url_list = get_tier_3()[:8000]

    # Output lists
    title_list, text_list, test_urls = [], [], []

    # Initialize a multiprocessing pool that will close after finishing execution
    with Pool(NUM_PROCS) as pool:
        results = pool.map(get_url_info, url_list)

    for title, text, test_url in results:
        title_list.append(title)
        text_list.append(text)
        test_urls.append(test_url)

    return title_list, text_list, test_urls
```
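Since the downloads themselves are I/O-bound, a bounded thread pool is also worth trying; `concurrent.futures.ThreadPoolExecutor` exposes the same map-style interface as `Pool`. A minimal sketch, assuming a generic worker that returns a `(title, text, url)` tuple (the `fetch_all` helper, `NUM_WORKERS` value, and stub worker below are illustrative, not from the original post):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 16  # threads are cheap for I/O-bound work, so this can exceed the CPU count

def fetch_all(urls, worker):
    # Map worker over urls with a bounded pool of threads.
    # executor.map preserves input order, just like Pool.map.
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
        return list(executor.map(worker, urls))

# Demo with a stub worker in place of real HTTP calls
stub = lambda u: (u.upper(), len(u), u)
print(fetch_all(["a.com", "bb.org"], stub))  # [('A.COM', 5, 'a.com'), ('BB.ORG', 6, 'bb.org')]
```

The trade-off is the same as with the process pool: a fixed number of workers shares the URL list instead of one thread per URL contending for resources.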
I hope this helps!