
python parallel send 1000+ url requests and get content info

I have a function that fetches the title and text of a news article and appends them to lists.

I want to call 1000+ URLs and run that function on each of them.

Goal: run it once and get 1000+ titles and texts back in lists, instead of looping over the URLs and calling the function on each one sequentially.

The code I have so far:

Setup:

import requests
from newspaper import Article
from langdetect import detect  # assumed source of detect() used below; not shown in the original imports
import threading

1) This is the "worker" function:

def get_url_info(url):
    # Default everything to None so the return below always works,
    # even when a check fails or an exception is raised.
    title, text, test_url = None, None, None
    try:
        r = requests.head(url)
        if r.status_code < 400:  # the URL loads
            article = Article(url)
            article.download()
            article.parse()
            if detect(article.title) == 'en':  # English only
                if len(article.text) >= 50:  # filter out permission-request pages
                    title = article.title.encode('ascii', errors='ignore')
                    text = article.text.encode('ascii', errors='ignore')
                    test_url = url
    except Exception as e:
        issue = url  # storing issue urls
        print(e, url)

    return title, text, test_url

2) This is the function that actually appends to the lists:

def get_text_list():
    text_list = []   # article content list
    test_urls = []   # urls that work
    title_list = []  # article titles
    url_list = get_tier_3()[:8000]  # get first 8000 english texts for testing
    threads = [threading.Thread(target=get_url_info, args=(url,)) for url in url_list]
    for i, thread in enumerate(threads):
        # originally this was `for url in url_list`
        thread.start()
        """
        title, text, test_url = call do work here
        title_list.append(title)
        text_list.append(text)
        test_urls.append(test_url)
        """
        print(i)  # counts number of urls from DB processed

    return text_list, test_urls, title_list

Problem: after setting up the threads, I don't know how to go on and collect the information back from each thread.

I think the multiprocessing module may be better suited to this task. Because of the way CPython is implemented (the global interpreter lock), the threading module cannot achieve true parallelism for CPU-bound work such as parsing the downloaded articles; the HTTP requests themselves are I/O-bound, so threads do still help with the download part.
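
For that I/O-bound part, a fixed-size thread pool also answers the question of how to collect each worker's return value. A minimal sketch using the standard library's concurrent.futures, assuming the get_url_info above (which returns a (title, text, test_url) tuple whose fields are None when a URL fails or is filtered out):

from concurrent.futures import ThreadPoolExecutor, as_completed

def get_text_list_threaded(url_list, max_workers=16):
    title_list, text_list, test_urls = [], [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns a Future whose result() is get_url_info's return value.
        futures = [executor.submit(get_url_info, url) for url in url_list]
        for future in as_completed(futures):  # yields each future as it finishes
            title, text, test_url = future.result()
            if title is None:
                continue  # URL failed or was filtered out
            title_list.append(title)
            text_list.append(text)
            test_urls.append(test_url)
    return title_list, text_list, test_urls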

I would also recommend against spawning a separate thread or process for every URL you want to process. You are very unlikely to gain any performance that way, because all of the threads or processes end up competing for system resources.

Usually the better solution is to spawn a smaller number of threads/processes and delegate a batch of URLs to each of them. A simple way to achieve this with a multiprocessing pool is as follows:

from multiprocessing import Pool

NUM_PROCS = 4  # example number of processes to be used in multiprocessing

def get_url_info(url):
    ...  # the worker from above; returns (title, text, test_url), with None fields on failure

def get_text_list():

    # Get your list of URLs
    url_list = get_tier_3()[:8000]

    # Output lists
    title_list, text_list, test_urls = [], [], []

    # Initialize a multiprocessing pool that will close after finishing execution.
    with Pool(NUM_PROCS) as pool:
        results = pool.map(get_url_info, url_list)

    for title, text, test_url in results:
        if title is None:
            continue  # skip URLs that failed or were filtered out
        title_list.append(title)
        text_list.append(text)
        test_urls.append(test_url)

    return title_list, text_list, test_urls
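
If you would rather consume results as they finish (for example, to keep the question's progress counter), Pool.imap_unordered yields results in completion order instead of waiting for the whole batch. A minimal sketch under the same assumptions about get_url_info:

def get_text_list_streaming():
    url_list = get_tier_3()[:8000]
    title_list, text_list, test_urls = [], [], []
    with Pool(NUM_PROCS) as pool:
        # chunksize hands each worker a batch of URLs, cutting inter-process overhead;
        # imap_unordered yields results as workers finish, in arbitrary order.
        results = pool.imap_unordered(get_url_info, url_list, chunksize=50)
        for i, (title, text, test_url) in enumerate(results):
            print(i)  # progress: number of URLs processed so far
            if title is None:
                continue  # URL failed or was filtered out
            title_list.append(title)
            text_list.append(text)
            test_urls.append(test_url)
    return title_list, text_list, test_urls

Note that with multiprocessing the worker has to be picklable: on Windows (the spawn start method), get_url_info must be defined at the top level of an importable module, and the pool should be created under an if __name__ == '__main__': guard.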

I hope this helps!
