
Multicore processing on scraper function

I would like to speed up my scraper by using multiple cores, so that several cores can scrape URLs in parallel from a list using my predefined function scrape. How can I do this?

This is my current code:

for x in URLs['identifier'][1:365]:
    test = scrape(x)
    results = test.get_results
    results['identifier'] = x
    final = final.append(results)

Something like this (or you can also use Scrapy).

It will easily let you fire a lot of requests in parallel, provided the server can handle it as well:

# it's just a wrapper around concurrent.futures ThreadPoolExecutor with a nice tqdm progress bar!
from tqdm.contrib.concurrent import thread_map, process_map  # for multi-threading and multi-processing respectively

def chunk_list(lst, size):
    # split the full URL list into smaller chunks so each chunk's results can be cached separately
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

for idx, my_chunk in enumerate(chunk_list(huge_list, size=2**12)):
    for response in thread_map(which_func_to_call, my_chunk, max_workers=your_cpu_cores + 6):
        # which_func_to_call -> wrap the returned response json obj in this, etc.
        # do something with the response now..
        pass
    # make sure to cache the chunk results as well (in case you have a lot of them)
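
For illustration only, here is one way the placeholders above could be filled in; fetch_json and the example URL list are hypothetical stand-ins for which_func_to_call and huge_list, not part of the original answer:

import requests
from tqdm.contrib.concurrent import thread_map

def fetch_json(url):
    # hypothetical which_func_to_call: fetch one URL and return its JSON body
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

# hypothetical stand-in for huge_list
urls = ['https://httpbin.org/get?page=%d' % i for i in range(50)]

# max_workers=8 is an arbitrary choice; tune it to your CPU count and the server's limits
for result in thread_map(fetch_json, urls, max_workers=8):
    print(result['args'])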

Or

Use a Pool from Python's multiprocessing module.

from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com/page/'

all_urls = list()

def generate_urls():
    # better to yield them as well if you already have the URL's list etc..
    for i in range(1,11):
        all_urls.append(base_url + str(i))
    
def scrape(url):
    # fetch the page and report the HTTP status; parsing with BeautifulSoup would go here
    res = requests.get(url)
    print(res.status_code, res.url)

generate_urls()

p = Pool(10)             # 10 worker processes
p.map(scrape, all_urls)  # scrape all URLs in parallel
p.terminate()
p.join()
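
If you also need the scraped results back in the parent process, as in the original loop that builds final, Pool.map returns the values collected from the workers. The sketch below is only an adaptation under assumptions: scrape_one is a hypothetical helper, it reuses scrape(), get_results and URLs from the question's own code, and it presumes get_results is a plain dict; the __main__ guard is there so worker processes can be spawned safely on Windows/macOS.

from multiprocessing import Pool
import pandas as pd

def scrape_one(identifier):
    # uses the question's own scrape()/get_results; assumed to return a dict
    results = scrape(identifier).get_results
    results['identifier'] = identifier
    return results

if __name__ == '__main__':
    identifiers = list(URLs['identifier'][1:365])   # URLs comes from the question's code
    with Pool(processes=8) as pool:                 # 8 workers is an arbitrary choice
        all_results = pool.map(scrape_one, identifiers)
    final = pd.DataFrame(all_results)               # one row per scraped identifier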
