
Multithreaded web scraping - how to make it faster?

I have a list of 10 elements, each of which is interpolated into a URL, so 10 URLs should be read and the same kind of data fetched from each. I got help from an expert here, so I tried multithreading. But it still takes too long - how can I improve it so it runs in about 10-15 seconds?

ticker = ['GE', 'F', 'BAC', 'CCL', 'DAL', 'OXY', 'WFC', 'BA', 'T', 'MRO']

This list also comes from a website.

The part that takes too much time is the code below. Is there any chance I can make it faster? I'm desperate - thanks very much in advance for your help.

import threading

import requests
from bs4 import BeautifulSoup

result = []

def fetch(tick):
    url = "https://finance.yahoo.com/quote/" + tick + "?p=" + tick + "&.tsrc=fin-srch-v1"
    yahoo = requests.get(url)
    access2 = BeautifulSoup(yahoo.text, 'html.parser')
    rows = access2.select('#quote-summary > div > table > tbody > tr > td > span')
    result.extend(rows)  # list.extend is atomic in CPython, so sharing result here is safe


def executor():
    threads = []
    for tick in ticker:
        t = threading.Thread(target=fetch, args=(tick,))  # Create a new thread per ticker
        t.start()  # Execute the target of the thread - fetch
        threads.append(t)

    for t in threads:
        t.join()  # Wait for the child thread to complete

    return result
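As an alternative to managing threads and a shared list by hand, the same fan-out can be expressed with `concurrent.futures.ThreadPoolExecutor`. This is a minimal sketch of the pattern only: `fetch` is stubbed to return `[tick]` so it runs without network access; in real use its body would be the requests/BeautifulSoup logic above.

```python
from concurrent.futures import ThreadPoolExecutor

ticker = ['GE', 'F', 'BAC', 'CCL', 'DAL', 'OXY', 'WFC', 'BA', 'T', 'MRO']

def fetch(tick):
    # Stub: the real version would download and parse the Yahoo page for `tick`.
    return [tick]

def executor():
    # map() runs fetch concurrently but yields results in input order,
    # so the output lines up with `ticker`.
    with ThreadPoolExecutor(max_workers=10) as pool:
        per_ticker_rows = pool.map(fetch, ticker)
    result = []
    for rows in per_ticker_rows:
        result.extend(rows)
    return result
```

Because the results are collected from the futures rather than appended from inside each thread, no shared mutable state is needed.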

You can use gevent.

Here is a good example: https://sdiehl.github.io/gevent-tutorial/

import gevent.monkey
gevent.monkey.patch_all()  # patch before importing requests so its sockets cooperate

import gevent
import requests
from bs4 import BeautifulSoup

ticker = ['GE', 'F', 'BAC', 'CCL', 'DAL', 'OXY', 'WFC', 'BA', 'T', 'MRO']
result = []

def fetch(tick):
    url = "https://finance.yahoo.com/quote/" + tick + "?p=" + tick + "&.tsrc=fin-srch-v1"
    yahoo = requests.get(url)
    access2 = BeautifulSoup(yahoo.text, 'html.parser')
    rows = access2.select('#quote-summary > div > table > tbody > tr > td > span')
    result.extend(rows)


def asynchronous():
    greenlets = []
    for tick in ticker:
        greenlets.append(gevent.spawn(fetch, tick))
    gevent.joinall(greenlets)


asynchronous()

Take this as an example.

from math import sqrt
from sklearn.cluster import MiniBatchKMeans 
import pandas_datareader as dr
from matplotlib import pyplot as plt
import pandas as pd
import matplotlib.cm as cm
import seaborn as sn

start = '2019-1-1'
end = '2020-1-1'

tickers = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'XOM', 'GS', 'HD', 'IBM',
           'INTC', 'JNJ', 'KO', 'JPM', 'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE',
           'PG', 'TRV', 'UNH', 'RTX', 'VZ', 'V', 'WBA', 'WMT', 'DIS', 'DOW']
prices_list = []
for ticker in tickers:
    try:
        prices = dr.DataReader(ticker, 'yahoo', start)['Adj Close']
        prices = pd.DataFrame(prices)
        prices.columns = [ticker]
        prices_list.append(prices)
    except Exception:
        pass  # skip tickers that fail to download
prices_df = pd.concat(prices_list, axis=1)  # concatenate once, outside the loop
prices_df.sort_index(inplace=True)
prices_df.head()
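The loop above still downloads one ticker at a time, so the thread-pool idea from the question applies here too. In this sketch `load_prices` is a stub standing in for `dr.DataReader(ticker, 'yahoo', start)['Adj Close']`, so the pattern runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

tickers = ['AXP', 'AAPL', 'BA']

def load_prices(ticker):
    # Stub for dr.DataReader(ticker, 'yahoo', start)['Adj Close']:
    # returns a one-column DataFrame named after the ticker.
    return pd.DataFrame({ticker: [1.0, 2.0]})

# map() preserves input order, so the columns come out in ticker order
with ThreadPoolExecutor(max_workers=8) as pool:
    frames = list(pool.map(load_prices, tickers))

prices_df = pd.concat(frames, axis=1).sort_index()
```

Since the downloads are independent of each other, the wall-clock time drops to roughly that of the slowest single request.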


