
Multithreaded web scraping - how to make it faster?

I have a list of 10 elements, each of which is interpolated into a URL, so 10 URLs should be read and the same kind of data fetched from each. I got help from an expert here, so I tried multithreading. But it still takes too long - how can I improve it so it runs in about 10-15 seconds?

ticker = ['GE', 'F', 'BAC', 'CCL', 'DAL', 'OXY', 'WFC', 'BA', 'T', 'MRO']

This list also comes from a website.

The part that takes too much time is the code below. Is there any chance I can make it faster? I'm desperate - thanks very much in advance for your help.

import threading

import requests
from bs4 import BeautifulSoup

result = []

def fetch(tick):
    url = "https://finance.yahoo.com/quote/" + tick + "?p=" + tick + "&.tsrc=fin-srch-v1"
    yahoo = requests.get(url)
    access2 = BeautifulSoup(yahoo.text, 'html.parser')
    rows = access2.select('#quote-summary > div > table > tbody > tr > td > span')
    result.extend(rows)  # list.extend is atomic in CPython, so sharing result here is safe


def executor():
    threads = []
    for tick in ticker:
        t = threading.Thread(target=fetch, args=(tick,))  # Create a new thread per ticker
        t.start()  # Execute the target of the thread - fetch
        threads.append(t)

    for t in threads:
        t.join()  # Wait for the child thread to complete

    return result
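As an alternative to managing threads and a shared list by hand, the same fan-out can be expressed with `concurrent.futures.ThreadPoolExecutor`. This is a minimal sketch of the pattern only: `fetch` is stubbed to return `[tick]` so it runs without network access; in real use its body would be the requests/BeautifulSoup logic above.

```python
from concurrent.futures import ThreadPoolExecutor

ticker = ['GE', 'F', 'BAC', 'CCL', 'DAL', 'OXY', 'WFC', 'BA', 'T', 'MRO']

def fetch(tick):
    # Stub: the real version would download and parse the Yahoo page for `tick`.
    return [tick]

def executor():
    # map() runs fetch concurrently but yields results in input order,
    # so the output lines up with `ticker`.
    with ThreadPoolExecutor(max_workers=10) as pool:
        per_ticker_rows = pool.map(fetch, ticker)
    result = []
    for rows in per_ticker_rows:
        result.extend(rows)
    return result
```

Because the results are collected from the futures rather than appended from inside each thread, no shared mutable state is needed.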

You can use gevent.

Here is a good example: https://sdiehl.github.io/gevent-tutorial/

import gevent.monkey
gevent.monkey.patch_all()  # patch before importing requests so its sockets cooperate

import gevent
import requests
from bs4 import BeautifulSoup

ticker = ['GE', 'F', 'BAC', 'CCL', 'DAL', 'OXY', 'WFC', 'BA', 'T', 'MRO']
result = []

def fetch(tick):
    url = "https://finance.yahoo.com/quote/" + tick + "?p=" + tick + "&.tsrc=fin-srch-v1"
    yahoo = requests.get(url)
    access2 = BeautifulSoup(yahoo.text, 'html.parser')
    rows = access2.select('#quote-summary > div > table > tbody > tr > td > span')
    result.extend(rows)


def asynchronous():
    greenlets = []
    for tick in ticker:
        greenlets.append(gevent.spawn(fetch, tick))
    gevent.joinall(greenlets)


asynchronous()

Take this as an example.

from math import sqrt
from sklearn.cluster import MiniBatchKMeans 
import pandas_datareader as dr
from matplotlib import pyplot as plt
import pandas as pd
import matplotlib.cm as cm
import seaborn as sn

start = '2019-1-1'
end = '2020-1-1'

tickers = ['AXP', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'XOM', 'GS', 'HD', 'IBM',
           'INTC', 'JNJ', 'KO', 'JPM', 'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE',
           'PG', 'TRV', 'UNH', 'RTX', 'VZ', 'V', 'WBA', 'WMT', 'DIS', 'DOW']
prices_list = []
for ticker in tickers:
    try:
        prices = dr.DataReader(ticker, 'yahoo', start)['Adj Close']
        prices = pd.DataFrame(prices)
        prices.columns = [ticker]
        prices_list.append(prices)
    except Exception:
        pass  # skip tickers that fail to download
prices_df = pd.concat(prices_list, axis=1)  # concatenate once, outside the loop
prices_df.sort_index(inplace=True)
prices_df.head()
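The loop above still downloads one ticker at a time, so the thread-pool idea from the question applies here too. In this sketch `load_prices` is a stub standing in for `dr.DataReader(ticker, 'yahoo', start)['Adj Close']`, so the pattern runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

tickers = ['AXP', 'AAPL', 'BA']

def load_prices(ticker):
    # Stub for dr.DataReader(ticker, 'yahoo', start)['Adj Close']:
    # returns a one-column DataFrame named after the ticker.
    return pd.DataFrame({ticker: [1.0, 2.0]})

# map() preserves input order, so the columns come out in ticker order
with ThreadPoolExecutor(max_workers=8) as pool:
    frames = list(pool.map(load_prices, tickers))

prices_df = pd.concat(frames, axis=1).sort_index()
```

Since the downloads are independent of each other, the wall-clock time drops to roughly that of the slowest single request.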


