Multithreaded web scraping - how to make it faster?
I have a list of 10 elements, each of which is interpolated into a URL, so 10 URLs should be fetched and the same data extracted from each. An expert here already helped me, so I tried multithreading. But it still takes too long — how can I improve it so the whole run finishes in roughly 10-15 seconds?
ticker = ['GE', 'F', 'BAC', 'CCL', 'DAL', 'OXY', 'WFC', 'BA', 'T', 'MRO']
This list also comes from a website.
The part that costs me too much time is the code below. Is there any chance I can make it faster? I'm desperate — thanks in advance for any help.
import threading
import requests
from bs4 import BeautifulSoup

result = []

def fetch(tick):
    url = ("https://finance.yahoo.com/quote/"+tick+"?p="+tick+"&.tsrc=fin-srch-v1")
    yahoo = requests.get(url)
    access2 = BeautifulSoup(yahoo.text, 'html.parser')
    rows = access2.select('#quote-summary > div > table > tbody > tr > td > span')
    result.extend(rows)

def executor():
    threads = []
    for tick in ticker:
        t = threading.Thread(target=fetch, args=(tick,))  # create a new thread
        t.start()  # start running fetch in that thread
        threads.append(t)
    for t in threads:
        t.join()  # wait for every worker thread to complete
    return result
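If the thread-per-URL version is still slow, much of the remaining win usually comes from bounding the concurrency and letting the pool collect results for you. Below is a minimal sketch using the standard-library `concurrent.futures` — this is an assumed alternative, not code from the question, and the `fetch` passed in here is a deliberate stand-in for the real request-and-parse function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(tickers, fetch, max_workers=10):
    # Run fetch(tick) for every ticker on a bounded thread pool.
    # pool.map returns the results in input order, so no shared
    # result list (and no race on it) is needed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, tickers))

# Placeholder fetch: in the question it would do the requests.get call
# and the BeautifulSoup parse, then return the extracted rows.
rows = fetch_all(['GE', 'F', 'BAC'], lambda tick: tick.lower())
# rows == ['ge', 'f', 'bac']
```

Because `pool.map` preserves input order, each result can also be matched back to its ticker without any locking.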
You can use gevent.
Here is a good example: https://sdiehl.github.io/gevent-tutorial/
import gevent.monkey
gevent.monkey.patch_all()  # patch_all() also covers the modules requests relies on

import gevent
import requests
from bs4 import BeautifulSoup

result = []

def fetch(tick):
    url = ("https://finance.yahoo.com/quote/"+tick+"?p="+tick+"&.tsrc=fin-srch-v1")
    yahoo = requests.get(url)
    access2 = BeautifulSoup(yahoo.text, 'html.parser')
    rows = access2.select('#quote-summary > div > table > tbody > tr > td > span')
    result.extend(rows)

def asynchronous():
    greenlets = []
    for tick in ticker:
        greenlets.append(gevent.spawn(fetch, tick))
    gevent.joinall(greenlets)  # wait for all greenlets to finish

asynchronous()
Take this as an example:
from math import sqrt
from sklearn.cluster import MiniBatchKMeans
import pandas_datareader as dr
from matplotlib import pyplot as plt
import pandas as pd
import matplotlib.cm as cm
import seaborn as sn

start = '2019-1-1'
end = '2020-1-1'
tickers = ['AXP','AAPL','BA','CAT','CSCO','CVX','XOM','GS','HD','IBM','INTC','JNJ','KO','JPM','MCD', 'MMM', 'MRK', 'MSFT', 'NKE','PFE','PG','TRV','UNH','RTX','VZ','V','WBA','WMT','DIS','DOW']

prices_list = []
for ticker in tickers:
    try:
        prices = dr.DataReader(ticker, 'yahoo', start)['Adj Close']
        prices = pd.DataFrame(prices)
        prices.columns = [ticker]
        prices_list.append(prices)
    except Exception:
        # skip tickers whose data could not be downloaded
        pass
prices_df = pd.concat(prices_list, axis=1)
prices_df.sort_index(inplace=True)
prices_df.head()
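For reference, the `pd.concat` step at the end merges the per-ticker frames on their shared date index into one wide table. A self-contained illustration of that step with made-up data (the dates and prices below are invented, not real quotes):

```python
import pandas as pd

dates = pd.to_datetime(['2019-01-02', '2019-01-03'])
# One single-column frame per ticker, mirroring what the loop builds.
frames = [
    pd.DataFrame({'AAPL': [157.9, 142.2]}, index=dates),
    pd.DataFrame({'MSFT': [101.1, 97.4]}, index=dates),
]
# axis=1 concatenates column-wise, aligning rows on the date index.
wide = pd.concat(frames, axis=1).sort_index()
# wide now has one column per ticker, indexed by date.
```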