In the code below, I am considering using multi-threading or multi-processing for fetching from the URL. I think pools would be ideal; can anyone suggest a solution?
Idea: pool the threads/processes and collect the data. My preference is process over thread, but I'm not sure.
import urllib

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP',)

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    data_fp = fetch_quote(symbols)
    # print data_fp

if __name__ == '__main__':
    main()
So here's a very simple example: it iterates over the symbols, passing them one at a time to fetch_quote.
import urllib
import multiprocessing

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP',)

def fetch_quote(symbol):
    # Each call now fetches a single symbol.
    url = URL % symbol
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    PROCESSES = 4
    print 'Creating pool with %d processes\n' % PROCESSES
    pool = multiprocessing.Pool(PROCESSES)
    print 'pool = %s' % pool
    print

    # args must be a tuple; passing the bare string would unpack it
    # character by character.
    results = [pool.apply_async(fetch_quote, (sym,)) for sym in symbols]

    print 'Ordered results using pool.apply_async():'
    for r in results:
        print '\t', r.get()

    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
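Since fetch_quote now takes a single symbol, the same thing can be written more compactly with pool.map, which submits all the calls and returns the results in input order. A minimal sketch, reusing the fetch_quote and symbols defined above:

def main():
    pool = multiprocessing.Pool(4)
    # map() blocks until all workers have finished and preserves input order.
    results = pool.map(fetch_quote, symbols)
    for data in results:
        print '\t', data
    pool.close()
    pool.join()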
You have a single process that requests several pieces of information at once. Let's try to fetch this information one piece at a time. Your code becomes:
def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    for symbol in symbols:
        data_fp = fetch_quote((symbol,))
        print data_fp

if __name__ == "__main__":
    main()
So main() requests every URL one by one to get the data. Let's multiprocess it with a pool:
import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    pool = Pool(processes=5)
    # Submit every request first so they all run concurrently;
    # calling get() inside the submit loop would serialize them.
    results = [pool.apply_async(fetch_quote, [(symbol,)]) for symbol in symbols]
    for result in results:
        print result.get(timeout=10)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
In this main, the pool workers request each symbol's URL, so the fetches run concurrently.
Note: in Python, since the GIL is present, multithreading must mostly be considered the wrong solution.
For documentation, see: Multiprocessing in Python
Actually it's possible to do it with neither. You can get it done in one thread, using asynchronous calls such as twisted.web.client.getPage from Twisted Web.
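For illustration, here is a minimal sketch of that single-threaded approach using the classic getPage API (an older Twisted interface, deprecated in later releases in favor of twisted.web.client.Agent; the print_quote helper is just an assumed callback for this example):

from twisted.internet import reactor, defer
from twisted.web.client import getPage

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN')

def print_quote(data):
    print data

def main():
    # getPage() returns a Deferred immediately, so every request is
    # in flight at once, all within a single thread.
    deferreds = [getPage(URL % symbol).addCallback(print_quote)
                 for symbol in symbols]
    # Stop the reactor once every request has completed (or failed).
    defer.DeferredList(deferreds, consumeErrors=True).addCallback(
        lambda _: reactor.stop())

if __name__ == '__main__':
    reactor.callWhenRunning(main)
    reactor.run()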
As you may know, multi-threading in Python is not actually parallel, due to the GIL: essentially only a single thread runs at any given time. So if you want multiple URLs to be fetched at the same time, multi-threading might not be the way to go. Also, after the crawl, do you store the data in a single file or in some persistent DB? That decision could affect your performance.
Multiple processes are more efficient in that respect, but have the time and memory overhead of spawning the extra processes. I have explored both of these options in Python recently. Here's the url (with code) -
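An easy way to benchmark the two yourself: multiprocessing.dummy exposes the same Pool API backed by threads, so you can switch between a thread pool and a process pool by changing one import. A minimal sketch, assuming the single-symbol fetch_quote and the symbols tuple from the answers above:

from multiprocessing import Pool                      # process-backed pool
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool, same API

def fetch_all(pool_cls, workers=4):
    pool = pool_cls(workers)
    try:
        # Identical call for both pool types, so timing the two runs
        # shows which overhead dominates for your workload.
        return pool.map(fetch_quote, symbols)
    finally:
        pool.close()
        pool.join()

process_results = fetch_all(Pool)
thread_results = fetch_all(ThreadPool)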