I am new to Python programming.
I am trying to parse HTTP responses from Instagram to find a specific word using regular expressions.
I've used multiprocessing, but it's still SLOW. I know my code might look stupid, but it's the best I can do.
What am I doing wrong that makes it slow? I need it to send multiple HTTP requests faster.
import requests
import re
import time
from multiprocessing.dummy import Pool
from multiprocessing import cpu_count

Nthreads = cpu_count() * 2
pool = Pool(Nthreads)

f = open('full.txt', 'r')
fw = open('out.txt', 'w')

def findSnap(bio):
    # Raw strings so the \s and \w escapes reach the regex engine intact
    regex = r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9].*'
    snap = re.findall(regex, bio)
    if not snap:
        return None
    afterSnap = re.sub(r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9]*\s*', '', snap[0])
    if afterSnap:
        afterSnap = re.findall(r'[\w_\.-]*', afterSnap)[0]
        sftS = afterSnap.split()
        if sftS:
            return sftS[0]
    return None

def loadInfo(url):
    try:
        page = requests.get(url).text.lower()
    except Exception:
        print('Something is wrong!')
        return None
    snap = findSnap(page)
    if snap:
        fw.write(snap + '\n')
        fw.flush()
        print(snap)
    return snap

start = time.time()
names = f.read().splitlines()
baseUrl = 'https://instagram.com/'
urls = map(lambda x: baseUrl + x, names)
pool.map(loadInfo, urls)
finish = time.time()
print((finish - start) / 60)
fw.close()
As some people are saying, we probably need more details: what times are you getting, what times are you expecting, and why do you expect them? Many factors besides your code can affect the execution time of your application, because it depends on a third-party resource.
In any case, I see you are using multiprocessing.dummy, which is just a wrapper around the threading module [1]. According to its documentation, threading is not the best choice for running CPU-bound Python code in parallel [2]:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
It is true that your case involves I/O operations, but evaluating the regular expressions is also CPU-heavy work.
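Independently of which pool you choose, one cheap win is to compile the patterns once at module level instead of passing pattern strings to re.findall on every call. A minimal sketch, using a simplified find_snap (a hypothetical stand-in for the original findSnap) with the question's first pattern:

```python
import re

# Compiled once at import time and reused by every worker call.
# re.findall(pattern_string, ...) consults an internal cache on each call;
# a pre-compiled pattern skips that lookup entirely.
SNAP_PATTERN = re.compile(r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9].*')

def find_snap(bio):
    # Same first-stage match as the original findSnap
    return SNAP_PATTERN.findall(bio)

print(find_snap('content="add me on snapchat: user_123"'))
```

This does not remove the GIL contention, but it trims per-call overhead in the hottest part of the worker.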
As the quoted text says, you can try one of the real process pools in the multiprocessing module instead of dummy, or use concurrent.futures.ProcessPoolExecutor.
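A minimal sketch of the same fan-out with concurrent.futures. Here fetch_and_scan is a hypothetical stand-in for loadInfo, with the network call simulated by a short sleep so the sketch is self-contained:

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor

# Compiled once, shared by all workers
SNAP_RE = re.compile(r'sn[a]*p[a-z]*')

def fetch_and_scan(url):
    # In the real code this would be requests.get(url).text.lower();
    # the sleep stands in for the network wait, which releases the GIL
    time.sleep(0.01)
    return SNAP_RE.search(url.lower()) is not None

urls = ['https://instagram.com/snapuser', 'https://instagram.com/other']

# map preserves input order in its results
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_and_scan, urls))
```

ThreadPoolExecutor is shown because it is safe to run anywhere; ProcessPoolExecutor exposes the same map interface, so if profiling shows the regex work dominating the download time, you can swap it in (with the submission code guarded by `if __name__ == '__main__':`, which multiprocessing requires on some platforms).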