I am new to Python programming.
I am trying to parse HTTP responses from Instagram to find a specific word using regular expressions.
I've used multiprocessing, but it's still SLOW. I know my code might look stupid, but it's the best I can do.
What am I doing wrong that makes it slow? I need it to send multiple HTTP requests faster.
import requests
import re
import time
from multiprocessing.dummy import Pool
from multiprocessing import cpu_count

Nthreads = cpu_count() * 2
pool = Pool(Nthreads)

f = open('full.txt', 'r')
fw = open('out.txt', 'w')

def findSnap(bio):
    # Raw strings so the \s and \w escapes reach the regex engine intact
    regex = r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9].*'
    snap = re.findall(regex, bio)
    if not snap:
        return None
    afterSnap = re.sub(r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9]*\s*', '', snap[0])
    if afterSnap:
        afterSnap = re.findall(r'[\w_\.-]*', afterSnap)[0]
        sftS = afterSnap.split()
        if sftS:
            return sftS[0]
    return None

def loadInfo(url):
    try:
        page = requests.get(url).text.lower()
    except Exception:
        print('Something is wrong!')
        return None
    snap = findSnap(page)
    if snap:
        fw.write(snap + '\n')
        fw.flush()
        print(snap)
    return snap

start = time.time()
names = f.read().splitlines()
baseUrl = 'https://instagram.com/'
urls = map(lambda x: baseUrl + x, names)
pool.map(loadInfo, urls)
finish = time.time()
print((finish - start) / 60)
fw.close()
As some people are saying, we probably need more details: what times are you getting, what times are you expecting, and why do you expect them? Many factors besides your code can affect the execution time of your application, because it depends on a third-party resource.
In any case, I see you are using multiprocessing.dummy, which is just a wrapper around the threading module [1]. According to its documentation, threading is not the best choice for running CPU-bound Python code in parallel [2]:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
It is true that your case involves I/O operations, but evaluating the regular expressions is also CPU-heavy work.
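Independently of which pool you choose, one cheap win is to compile the patterns once at module level instead of passing pattern strings to re.findall on every call. A minimal sketch, using a simplified find_snap (a hypothetical stand-in for the original findSnap) with the question's first pattern:

```python
import re

# Compiled once at import time and reused by every worker call.
# re.findall(pattern_string, ...) consults an internal cache on each call;
# a pre-compiled pattern skips that lookup entirely.
SNAP_PATTERN = re.compile(r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9].*')

def find_snap(bio):
    # Same first-stage match as the original findSnap
    return SNAP_PATTERN.findall(bio)

print(find_snap('content="add me on snapchat: user_123"'))
```

This does not remove the GIL contention, but it trims per-call overhead in the hottest part of the worker.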
As the quoted text says, you can try one of the real process pools in the multiprocessing module instead of dummy, or use concurrent.futures.ProcessPoolExecutor.
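A minimal sketch of the same fan-out with concurrent.futures. Here fetch_and_scan is a hypothetical stand-in for loadInfo, with the network call simulated by a short sleep so the sketch is self-contained:

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor

# Compiled once, shared by all workers
SNAP_RE = re.compile(r'sn[a]*p[a-z]*')

def fetch_and_scan(url):
    # In the real code this would be requests.get(url).text.lower();
    # the sleep stands in for the network wait, which releases the GIL
    time.sleep(0.01)
    return SNAP_RE.search(url.lower()) is not None

urls = ['https://instagram.com/snapuser', 'https://instagram.com/other']

# map preserves input order in its results
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_and_scan, urls))
```

ThreadPoolExecutor is shown because it is safe to run anywhere; ProcessPoolExecutor exposes the same map interface, so if profiling shows the regex work dominating the download time, you can swap it in (with the submission code guarded by `if __name__ == '__main__':`, which multiprocessing requires on some platforms).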