
Python - Multiple HTTP requests too slow

I am new to Python programming.

I am trying to parse the HTTP requests from Instagram to find a specific word using regular expressions.

I've used multiprocessing, but it's still SLOW. I know my code might look stupid, but it's the best I can do.

What am I doing wrong that makes it slow? I need to make it send multiple HTTP requests faster.

import requests
import re 
import time
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  
from multiprocessing import cpu_count


Nthreads = cpu_count()*2
pool = Pool(Nthreads)


f = open('full.txt','r')
fw = open('out.txt', 'w')


def findSnap(bio):
    # Raw strings so \s, \w etc. are regex escapes, not (invalid) string escapes
    regex = r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9].*'
    snap = re.findall(regex, bio)
    if not snap:
        return None
    else:
        afterSnap = re.sub(r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9]*\s*', '', snap[0])
        if afterSnap:
            afterSnap = re.findall(r'[\w_.-]*', afterSnap)[0]
            sftS = afterSnap.split()
            if sftS:
                return sftS[0]
            return None
        return None

def loadInfo(url):
    #print 'Loading data..'
    st = time.time()
    try:
        page = requests.get(url).text.lower()
    except Exception as e:
        print('Request failed:', e)
        return None


    snap = findSnap(page)
    if snap:
        fw.write(snap + '\n')
        fw.flush()
        print(snap)
    else:
        return None
    return snap

start = time.time()
names = f.read().splitlines()
baseUrl = 'https://instagram.com/'
urls = map(lambda x: baseUrl + x, names)

pool.map(loadInfo, urls)
finish = time.time()

print((finish- start)/60)
fw.close()

As some people have said, we may need more detail: what times are you getting, what were you expecting, and why? Many factors besides your code can affect execution time, since your application depends on a third-party resource.

In any case, I see you are using multiprocessing.dummy, which is just a wrapper around the threading module [1]. According to its documentation, threading is not the best module for running CPU-bound Python code in parallel [2]:
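You can verify that the dummy pool's "processes" are really threads with a small sketch (the worker function here is just an illustration):

```python
from multiprocessing.dummy import Pool
import threading

with Pool(2) as pool:
    # Each worker reports its thread name: none of them is the main thread,
    # confirming that multiprocessing.dummy schedules threads, not processes.
    workers = set(pool.map(lambda _: threading.current_thread().name, range(8)))
```

All the work therefore runs inside one interpreter process, subject to the GIL described in the quote below.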

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

It is true that your case involves I/O operations, but running these regexes over each page is also CPU-heavy work.
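One cheap way to trim that per-page cost is to compile the pattern once instead of passing the pattern string to re.findall on every call. A small sketch, using a simplified, raw-string version of the question's pattern (the single capture group replaces the findall/sub/split chain):

```python
import re

# Compiled once at module level; workers reuse it for every page.
SNAP_RE = re.compile(r'content=".*sn[a]*p[a-z]*\s*[^a-z0-9]\s*([\w.-]+)')

def find_snap(bio):
    """Return the handle following a 'snap...' mention, or None."""
    match = SNAP_RE.search(bio.lower())
    return match.group(1) if match else None
```

This is a micro-optimization, though; the dominant costs remain the network round-trips and the backtracking-heavy `.*` prefix.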

As the quoted text says, you can try one of the real process pools in the multiprocessing module instead of dummy, or you can use concurrent.futures.ProcessPoolExecutor.
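A minimal sketch of that restructuring with concurrent.futures. The names and canned pages here are hypothetical so the example runs offline; in the real script each worker would fetch its URL with requests. ThreadPoolExecutor is shown because fetching pages is I/O-bound; swapping in ProcessPoolExecutor is a one-line change if the regex work dominates:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Canned pages stand in for requests.get(url).text (hypothetical data).
PAGES = {
    "https://instagram.com/alice": 'content="snap alice_123 more"',
    "https://instagram.com/bob": 'content="no handle here"',
}

def load_info(url):
    page = PAGES.get(url, "").lower()  # real code: requests.get(url).text.lower()
    match = re.search(r'sn[a]*p[a-z]*\s*[^a-z0-9]\s*([\w.-]+)', page)
    return match.group(1) if match else None

urls = list(PAGES)
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() returns results in input order; collecting them here means the
    # output file is written from the main thread only, so workers never
    # share a file handle.
    results = [snap for snap in pool.map(load_info, urls) if snap is not None]
```

As a side benefit, returning results instead of calling fw.write inside the workers removes the concurrent writes to the shared file handle in the original code.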
