
How can I improve my multithreading speed and efficiency in Python?

My code takes 130 seconds to make 700 requests even though I use 100 threads, which is really slow and frustrating.

My code reads URLs from a file (urls.txt), edits each parameter value in the URL, and makes a request to every modified URL as well as to the original (unedited) URL.

Let me show you an example:

Let's consider the following url:

https://www.test.com/index.php?parameter=value1&parameter2=value2

The URL contains 2 parameters, so my code will make 3 requests.

1 request to the original url:

https://www.test.com/index.php?parameter=value1&parameter2=value2

1 request to the first modified value:

https://www.test.com/index.php?parameter=replaced_value&parameter2=value2

1 request to the second modified value:

https://www.test.com/index.php?parameter=value1&parameter2=replaced_value
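For illustration, here is one way these variants can be enumerated with urllib.parse. This is only a sketch of the pattern, not my actual code; the REPLACED placeholder is hypothetical:

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

REPLACED = 'replaced_value'   # hypothetical placeholder value

def url_variants(url):
    """Yield the original url, then one variant per query parameter."""
    yield url
    parts = urlparse(url)
    params = parse_qsl(parts.query)
    for i in range(len(params)):
        # replace only the i-th parameter's value, keep the others intact
        modified = [(n, REPLACED if j == i else v)
                    for j, (n, v) in enumerate(params)]
        yield urlunparse(parts._replace(query=urlencode(modified)))

for u in url_variants('https://www.test.com/index.php?parameter=value1&parameter2=value2'):
    print(u)   # prints the original url plus the 2 modified variants above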

I have tried using asyncio for this, but I had more success with concurrent.futures.

I even tried increasing the number of threads, which I thought was the issue at first, but it wasn't: if I increased the thread count considerably, the script would freeze at the start for 30-50 seconds, and it really didn't increase the speed as I expected.

I assume this is an issue with how I build up the multithreading in my code, because I have seen other people achieve incredible speeds with concurrent.futures.

import requests
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

start = time.time()

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
def make_request(url2):
    try:
        if '?' in url2 and '=' in url2:   # was: if '?' and '=': which is always true
            request_1 = requests.get(url2, headers=headers, timeout=10)
            url2_modified = url2.split("?")[1]
            times = url2_modified.count("&") + 1   # number of parameters in the query string
            for x in range(times):
                split1 = url2_modified.split("&")[x]
                value = split1.split("=")[1]
                parameter = split1.split("=")[0]
                url = url2.replace('=' + value, '=1')   # note: replaces every occurrence of this value
                # each modified request runs sequentially inside this thread
                request_2 = requests.get(url, stream=True, headers=headers, timeout=10)
                html_1 = request_1.text
                html_2 = request_2.text
                print(request_1.status_code, '-', url2)   # was: int + str, which raises TypeError
                print(request_2.status_code, '-', url)

    except requests.exceptions.RequestException as e:
        return e


def runner():
    threads = []
    with ThreadPoolExecutor(max_workers=100) as executor:
        with open('urls.txt', 'r', errors='ignore') as file1:
            for line in file1:
                threads.append(executor.submit(make_request, line.strip()))

runner()

end = time.time()
print(end - start)

Inside the loop in make_request you run a normal requests.get, and it doesn't use a thread (or any other method) to make it faster, so each request has to wait for the previous one to finish before the next one can run. With, say, five parameters per URL, each worker performs the base request plus five modified requests one after another, so the pool's effective parallelism is divided accordingly.

In make_request I use another ThreadPoolExecutor to run every requests.get (created in the loop) in a separate thread:

executor.submit(make_modified_request, modified_url) 

and it gives me a time of ~1.2s.

If I instead use a normal function call

make_modified_request(modified_url)

then it gives me a time of ~3.2s.


Minimal working example:

I use a real URL, https://httpbin.org/get, so everyone can simply copy and run it.

from concurrent.futures import ThreadPoolExecutor
import requests
import time
#import urllib.parse

# --- constants --- (PEP8: UPPER_CASE_NAMES)

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

# --- functions ---

def make_modified_request(url):
    """Send modified url."""

    print('send:', url)
    response = requests.get(url, stream=True, headers=HEADERS)
    print(response.status_code, '-', url)
    html = response.text   # ???
    # ... code to process HTML ...

def make_request(url):
    """Send normal url and create threads with modified urls."""

    threads = []

    with ThreadPoolExecutor(max_workers=10) as executor:
        print('send:', url)

        # send base url
        response = requests.get(url, headers=HEADERS)
        print(response.status_code, '-', url)
        html = response.text   # ???

        #parts = urllib.parse.urlparse(url)
        #print('query:',  parts.query)
        #arguments = urllib.parse.parse_qs(parts.query)
        #print('arguments:', arguments)   # dict  {'a': ['A'], 'b': ['B'], 'c': ['C'], 'd': ['D'], 'e': ['E']}

        arguments = url.split("?")[1]
        arguments = arguments.split("&")
        arguments = [arg.split("=") for arg in arguments]
        print('arguments:', arguments)    # list [['a', 'A'], ['b', 'B'], ['c', 'C'], ['d', 'D'], ['e', 'E']]

        for name, value in arguments:
            modified_url = url.replace('='+value, '=1')
            print('modified_url:', modified_url)

            # run thread with modified url
            threads.append(executor.submit(make_modified_request, modified_url))

            # run normal function with modified url
            #make_modified_request(modified_url)

    print('[make_request] len(threads):', len(threads))
    
def runner():
    threads = []
    
    with ThreadPoolExecutor(max_workers=10) as executor:
        #fh = open('urls.txt', errors='ignore')
        fh = [
            'https://httpbin.org/get?a=A&b=B&c=C&d=D&e=E', 
            'https://httpbin.org/get?f=F&g=G&h=H&i=I&j=J',
            'https://httpbin.org/get?k=K&l=L&m=M&n=N&o=O',
            'https://httpbin.org/get?a=A&b=B&c=C&d=D&e=E', 
            'https://httpbin.org/get?f=F&g=G&h=H&i=I&j=J',
            'https://httpbin.org/get?k=K&l=L&m=M&n=N&o=O',
           ]

        for line in fh:
            url = line.strip()
            # create thread with url
            threads.append(executor.submit(make_request, url))

    print('[runner] len(threads):', len(threads))

# --- main ---

start = time.time()

runner()

end = time.time()
print('time:', end - start)

BTW:

I was thinking of creating a single

executor = ThreadPoolExecutor(max_workers=10)

and then reusing the same executor in all functions - maybe it would run a little faster - but at this moment I don't have working code.
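A rough sketch of how that single shared executor might look - untested, and only my assumption about the final shape, reusing HEADERS and make_modified_request from the minimal working example above:

from concurrent.futures import ThreadPoolExecutor, wait
import requests

EXECUTOR = ThreadPoolExecutor(max_workers=10)   # one pool shared by all functions

def make_request_shared(url):
    """Like make_request, but submits modified urls to the shared pool."""
    print('send:', url)
    response = requests.get(url, headers=HEADERS)
    print(response.status_code, '-', url)

    arguments = [arg.split("=") for arg in url.split("?")[1].split("&")]

    # return the nested futures so the caller can wait for them
    return [EXECUTOR.submit(make_modified_request, url.replace('='+value, '=1'))
            for name, value in arguments]

def runner_shared(urls):
    base = [EXECUTOR.submit(make_request_shared, url.strip()) for url in urls]
    nested = [f for future in base for f in future.result()]
    wait(nested)   # block until every modified request has finished

The key point is that make_request_shared returns without blocking on its nested futures, so pool workers never sit waiting on each other; only the main thread waits, which avoids the deadlock risk of waiting inside a pooled task.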
