
python - handling HTTP requests asynchronously

I need to generate a PDF report for each entry of a Django queryset. There'll be between 30k and 40k entries.

The PDF is generated through an external API. Since it's currently generated on demand, this is handled synchronously via an HTTP request/response. That will be different for this task, since I think I'll use a Django management command to loop through the queryset and perform the PDF generation.

Which approach should I follow for this task? I thought about 2 possible solutions, although they're technologies I've never used:

1) Celery: assign a task (an HTTP request with a different payload) to a worker, then retrieve it once it's done.

2) requests-futures: using requests in a non-blocking way (see the sketch below).

The goal is to use the API concurrently (e.g. send 10 or 100 HTTP requests simultaneously, depending on how many concurrent requests the API can handle).
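For option 2, a minimal sketch with requests-futures could look roughly like this (endpoint_url, headers and payloads are placeholders for whatever the real project uses, not actual code from it):

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=10)  # up to 10 requests in flight at once

# fire off all the POSTs without blocking
futures = [
    session.post(endpoint_url, headers=headers, json=payload)
    for payload in payloads
]

# collect the responses as they complete
for future in futures:
    response = future.result()
    print(response.status_code)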

Has anybody here handled a similar task who can give advice on how to proceed?

The following is a first attempt, made with multiprocessing (NOTE: most of the code is reused and not written by myself, as I took ownership of this project):

# imports used by the snippet below
import json
import sys
import time
from multiprocessing import Pool
from tempfile import mkdtemp
from timeit import default_timer as timer  # timer() assumed to be timeit.default_timer

import requests
from django.conf import settings
from django.shortcuts import get_object_or_404
from django.utils.text import slugify


class Checker(object):

    def __init__(self, *args, **kwargs):
        # ... various setup
        pass

    # other methods
    # .....

    def run_single(self, uuid, verbose=False):
        """
        run a single PDF generation and local download
        """
        start = timer()
        headers = self.headers

        data, obj = self.get_review_data(uuid)
        if verbose: 
            print("** Report: {} **".format(obj))
        response = requests.post(
            url=self.endpoint_url,
            headers=headers,
            data=json.dumps(data)
        )
        if verbose:
            print('POST - Response: {} \n {} \n {} secs'.format(
                response.status_code,
                response.content,
                response.elapsed.total_seconds())
            )
        run_url = self.check_progress(post_response=response, verbose=True)
        if run_url:
            self.get_file(run_url, obj, verbose=True)
        print("*** Download {}in {} secs".format("(verbose) " if verbose else "", timer()-start))


    def run_all(self, uuids, verbose=True):
        start = timer()
        for obj_uuid in uuids:
            self.run_single(obj_uuid, verbose=verbose)
        print("\n\n### Downloaded {}{} reviews in {} secs".format(
            "(verbose) " if verbose else "",
            len(uuids),
            timer() - start)
        )

    def run_all_multi(self, uuids, workers=4, verbose=True):
        pool = Pool(processes=workers)
        pool.map(self.run_single, uuids)


    def check_progress(self, post_response, attempts_limit=10000, verbose=False):
        """
        check the progress of PDF generation querying periodically the API endpoint
        """
        if post_response.status_code != 200:
            if verbose: print("POST response status code != 200 - exit")
            return None
        url = 'https://apidomain.com/{path}'.format(
            path=post_response.json().get('links', {}).get('self', {}).get('href'),
        )
        job_id = post_response.json().get('jobId', '')
        status = 'Running'
        attempt_counter = 0
        start = timer()
        if verbose: 
            print("GET - url: {}".format(url))
        while status == 'Running':
            attempt_counter += 1
            job_response = requests.get(
                url=url,
                headers=self.headers,
            )
            job_data = job_response.json()
            status = job_data['status']
            message = job_data['message']
            progress = job_data['progress']
            if status == 'Error':
                if verbose:
                    end = timer()
                    print(
                        '{sc} - job_id: {job_id} - error_id: [{error_id}]: {message}'.format(
                            sc=job_response.status_code, 
                            job_id=job_id,
                            error_id=job_data['errorId'], 
                            message=message
                        ), '{} secs'.format(end - start)
                    )
                    print('Attempts: {} \n {}% progress'.format(attempt_counter, progress))
                return None
            if status == 'Complete':
                if verbose:
                    end = timer()
                    print('job_id: {job_id} - Complete - {secs} secs'.format(
                        job_id=job_id,
                        secs=end - start)
                    )
                    print('Attempts: {}'.format(attempt_counter))
                    print('{url}/files/'.format(url=url))
                return '{url}/files/'.format(url=url)
            if attempt_counter >= attempts_limit:
                if verbose:
                    end = timer()
                    print('File failed to generate after {att_limit} retrieve attempts: ({progress}% progress)' \
                          ' - {message}'.format(
                              att_limit=attempts_limit,
                              progress=int(progress * 100),
                              message=message
                          ), '{} secs'.format(end-start))
                return None
            if verbose:
                print('{}% progress  - attempts: {}'.format(progress, attempt_counter), end='\r')
                sys.stdout.flush()
            time.sleep(1)
        if verbose:
            end = timer()
            print(status, 'message: {} - attempts: {} - {} secs'.format(message, attempt_counter, end - start))
        return None

    def get_review_data(self, uuid, host=None, protocol=None):
        review = get_object_or_404(MyModel, uuid=uuid)
        internal_api_headers = {
            'Authorization': 'Token {}'.format(
                review.employee.csod_profile.csod_user_token
            )
        }

        data = requests.get(
            url=a_local_url,
            params={'format': 'json', 'indirect': 'true'},
            headers=internal_api_headers,
        ).json()
        return (data, review)

    def get_file(self, runs_url, obj, verbose=False):

        runs_files_response = requests.get(
            url=runs_url,
            headers=self.headers,
            stream=True,
        )

        runs_files_data = runs_files_response.json()


        file_path = runs_files_data['files'][0]['links']['file']['href'] # remote generated file URI
        file_response_url = 'https://apidomain.com/{path}'.format(path=file_path)
        file_response = requests.get(
            url=file_response_url,
            headers=self.headers,
            params={'userId': settings.CREDENTIALS['userId']},
            stream=True,
        )
        if file_response.status_code != 200:
            if verbose:
                print('error in retrieving file for {r_id}\nurl: {url}\n'.format(
                    r_id=obj.uuid, url=file_response_url)
                )
        local_file_path = '{temp_dir}/{uuid}-{filename}-{employee}.pdf'.format(
            temp_dir=self.local_temp_dir,
            uuid=obj.uuid,
            employee=slugify(obj.employee.get_full_name()),
            filename=slugify(obj.task.name)
        )
        with open(local_file_path, 'wb') as f:
            for block in file_response.iter_content(1024):
                f.write(block)
            if verbose:
                print('\n --> {r} [{uuid}]'.format(r=obj, uuid=obj.uuid))
                print('\n --> File downloaded: {path}'.format(path=local_file_path))

    @classmethod
    def get_temp_directory(cls):
        """
        generate a local unique temporary directory
        """
        return '{temp_dir}/'.format(
            temp_dir=mkdtemp(dir=TEMP_DIR_PREFIX),
        )

if __name__ == "__main__":
    uuids = ...  # list or generator of objs uuids
    checker = Checker()
    checker.run_all_multi(uuids=uuids)

Unfortunately, running checker.run_all_multi has the following effects:

  • the Python shell freezes;
  • no output is printed;
  • no file is generated;
  • I have to kill the console from the command line; the normal keyboard interrupt stops working.

while running checker.run_all does the job normally (one by one).

Any suggestions specifically about why this code doesn't work (and not about what I could use instead of multiprocessing)?

Thanks everyone.

With your frequency (once a year & manually), you don't need Celery or requests-futures.

Create a method like

def record_to_pdf(record):
    # create pdf from record

Then create a management command with code like the following (using multiprocessing.Pool):

from multiprocessing import Pool
pool = Pool(processes=NUMBER_OF_CORES)
pool.map(record_to_pdf, YOUR_QUERYSET)
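
Put together, a minimal management command along those lines might look like this (the app name, command name, model and worker count below are just illustrative placeholders, not part of the original answer):

# e.g. myapp/management/commands/generate_pdfs.py  (hypothetical paths/names)
from multiprocessing import Pool

from django.core.management.base import BaseCommand

from myapp.models import MyModel  # hypothetical model


def record_to_pdf(record):
    # call the external API and download the PDF for this record
    ...


class Command(BaseCommand):
    help = 'Generate a PDF report for every entry of the queryset'

    def handle(self, *args, **options):
        queryset = MyModel.objects.all()
        with Pool(processes=4) as pool:  # adjust the number of workers
            pool.map(record_to_pdf, queryset)

Note that record_to_pdf is defined at module level (not as a bound method) so multiprocessing can pickle it reliably when dispatching work to the pool.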

The management command will not be asynchronous, though. To make it asynchronous you can run it in the background.

Also, if your process is not CPU bound (e.g. it is only calling some API) then, as @Anentropic suggested, you can experiment with a higher number of processes when creating the pool.
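For example (32 is only an illustrative number; the right value depends on how many concurrent requests the API accepts):

# API calls are I/O bound, so the workers mostly wait on the network and the
# pool can be much larger than the number of CPU cores
pool = Pool(processes=32)
pool.map(record_to_pdf, YOUR_QUERYSET)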
