
How to accept concurrent requests using Gunicorn for a Flask API?

I want to accept multiple concurrent requests for a Flask API. The API currently receives a "company name" through a POST method and calls the crawler engine, and each crawling process takes 5-10 minutes to finish. I want to run as many crawler engines in parallel as there are requests. I followed this, but could not get it working. Currently, a second request cancels the first one. How can I achieve this parallelism?

Current API implementation:

app.py

import subprocess

from flask import Flask, request, jsonify, abort
# CompanyNameSchema, SeedListGenerator and RunAllScrapper are project-specific
# modules; their imports are omitted here.

app = Flask(__name__)
app.debug = True

@app.route("/api/v1/crawl", methods=['POST'])
def crawl_end_point():
    if not request.is_json:
        abort(415)

    inputs = CompanyNameSchema(request)
    if not inputs.validate():
        return jsonify(success=False, errors=inputs.errors)

    data = request.get_json()
    company_name = data.get("company_name")
    print(company_name)
    if company_name is not None:
        search = SeedListGenerator(company_name)
        search.start_crawler()

        scrap = RunAllScrapper(company_name)
        scrap.start_all()
        subprocess.call(['/bin/bash', '-i', '-c', 'myconda;scrapy crawl company_profiler;'])
    return 'Data Pushed successfully to Solr Index!', 201

if __name__ == "__main__":
    app.run(host="10.250.36.52", use_reloader=True, threaded=True)

gunicorn.sh

#!/bin/bash
NAME="Crawler-API"
FLASKDIR=/root/Public/company_profiler
SOCKFILE=/root/Public/company_profiler/sock
LOG=./logs/gunicorn/gunicorn.log
PID=./gunicorn.pid

USER=root
GROUP=root

NUM_WORKERS=10 #  generally in the 2-4 x $(NUM_CORES)+1 range
TIMEOUT=1200
#preload_apps = False

# The maximum number of requests a worker will process before restarting.
MAX_REQUESTS=0


echo "Starting $NAME"

# Create the run directory if it doesn't exist
RUNDIR=$(dirname $SOCKFILE)
test -d $RUNDIR || mkdir -p $RUNDIR

# Start your gunicorn
exec gunicorn app:app -b 0.0.0.0:5000 \
  --name $NAME \
  --worker-class gevent \
  --workers $NUM_WORKERS \
  --timeout $TIMEOUT \
  --max-requests $MAX_REQUESTS \
  --keep-alive 900 \
  --graceful-timeout 1200 \
  --worker-connections 5 \
  --user=$USER --group=$GROUP \
  --bind=unix:$SOCKFILE \
  --log-level info \
  --backlog 0 \
  --pid=$PID \
  --access-logformat='%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"' \
  --error-logfile $LOG \
  --log-file=-

Thanks in advance!

A better way is to use a job queue with Redis or something similar. You can create queues for jobs, fetch results, and exchange status with the frontend via API requests. Each job runs in a separate process without blocking the main application. Otherwise you will have to resolve bottlenecks at every step yourself.

A good implementation is the RQ library or flask-rq for Redis.

http://python-rq.org/

  1. Start an instance of Redis (I'm using Docker for it).
  2. Write your own worker like this:
import os

import redis
from rq import Worker, Queue, Connection

listen = ['high', 'default', 'low']

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')

conn = redis.from_url(redis_url)

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(list(map(Queue, listen)))
        worker.work()
  3. Start the workers via Flask or via the console (better for debugging), then create jobs in the queue and track their results:
from flask import session
from redis import Redis
from rq import Queue
from rq.job import Job

conn = Redis()
q = Queue(connection=conn)

def crawl_end_point():
   ...

# add the task to the queue
result = q.enqueue(crawl_end_point, timeout=3600)
# simplest way: save the id of the job
session['j_id'] = result.get_id()
# get the job status
job = Job.fetch(session['j_id'], connection=conn)
job.get_status()
# get the job result
job.result
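
To tie this back to the original API: the crawl endpoint can enqueue the job and return a job id immediately, and a second endpoint can report the job's status. This is only a minimal sketch, assuming a local Redis instance and a hypothetical tasks.run_crawl(company_name) function that wraps the existing SeedListGenerator / RunAllScrapper / scrapy calls:

from flask import Flask, request, jsonify, abort
from redis import Redis
from rq import Queue
from rq.job import Job

# run_crawl(company_name) is a hypothetical wrapper around the existing
# crawler calls; it must also be importable by the RQ worker process.
from tasks import run_crawl

app = Flask(__name__)
conn = Redis()
q = Queue(connection=conn)

@app.route("/api/v1/crawl", methods=['POST'])
def crawl_end_point():
    if not request.is_json:
        abort(415)
    company_name = request.get_json().get("company_name")
    if company_name is None:
        return jsonify(success=False, error="company_name is required"), 400
    # enqueue the long-running crawl and return immediately;
    # each job is executed by a separate RQ worker process
    job = q.enqueue(run_crawl, company_name,
                    job_timeout=3600)  # called "timeout" in older RQ versions
    return jsonify(job_id=job.get_id()), 202

@app.route("/api/v1/crawl/<job_id>", methods=['GET'])
def crawl_status(job_id):
    # poll this endpoint from the frontend until the job is finished
    job = Job.fetch(job_id, connection=conn)
    return jsonify(status=job.get_status(), result=job.result)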

You can also check Celery for this purpose (a minimal sketch follows below): https://stackshare.io/stackups/celery-vs-redis
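
If you go with Celery instead, the structure is similar. This is only a minimal sketch, assuming a local Redis broker and a hypothetical crawl_company task that would call your existing crawler code:

from celery import Celery

# assumes a local Redis instance as broker and result backend; adjust the URLs
celery_app = Celery('crawler',
                    broker='redis://localhost:6379/0',
                    backend='redis://localhost:6379/0')

@celery_app.task
def crawl_company(company_name):
    # call the existing SeedListGenerator / RunAllScrapper / scrapy logic here
    return f"crawl finished for {company_name}"

# Usage from a Flask view (hypothetical):
#   async_result = crawl_company.delay(company_name)
#   async_result.id        # return this id to the client
#   async_result.status    # poll it from a status endpoint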
