
How to speed up returning a 20MB Json file from a Python-Flask application?

I am trying to call an API which in turn triggers a stored procedure in our SQL Server database. This is how I coded it.

class Api_Name(Resource):

    def __init__(self):
        pass

    @classmethod
    def get(self):
        try:
            engine = database_engine
            connection = engine.connect()
            sql = "DECLARE @return_value int EXEC @return_value = [dbname].[dbo].[proc_name])
            return call_proc(sql, apiname, starttime, connection)
        except Exception as e:
            return {'message': 'Proc execution failed with error => {error}'.format(error=e)}, 400
        pass

call_proc is the method where I return the JSON from the database.

import json
from flask import Response

def call_proc(sql: str, connection):
    try:
        json_data = []
        rv = connection.execute(sql)
        for result in rv:
            json_data.append(dict(zip(result.keys(), result)))
        return Response(json.dumps(json_data), status=200)
    except Exception as e:
        return {'message': '{error}'.format(error=e)}, 400
    finally:
        connection.close()

The problem with the output is the way the JSON is returned and its size. At first the API took 1 minute 30 seconds, when the return statement was like this:

case1: return Response(json.dumps(json_data), status=200, mimetype='application/json')

After looking online, I found that the above statement was trying to prettify the JSON, so I removed the mimetype from the response and made it:

case2: return Response(json.dumps(json_data), status=200)

Now the API runs for 30 seconds; the JSON output is not aligned as nicely, but it is still valid JSON. The size of the JSON returned from the API is close to 20 MB. I observed this in the Postman response:

Status: 200 OK    Time: 29s    Size: 19MB

The difference in JSON output:

case1:

[   {
        "col1":"val1",
        "col2":"val2"
    },
    {
        "col1":"val1",
        "col2":"val2"
    }
]

case2:

[{"col1":"val1","col2":"val2"},{"col1":"val1","col2":"val2"}]

Is the output from the two aforementioned cases actually different? If so, how can I fix the problem? If there is no difference, is there any way to speed this up further and reduce the run time, for example by compressing the JSON I am returning?

You can use gzip compression to shrink your plain-text payload from megabytes down to kilobytes, or use the flask-compress library for that.
I would also suggest using ujson to make the dumps() call faster.

import gzip

from flask import Flask, make_response
import ujson as json

app = Flask(__name__)


@app.route('/data.json')
def compress():
    compression_level = 5  # out of a maximum of 9
    data = [
        {"col1": "val1", "col2": "val2"},
        {"col1": "val1", "col2": "val2"}
    ]
    # Compress the serialised JSON and tell the client it is gzip-encoded.
    content = gzip.compress(json.dumps(data).encode('utf8'), compression_level)
    response = make_response(content)
    response.headers['Content-Length'] = len(content)
    response.headers['Content-Encoding'] = 'gzip'
    return response
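
If you would rather not manage gzip by hand, flask-compress can apply the same compression to every response transparently. A minimal sketch, assuming a standard Flask app object (the config values shown are optional tuning, not requirements):

from flask import Flask, jsonify
from flask_compress import Compress

app = Flask(__name__)
app.config['COMPRESS_LEVEL'] = 5                        # same trade-off as above
app.config['COMPRESS_MIMETYPES'] = ['application/json']
Compress(app)  # compresses responses when the client sends Accept-Encoding: gzip


@app.route('/data.json')
def data():
    rows = [
        {"col1": "val1", "col2": "val2"},
        {"col1": "val1", "col2": "val2"}
    ]
    return jsonify(rows)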


First of all, profile: if 90% of the time is being spent transferring across the network, then optimising processing speed is less useful than optimising transfer speed (for example, by compressing the response as wowkin recommended, though the web server may be configured to do this automatically, if you are using one).
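
As a rough first check (a sketch, not a proper profiler run, reusing the sql and connection objects from the question), time the database fetch and the serialisation separately to see where the 30 seconds actually go:

import json
import time

t0 = time.perf_counter()
rows = [dict(zip(r.keys(), r)) for r in connection.execute(sql)]  # fetch + convert
t1 = time.perf_counter()
body = json.dumps(rows)                                           # serialise
t2 = time.perf_counter()
print(f"fetch: {t1 - t0:.2f}s  dumps: {t2 - t1:.2f}s  payload: {len(body) / 1e6:.1f} MB")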

Assuming that constructing the JSON is slow, if you control the database code you could use its JSON capabilities to serialise the data, and avoid doing it at the Python layer. For example,

SELECT col1, col2
FROM tbl
WHERE col3 > 42
FOR JSON AUTO

would give you

[
    {
        "col1": "foo",
        "col2": 1
    },
    {
        "col1": "bar",
        "col2": 2
    },
    ...
]

Nested structures can be created too, as described in the docs.

If the requester only needs the data, return it as a download using flask's send_file feature and avoid the cost of constructing an HTML response:

from io import BytesIO
from flask import send_file

def call_proc(sql: str, connection):
    try:
        rv = connection.execute(sql)
        json_data = rv.fetchone()[0]
        # BytesIO expects encoded data; if you can get the server to encode 
        # the data instead it may be faster.
        encoded_json = json_data.encode('utf-8')
        buf = BytesIO(encoded_json)
        # send_file needs a filename when a file-like object is sent with
        # as_attachment=True (download_name on Flask >= 2.0, attachment_filename
        # on older versions).
        return send_file(buf, mimetype='application/json', as_attachment=True,
                         download_name='data.json', conditional=True)
    except Exception as e:
        return {'message': '{error}'.format(error=e)}, 400
    finally:
        connection.close()
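
One caveat worth checking with your driver: SQL Server streams a large FOR JSON result back as several rows of roughly 2 KB each, so fetchone()[0] may only return the first fragment. A defensive sketch concatenates whatever rows come back:

rv = connection.execute(sql)
json_data = ''.join(row[0] for row in rv)  # join the chunks the server may split the JSON into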

You need to implement pagination on your API. 19MB is absurdly large and will lead to some very annoyed users.

gzip and cleverness with the JSON responses will sadly not be enough; you'll need to put in a bit more legwork.

Luckily, there are many pagination questions and answers, and Flask's modular approach means someone has probably already written a module that's applicable to your problem. I'd start off by re-implementing the method with an ORM; I hear that sqlalchemy is quite good.
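
As a rough illustration only, a paginated endpoint could look something like the sketch below. It assumes a SQLAlchemy connection like the one in the question; the table name, ordering column and the 500-row cap are placeholders, and OFFSET ... FETCH requires SQL Server 2012 or later:

from flask import request, jsonify
from sqlalchemy import text


@app.route('/records')
def records():
    page = max(int(request.args.get('page', 1)), 1)
    per_page = min(int(request.args.get('per_page', 100)), 500)  # cap the page size
    offset = (page - 1) * per_page
    sql = text("""
        SELECT col1, col2
        FROM tbl                     -- placeholder table
        ORDER BY col1                -- OFFSET/FETCH requires an ORDER BY
        OFFSET :offset ROWS FETCH NEXT :limit ROWS ONLY
    """)
    rows = connection.execute(sql, {"offset": offset, "limit": per_page})
    items = [dict(zip(r.keys(), r)) for r in rows]  # same row handling as the question
    return jsonify(page=page, per_page=per_page, items=items)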

To answer your question:

1 - Both JSON outputs are semantically identical. You can use http://www.jsondiff.com to compare two JSON documents.

2 - I would recommend chunking your data and sending it across the network in pieces.

This might help: https://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html
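
The linked article boils down to streaming the download on the client with the requests module, so the whole 20 MB never has to sit in memory at once. A sketch (the URL, file name and chunk size are placeholders):

import requests

with requests.get('http://localhost:5000/data.json', stream=True) as r:
    r.raise_for_status()
    with open('data.json', 'wb') as f:
        for chunk in r.iter_content(chunk_size=64 * 1024):  # read 64 KB at a time
            f.write(chunk)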

TL;DR: Try restructuring your JSON payload (i.e., change the schema).

I see that you are constructing the JSON response in one of your APIs. Currently, your JSON payload looks something like:

[
  {
    "col0": "val00",
    "col1": "val01"
  },
  {
    "col0": "val10",
    "col1": "val11"
  }
  ...
]

I suggest you restructure it in such a way that each (first level) key in your JSON represents the entire column. So, for the above case, it will become something like:

{
  "col0": ["val00", "val10", "val20", ...],
  "col1": ["val01", "val11", "val21", ...]
}
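
For the row dicts you already build in call_proc, the pivot is a small transform; a sketch, assuming every row has the same columns:

rows = [
    {"col0": "val00", "col1": "val01"},
    {"col0": "val10", "col1": "val11"},
]

# Pivot the list of row dicts into one dict of column lists.
columns = {key: [row[key] for row in rows] for key in rows[0]}
# {'col0': ['val00', 'val10'], 'col1': ['val01', 'val11']}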

Here are the results from some offline tests I performed.

Experiment variables:

  • NUMBER_OF_COLUMNS = 10
  • NUMBER_OF_ROWS = 100000
  • LENGTH_OF_STR_DATA = 5
#!/usr/bin/env python3

import json

NUMBER_OF_COLUMNS = 10
NUMBER_OF_ROWS = 100000
LENGTH_OF_STR_DATA = 5

def get_column_name(id_): 
    return 'col%d' % id_ 

def random_data(): 
    import string 
    import random 
    return ''.join(random.choices(string.ascii_letters, k=LENGTH_OF_STR_DATA))

def get_row(): 
    return { 
        get_column_name(i): random_data() 
        for i in range(NUMBER_OF_COLUMNS) 
    }

# data1 has same schema as your JSON
data1 = [ 
    get_row() for _ in range(NUMBER_OF_ROWS) 
]

with open("/var/tmp/1.json", "w") as f: 
    json.dump(data1, f) 

def get_column(): 
    return [random_data() for _ in range(NUMBER_OF_ROWS)] 

# data2 has the new proposed schema, to help you reduce the size
data2 = { 
    get_column_name(i): get_column() 
    for i in range(NUMBER_OF_COLUMNS) 
}

with open("/var/tmp/2.json", "w") as f: 
    json.dump(data2, f) 

Comparing sizes of the two JSONs:

$ du -h /var/tmp/1.json
17M

$ du -h /var/tmp/2.json
8.6M

In this case, the size was almost halved.

I would suggest you do the following:

  • First and foremost, profile your code to see the real culprit. If it is really the payload size, proceed further.
  • Try to change your JSON's schema (as suggested above)
  • Compress your payload before sending (either from your Flask WSGI app layer or your webserver level - if you are running your Flask app behind some production grade webserver like Apache or Nginx)

For large data sets that you can't paginate, using something like ndjson (or any delimited record format) can really reduce the server resources needed, since you avoid holding the whole JSON object in memory. You would need access to the response stream to write each object/line to the response, though; a sketch follows the example below.

The response

[   {
        "col1":"val1",
        "col2":"val2"
    },
    {
        "col1":"val1",
        "col2":"val2"
    }
]

Would end up looking like

{"col1":"val1","col2":"val2"}
{"col1":"val1","col2":"val2"}

This also has advantages on the client, since you can parse and process each line on its own.
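
On the Flask side, one way to get at the response stream is to hand Response a generator, so each row is serialised and sent as it is produced instead of being accumulated in a list first. A sketch reusing the cursor handling from the question:

import json

from flask import Response


def stream_ndjson(sql, connection):
    def generate():
        try:
            for row in connection.execute(sql):
                # One JSON document per line, newline-delimited.
                yield json.dumps(dict(zip(row.keys(), row))) + "\n"
        finally:
            connection.close()
    return Response(generate(), mimetype='application/x-ndjson')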

If you aren't dealing with nested data structures, responding with CSV is going to be even smaller.

I want to note that there is a standard way to write a sequence of separate records in JSON, described in RFC 7464. For each record:

  1. Write the record separator byte (0x1E).
  2. Write the JSON record, which is a regular JSON document that can also contain inner line breaks, in UTF-8.
  3. Write the line feed byte (0x0A).

(Note that the JSON text sequence format, as it's called, uses a more liberal syntax for parsing text sequences of this kind; see the RFC for details.)

In your example, the JSON text sequence would look as follows, where \x1E and \x0A are the record separator and line feed bytes, respectively:

 \x1E{"col1":"val1","col2":"val2"}\x0A\x1E{"col1":"val1","col2":"val2"}\x0A

Since the JSON text sequence format allows inner line breaks, you can write each JSON record as you naturally would, as in the following example:

 \x1E{
    "col1":"val1",
    "col2":"val2"}
 \x0A\x1E{
    "col1":"val1",
    "col2":"val2"
 }\x0A

Notice that the media type for JSON text sequences is not application/json but application/json-seq; see the RFC.
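
Producing such a sequence from Flask is the same streaming idea with the two framing bytes added around each record; a sketch (the application/json-seq media type comes from the RFC):

import json

from flask import Response

RS = b'\x1e'  # record separator (0x1E)
LF = b'\x0a'  # line feed (0x0A)


def stream_json_seq(rows):
    def generate():
        for row in rows:
            # RS + JSON document (UTF-8) + LF, per RFC 7464.
            yield RS + json.dumps(row).encode('utf-8') + LF
    return Response(generate(), mimetype='application/json-seq')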
