How to speed up returning a 20MB JSON file from a Python-Flask application?

I am trying to call an API which in turn triggers a stored procedure in our SQL Server database. This is how I coded it:

from flask_restful import Resource

class Api_Name(Resource):

    def get(self):
        try:
            engine = database_engine  # engine is created elsewhere in the app
            connection = engine.connect()
            # run the stored procedure and capture its return value
            sql = "DECLARE @return_value int EXEC @return_value = [dbname].[dbo].[proc_name]"
            return call_proc(sql, connection)
        except Exception as e:
            return {'message': 'Proc execution failed with error => {error}'.format(error=e)}, 400

call_proc is the method where I return the JSON from the database:

import json

from flask import Response

def call_proc(sql: str, connection):
    try:
        json_data = []
        rv = connection.execute(sql)
        for result in rv:
            # one dict per row: column name -> value
            json_data.append(dict(zip(result.keys(), result)))
        return Response(json.dumps(json_data), status=200)
    except Exception as e:
        return {'message': '{error}'.format(error=e)}, 400
    finally:
        connection.close()

The problem with the output is the way the JSON is returned and its size. At first the API took 1 minute 30 seconds, when the return statement was like this:

case1: return Response(json.dumps(json_data), status=200, mimetype='application/json')

After looking online, I found that the above statement tries to prettify the JSON, so I removed mimetype from the response and made it:

case2: return Response(json.dumps(json_data), status=200)

The API now runs for 30 seconds; the JSON output is not aligned prettily, but it is still valid JSON. I see that the output size of the JSON returned from the API is close to 20MB. I observed this in the Postman response:

Status: 200 OK    Time: 29s    Size: 19MB

The difference in JSON output:

case1:

[   {
        "col1":"val1",
        "col2":"val2"
    },
    {
        "col1":"val1",
        "col2":"val2"
    }
]

case2:

[{"col1":"val1","col2":"val2"},{"col1":"val1","col2":"val2"}]

Is the output from the two aforementioned cases actually different? If so, how can I fix the problem? If there is no difference, is there any way to speed this up further and reduce the run time even more, such as by compressing the JSON I am returning?

You can use gzip compression to take your plain text from megabytes down to kilobytes, or even use the flask-compress library for that. I'd also suggest using ujson to make the dumps() call faster.

import gzip

from flask import Flask, make_response
import ujson as json

app = Flask(__name__)


@app.route('/data.json')
def compress():
    compression_level = 5  # out of a maximum of 9
    data = [
        {"col1": "val1", "col2": "val2"},
        {"col1": "val1", "col2": "val2"}
    ]
    content = gzip.compress(json.dumps(data).encode('utf8'), compression_level)
    response = make_response(content)
    response.headers['Content-Length'] = len(content)
    response.headers['Content-Encoding'] = 'gzip'
    return response
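If you go the flask-compress route instead, the extension compresses responses transparently for clients that advertise gzip support, so the view can return JSON as usual. A minimal sketch, assuming the library is installed; the route and sample data are illustrative, not from the question:

from flask import Flask, jsonify
from flask_compress import Compress

app = Flask(__name__)
Compress(app)  # transparently gzips responses when the client sends Accept-Encoding: gzip

@app.route('/rows.json')
def rows():
    # illustrative payload; in the real app this would come from the database
    data = [
        {"col1": "val1", "col2": "val2"},
        {"col1": "val1", "col2": "val2"}
    ]
    return jsonify(data)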


First of all, profile: if 90% of the time is being spent transferring across the network, then optimising processing speed is less useful than optimising transfer speed (for example, by compressing the response as wowkin recommended, though the web server may be configured to do this automatically if you are using one).

Assuming that constructing the JSON is slow, if you control the database code you could use its JSON capabilities to serialise the data and avoid doing it at the Python layer. For example,

SELECT col1, col2
FROM tbl
WHERE col3 > 42
FOR JSON AUTO

would give you

[
    {
        "col1": "foo",
        "col2": 1
    },
    {
        "col1": "bar",
        "col2": 2
    },
    ...
]

Nested structures can be created too, as described in the docs.

If the requester only needs the data, return it as a download using Flask's send_file feature and avoid the cost of constructing an HTML response:

from io import BytesIO
from flask import send_file

def call_proc(sql: str, connection):
    try:
        rv = connection.execute(sql)
        # SQL Server splits a long FOR JSON result across multiple rows,
        # so concatenate the chunks instead of reading only the first row
        json_data = ''.join(row[0] for row in rv)
        # BytesIO expects encoded data; if you can get the server to encode
        # the data instead it may be faster.
        encoded_json = json_data.encode('utf-8')
        buf = BytesIO(encoded_json)
        # download_name is attachment_filename on Flask < 2.0
        return send_file(buf, mimetype='application/json', as_attachment=True,
                         download_name='data.json', conditional=True)
    except Exception as e:
        return {'message': '{error}'.format(error=e)}, 400
    finally:
        connection.close()

You need to implement pagination on your API. 19MB is absurdly large and will lead to some very annoyed users.

gzip and cleverness with the JSON responses will sadly not be enough; you'll need to put in a bit more legwork.

Luckily, there are many pagination questions and answers, and Flask's modular approach means that someone has probably written a module that's applicable to your problem. I'd start off by re-implementing the method with an ORM; I hear sqlalchemy is quite good.
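A minimal sketch of what that pagination could look like, assuming SQL Server 2012+ (for OFFSET/FETCH) and an indexed, sortable column; the route, table, column names, and connection string are illustrative, not from the question:

from flask import Flask, request, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
# database_engine as in the question; the connection string is a placeholder
database_engine = create_engine("mssql+pyodbc://user:pass@dsn")

PAGE_SIZE = 1000  # rows per page; tune to taste

@app.route('/records')
def records():
    page = int(request.args.get('page', 0))
    sql = text(
        "SELECT col1, col2 FROM tbl ORDER BY col1 "
        "OFFSET :offset ROWS FETCH NEXT :limit ROWS ONLY"
    )
    with database_engine.connect() as connection:
        rv = connection.execute(sql, {"offset": page * PAGE_SIZE, "limit": PAGE_SIZE})
        rows = [dict(zip(rv.keys(), row)) for row in rv]
    return jsonify(results=rows, page=page)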

To answer your question:

1 - Both JSONs are semantically identical. You can use http://www.jsondiff.com to compare two JSON documents.

2 - I would recommend you chunk your data and send it across the network in pieces.

This might help: https://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html
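On the server side, a minimal sketch of streaming the rows out in chunks with a generator, so the full 20MB string is never built in memory at once; the rows iterable of dicts is an assumption standing in for the database cursor:

import json
from flask import Response

def stream_json(rows):
    # yield a valid JSON array piece by piece instead of dumping it all at once
    def generate():
        yield '['
        for i, row in enumerate(rows):
            # comma before every element except the first
            yield (',' if i else '') + json.dumps(row)
        yield ']'
    return Response(generate(), mimetype='application/json')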

TL;DR: Try restructuring your JSON payload (i.e. change the schema).

I see that you are constructing the JSON response in one of your APIs. Currently, your JSON payload looks something like:

[
  {
    "col0": "val00",
    "col1": "val01"
  },
  {
    "col0": "val10",
    "col1": "val11"
  }
  ...
]

I suggest you restructure it so that each (first-level) key in your JSON represents an entire column. For the above case, it would become something like:

{
  "col0": ["val00", "val10", "val20", ...],
  "col1": ["val01", "val11", "val21", ...]
}
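Converting a row-oriented result into that column-oriented shape is straightforward in Python; a minimal sketch, assuming every row has the same keys:

def rows_to_columns(rows):
    # [{"col0": "a", "col1": "b"}, ...] -> {"col0": ["a", ...], "col1": ["b", ...]}
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}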

Here are the results from some offline tests I performed.

Experiment variables:

  • NUMBER_OF_COLUMNS = 10
  • NUMBER_OF_ROWS = 100000
  • LENGTH_OF_STR_DATA = 5

#!/usr/bin/env python3

import json

NUMBER_OF_COLUMNS = 10
NUMBER_OF_ROWS = 100000
LENGTH_OF_STR_DATA = 5

def get_column_name(id_): 
    return 'col%d' % id_ 

def random_data(): 
    import string 
    import random 
    return ''.join(random.choices(string.ascii_letters, k=LENGTH_OF_STR_DATA))

def get_row(): 
    return { 
        get_column_name(i): random_data() 
        for i in range(NUMBER_OF_COLUMNS) 
    }

# data1 has same schema as your JSON
data1 = [ 
    get_row() for _ in range(NUMBER_OF_ROWS) 
]

with open("/var/tmp/1.json", "w") as f: 
    json.dump(data1, f) 

def get_column(): 
    return [random_data() for _ in range(NUMBER_OF_ROWS)] 

# data2 has the new proposed schema, to help you reduce the size
data2 = { 
    get_column_name(i): get_column() 
    for i in range(NUMBER_OF_COLUMNS) 
}

with open("/var/tmp/2.json", "w") as f: 
    json.dump(data2, f) 

Comparing the sizes of the two JSONs:

$ du -h /var/tmp/1.json
17M

$ du -h /var/tmp/2.json
8.6M

In this case, the size was almost halved.

I would suggest you do the following:

  • First and foremost, profile your code to find the real culprit. If it really is the payload size, proceed further.
  • Try to change your JSON's schema (as suggested above).
  • Compress your payload before sending it, either at your Flask WSGI app layer or at your web server level, if you are running your Flask app behind a production-grade web server like Apache or Nginx.

For large data that you can't paginate, using something like ndjson (or any type of delimited record format) can really reduce the server resources needed, since you avoid holding the whole JSON object in memory. You would need access to the response stream to write each object/line to the response, though.

The response

[   {
        "col1":"val1",
        "col2":"val2"
    },
    {
        "col1":"val1",
        "col2":"val2"
    }
]

would end up looking like:

{"col1":"val1","col2":"val2"}
{"col1":"val1","col2":"val2"}

This also has advantages on the client, since you can parse and process each line on its own as well.
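A sketch of producing that format from Flask with a streamed response; note that application/x-ndjson is a common convention rather than a registered media type, and the rows iterable is assumed to come from the database:

import json
from flask import Response

def stream_ndjson(rows):
    def generate():
        for row in rows:
            # one self-contained JSON document per line
            yield json.dumps(row) + '\n'
    return Response(generate(), mimetype='application/x-ndjson')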

If you aren't dealing with nested data structures, responding with a CSV is going to be even smaller.

I want to note that there is a standard way to write a sequence of separate records in JSON, described in RFC 7464. For each record:

  1. Write the record separator byte (0x1E).
  2. Write the JSON record, a regular JSON document that can also contain inner line breaks, in UTF-8.
  3. Write the line feed byte (0x0A).

(Note that the JSON text sequence format, as it's called, uses a more liberal syntax for parsing text sequences of this kind; see the RFC for details.)

In your example, the JSON text sequence would look as follows, where \x1E and \x0A are the record separator and line feed bytes, respectively:

 \x1E{"col1":"val1","col2":"val2"}\x0A\x1E{"col1":"val1","col2":"val2"}\x0A

Since the JSON text sequence format allows inner line breaks, you can write each JSON record as you naturally would, as in the following example:

 \x1E{
    "col1":"val1",
    "col2":"val2"}
 \x0A\x1E{
    "col1":"val1",
    "col2":"val2"
 }\x0A

Notice that the media type for JSON text sequences is not application/json but application/json-seq; see the RFC.
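A sketch of emitting such a sequence from Flask, following the three steps above (record separator, JSON document, line feed); the rows iterable is an assumption standing in for the database cursor:

import json
from flask import Response

RS = '\x1e'  # record separator byte
LF = '\n'    # line feed byte

def stream_json_seq(rows):
    def generate():
        for row in rows:
            yield RS + json.dumps(row) + LF
    return Response(generate(), mimetype='application/json-seq')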
