How to speed up returning a 20MB JSON file from a Python-Flask application?
I am trying to call an API which in turn triggers a stored procedure in our SQL Server database. This is how I coded it:
class Api_Name(Resource):
    def __init__(self):
        pass

    @classmethod
    def get(self):
        try:
            engine = database_engine
            connection = engine.connect()
            sql = "DECLARE @return_value int; EXEC @return_value = [dbname].[dbo].[proc_name]"
            return call_proc(sql, connection)
        except Exception as e:
            return {'message': 'Proc execution failed with error => {error}'.format(error=e)}, 400
call_proc is the method where I return the JSON from the database:
def call_proc(sql: str, connection):
    try:
        json_data = []
        rv = connection.execute(sql)
        for result in rv:
            json_data.append(dict(zip(result.keys(), result)))
        return Response(json.dumps(json_data), status=200)
    except Exception as e:
        return {'message': '{error}'.format(error=e)}, 400
    finally:
        connection.close()
The problem with the output is the way the JSON is returned, and its size. At first the API took 1 minute 30 seconds, when the return statement was like this:
case1: return Response(json.dumps(json_data), status=200, mimetype='application/json')
After looking online, I found that the above statement was trying to prettify the JSON, so I removed mimetype from the response and made it:
case2: return Response(json.dumps(json_data), status=200)
The API now runs in 30 seconds; the JSON output is not pretty-printed, but it is still valid JSON. The output size of the JSON returned from the API is close to 20MB. I observed this in the Postman response:
Status: 200 OK Time: 29s Size: 19MB
The difference in JSON output:
case1:
[ {
"col1":"val1",
"col2":"val2"
},
{
"col1":"val1",
"col2":"val2"
}
]
case2:
[{"col1":"val1","col2":"val2"},{"col1":"val1","col2":"val2"}]
Is the output from the two aforementioned cases actually different? If so, how can I fix the problem? If there is no difference, is there any way I can speed this up and reduce the run time further, for example by compressing the JSON I am returning?
You can use gzip compression to shrink your plain-text payload from megabytes down to kilobytes, or even use the flask-compress library for that.
Also, I'd suggest using ujson to make the dumps() call faster.
import gzip

from flask import make_response
import ujson as json


@app.route('/data.json')
def compress():
    compression_level = 5  # out of 9 max
    data = [
        {"col1": "val1", "col2": "val2"},
        {"col1": "val1", "col2": "val2"}
    ]
    content = gzip.compress(json.dumps(data).encode('utf8'), compression_level)
    response = make_response(content)
    response.headers['Content-Length'] = len(content)
    response.headers['Content-Encoding'] = 'gzip'
    return response
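To get a feel for how much gzip helps with this kind of repetitive JSON before wiring it into Flask, here is a stdlib-only sketch (the row count is made up for illustration):

```python
import gzip
import json

# Build a repetitive row-oriented payload like the one in the question.
rows = [{"col1": "val1", "col2": "val2"} for _ in range(10000)]
raw = json.dumps(rows).encode("utf-8")

# Level 5 mirrors the answer above; 9 is the maximum but slower.
compressed = gzip.compress(raw, compresslevel=5)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```

Repetitive tabular JSON like this typically compresses by well over an order of magnitude, which is why the Content-Encoding approach pays off for a 20MB response.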
First of all, profile: if 90% of the time is spent transferring across the network, then optimising processing speed is less useful than optimising transfer speed (for example by compressing the response, as wowkin recommended, though the web server may be configured to do this automatically if you are using one).
Assuming that constructing the JSON is slow, if you control the database code you could use its JSON capabilities to serialise the data, and avoid doing it at the Python layer. For example,
SELECT col1, col2
FROM tbl
WHERE col3 > 42
FOR JSON AUTO
would give you
[
{
"col1": "foo",
"col2": 1
},
{
"col1": "bar",
"col2": 2
},
...
]
Nested structures can be created too, as described in the docs.
If the requester only needs the data, return it as a download using Flask's send_file feature and avoid the cost of constructing an HTML response:
from io import BytesIO

from flask import send_file


def call_proc(sql: str, connection):
    try:
        rv = connection.execute(sql)
        json_data = rv.fetchone()[0]
        # BytesIO expects encoded data; if you can get the server to encode
        # the data instead it may be faster.
        encoded_json = json_data.encode('utf-8')
        buf = BytesIO(encoded_json)
        return send_file(buf, mimetype='application/json',
                         as_attachment=True, conditional=True)
    except Exception as e:
        return {'message': '{error}'.format(error=e)}, 400
    finally:
        connection.close()
You need to implement pagination on your API. 19MB is absurdly large and will lead to some very annoyed users. gzip and cleverness with the JSON responses will sadly not be enough; you'll need to put in a bit more legwork.
Luckily, there are many pagination questions and answers, and Flask's modular approach means someone has probably written a module that's applicable to your problem. I'd start off by re-implementing the method with an ORM. I hear SQLAlchemy is quite good.
To answer your question:
1 - Both JSONs are semantically identical. You can use http://www.jsondiff.com to compare the two.
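You can also check this locally with nothing but the standard json module: whitespace is insignificant in JSON, so both outputs parse to equal Python objects:

```python
import json

# case1: the pretty-printed response from the question.
case1 = """[ {
"col1":"val1",
"col2":"val2"
},
{
"col1":"val1",
"col2":"val2"
}
]"""

# case2: the compact response.
case2 = '[{"col1":"val1","col2":"val2"},{"col1":"val1","col2":"val2"}]'

# Both deserialize to the same structure.
print(json.loads(case1) == json.loads(case2))  # True
```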
2 - I would recommend you chunk your data and send it across the network in pieces.
This might help: https://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html
TL;DR: Try restructuring your JSON payload (i.e. change the schema).
I see that you are constructing the JSON response in one of your APIs. Currently, your JSON payload looks something like:
[
{
"col0": "val00",
"col1": "val01"
},
{
"col0": "val10",
"col1": "val11"
}
...
]
I suggest you restructure it so that each (first-level) key in your JSON represents an entire column. For the above case, it will become something like:
{
"col0": ["val00", "val10", "val20", ...],
"col1": ["val01", "val11", "val21", ...]
}
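A minimal sketch of pivoting the row-oriented records into this columnar shape (the helper name rows_to_columns is my own):

```python
import json


def rows_to_columns(rows):
    """Pivot a list of row dicts into one dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


rows = [
    {"col0": "val00", "col1": "val01"},
    {"col0": "val10", "col1": "val11"},
]
print(json.dumps(rows_to_columns(rows)))
```

The saving comes from emitting each column name once instead of once per row, which is exactly what the size experiment below measures.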
Here are the results from some offline tests I performed.
Experiment variables:
#!/usr/bin/env python3
import json
import random
import string

NUMBER_OF_COLUMNS = 10
NUMBER_OF_ROWS = 100000
LENGTH_OF_STR_DATA = 5


def get_column_name(id_):
    return 'col%d' % id_


def random_data():
    return ''.join(random.choices(string.ascii_letters, k=LENGTH_OF_STR_DATA))


def get_row():
    return {
        get_column_name(i): random_data()
        for i in range(NUMBER_OF_COLUMNS)
    }


# data1 has the same schema as your JSON
data1 = [get_row() for _ in range(NUMBER_OF_ROWS)]

with open("/var/tmp/1.json", "w") as f:
    json.dump(data1, f)


def get_column():
    return [random_data() for _ in range(NUMBER_OF_ROWS)]


# data2 has the new proposed schema, to help you reduce the size
data2 = {
    get_column_name(i): get_column()
    for i in range(NUMBER_OF_COLUMNS)
}

with open("/var/tmp/2.json", "w") as f:
    json.dump(data2, f)
Comparing the sizes of the two JSONs:
$ du -h /var/tmp/1.json
17M
$ du -h /var/tmp/2.json
8.6M
In this case, the size was almost halved.
I would suggest you do the following:
For large data that you can't paginate, using something like ndjson (or any type of delimited record format) can really reduce the server resources needed, since you avoid holding the whole JSON object in memory. You would need access to the response stream to write each object/line to the response, though.
The response
[ {
"col1":"val1",
"col2":"val2"
},
{
"col1":"val1",
"col2":"val2"
}
]
would end up looking like
{"col1":"val1","col2":"val2"}
{"col1":"val1","col2":"val2"}
This also has advantages on the client, since you can parse and process each line on its own as well.
If you aren't dealing with nested data structures, responding with a CSV is going to be even smaller.
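To make the idea concrete, here is a stdlib-only sketch of producing and consuming newline-delimited JSON; in a real Flask app you would yield these lines from a streaming response rather than buffering them:

```python
import io
import json

rows = [
    {"col1": "val1", "col2": "val2"},
    {"col1": "val1", "col2": "val2"},
]

# Write: one compact JSON document per line (ndjson).
buf = io.StringIO()
for row in rows:
    buf.write(json.dumps(row, separators=(",", ":")) + "\n")
ndjson = buf.getvalue()
print(ndjson, end="")

# Read: each line parses independently, so records can be streamed.
parsed = [json.loads(line) for line in ndjson.splitlines()]
```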
I want to note that there is a standard way to write a sequence of separate records in JSON, described in RFC 7464. For each record: write the record separator byte (0x1E), then the JSON document, then the line feed byte (0x0A).
(Note that the JSON text sequence format, as it's called, uses a more liberal syntax for parsing text sequences of this kind; see the RFC for details.)
In your example, the JSON text sequence would look as follows, where \x1E and \x0A are the record separator and line feed bytes, respectively:
\x1E{"col1":"val1","col2":"val2"}\x0A\x1E{"col1":"val1","col2":"val2"}\x0A
Since the JSON text sequence format allows inner line breaks, you can write each JSON record as you naturally would, as in the following example:
\x1E{
"col1":"val1",
"col2":"val2"}
\x0A\x1E{
"col1":"val1",
"col2":"val2"
}\x0A
Notice that the media type for JSON text sequences is not application/json but application/json-seq; see the RFC.