Slow Postgres query with jsonb and group by
We have a Postgres table that stores document analyses as jsonb. The table currently holds about 100k rows, and I am trying to query the results - unfortunately the query is much slower than expected (> 4 seconds). Caching and/or materialized views are not a solution here IMO, because users apply a bunch of filters (e.g. date/category), so every query is slightly different.
Here is the query...
SELECT r.key AS key,
       SUM((r.value->>'counter')::int) AS count_key,
       COUNT(*) AS count_documents,
       json_agg(json_build_object(
           'date', data_reportfile.date::date,
           'count_key', (r.value->>'counter')::int)
       ) AS dates
FROM data_reportfile,
     jsonb_each(analysis_result->'results') r
WHERE data_reportfile.date >= '1960-1-1'
  AND data_reportfile.analysis_done IS TRUE
  AND r.value->>'category' = 'general'
GROUP BY r.key
ORDER BY count_documents DESC
LIMIT 20;
And here is the output of EXPLAIN ANALYZE...
Limit  (cost=42442.85..42442.86 rows=1 width=80) (actual time=4338.407..4343.240 rows=20 loops=1)
  ->  Sort  (cost=42442.85..42442.86 rows=1 width=80) (actual time=4338.406..4338.413 rows=20 loops=1)
        Sort Key: (count(*)) DESC
        Sort Method: top-N heapsort  Memory: 10002kB
        ->  HashAggregate  (cost=42442.83..42442.84 rows=1 width=80) (actual time=4324.406..4332.789 rows=2704 loops=1)
              Group Key: r.key
              ->  Gather  (cost=1000.00..41071.38 rows=49871 width=68) (actual time=0.699..759.060 rows=911509 loops=1)
                    Workers Planned: 3
                    Workers Launched: 3
                    ->  Nested Loop  (cost=0.01..35084.28 rows=16087 width=68) (actual time=0.421..2317.619 rows=227877 loops=4)
                          ->  Parallel Seq Scan on data_reportfile  (cost=0.00..10792.90 rows=16087 width=291) (actual time=0.016..25.494 rows=12461 loops=4)
                                Filter: ((analysis_done IS TRUE) AND (date >= '1960-01-01'::date))
                                Rows Removed by Filter: 3901
                          ->  Function Scan on jsonb_each r  (cost=0.01..1.50 rows=1 width=64) (actual time=0.171..0.179 rows=18 loops=49843)
                                Filter: ((value ->> 'category'::text) = 'general'::text)
                                Rows Removed by Filter: 8
Planning time: 0.239 ms
Execution time: 4353.262 ms
I'm not entirely sure, but it looks like the grouping is what's expensive. I suspect indexes won't help much here (an index on date is already defined). I have also tweaked some Postgres settings (work_mem, shared_buffers) without any noticeable effect.
Any idea what I could try next? Or do I just have to live with the slow query, given what I'm trying to achieve?
Update 1 (sample data):
Here is an example of the analysis_result column.
{
"country": "FRA",
"docInfo": {},
"results": {
"FRA": {
"counter": 6,
"category": "geographic",
"positions": [
"26, 21, 65, 71",
"28, 23, 58, 64",
"93, 68, 9, 15",
"106, 79, 160, 166",
"158, 117, 10, 16",
"158, 117, 47, 53"
],
"sentences": [],
"sentiment": [
"0.0, 0.902, 0.098, 0.4404",
"0.0, 1.0, 0.0, 0.0",
"0.041, 0.959, 0.0, -0.128",
"0.047, 0.799, 0.154, 0.5563",
"0.0, 1.0, 0.0, 0.0"
]
},
"Debt": {
"counter": 2,
"category": "general",
"positions": [
"161, 119, 15, 19",
"166, 121, 15, 19"
],
"sentences": [],
"sentiment": [
"0.172, 0.828, 0.0, -0.3612",
"0.179, 0.619, 0.203, 0.1779"
]
}
},
"docPages": 12,
"language": "en",
"counter_words": 1382,
"counter_tokens": 3591,
"counter_sentences_final": 123,
"counter_sentences_total": 169
}
Update 2 (desired result):
This is what I'm trying to get - basically a list of keys with a counter per date (e.g. per year).
[
{
"key": "Risk",
"count_key": 283522,
"count_documents": 22298,
"dates": [
{
"date": "2021",
"count_key": 228615
},
{
"date": "2020",
"count_key": 4691
}
]
},
{
"key": "Debt",
"count_key": 283522,
"count_documents": 22298,
"dates": [
{
"date": "2021",
"count_key": 228615
},
{
"date": "2020",
"count_key": 4691
}
]
}
]
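Note that the query in the question aggregates raw dates, while the result above is rolled up per year. One way to get that shape is a two-level aggregation along these lines (a sketch only - table and column names are taken from the question and it has not been tested against the real schema):

```sql
-- Sketch: roll counters up per year first, then aggregate per key.
-- Assumes the same data_reportfile table and analysis_result layout
-- as in the question.
SELECT per_year.key,
       SUM(per_year.count_key)       AS count_key,
       SUM(per_year.count_documents) AS count_documents,
       json_agg(json_build_object(
           'date', per_year.year,
           'count_key', per_year.count_key)
       ) AS dates
FROM (
    SELECT r.key,
           date_trunc('year', data_reportfile.date)::date AS year,
           SUM((r.value->>'counter')::int) AS count_key,
           COUNT(*)                        AS count_documents
    FROM data_reportfile,
         jsonb_each(analysis_result->'results') r
    WHERE data_reportfile.date >= '1960-01-01'
      AND data_reportfile.analysis_done IS TRUE
      AND r.value->>'category' = 'general'
    GROUP BY r.key, date_trunc('year', data_reportfile.date)
) AS per_year
GROUP BY per_year.key
ORDER BY count_documents DESC
LIMIT 20;
```

The inner GROUP BY shrinks the row count before json_agg runs, which also keeps the aggregated arrays small (one element per year instead of one per document).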
Yes, computing a giant aggregate keyed to 2,704 values (built from 911,509 rows), only to throw away all but 20 of them, takes a lot of work. Indexes won't help, since you are processing almost all of the data anyway.
You say every query is different because of the "bunch of filters (e.g. date/category)", but your sample data appears to contain only one meaningful category, and what can you actually vary about the dates, other than the json_agg, which has to be recomputed for each distinct filter anyway?
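If the category filter really is the only jsonb-level filter in practice, one option (a common pattern, not something stated in the question) is to flatten the jsonb pairs once into a plain side table and aggregate from that. The table and column names below are illustrative, and the sketch assumes data_reportfile has an id primary key:

```sql
-- Sketch: pre-extract the jsonb pairs into a flat table so each
-- user query is a plain GROUP BY over indexed scalar columns.
-- reportfile_keys and the id column are assumptions, not from the question.
CREATE TABLE reportfile_keys AS
SELECT f.id                       AS reportfile_id,
       f.date,
       r.key,
       r.value->>'category'       AS category,
       (r.value->>'counter')::int AS counter
FROM data_reportfile f,
     jsonb_each(f.analysis_result->'results') r
WHERE f.analysis_done IS TRUE;

CREATE INDEX ON reportfile_keys (category, date);

-- The expensive part of the original query then becomes:
SELECT key,
       SUM(counter) AS count_key,
       COUNT(*)     AS count_documents
FROM reportfile_keys
WHERE date >= '1960-01-01'
  AND category = 'general'
GROUP BY key
ORDER BY count_documents DESC
LIMIT 20;
```

This trades write-time work for read-time speed: the side table has to be kept in sync (e.g. refilled whenever new analyses land, or maintained by a trigger), but the per-query cost drops because jsonb_each no longer runs for every row on every request.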