Slow Postgres query with jsonb and group by

We have a Postgres table that stores document analyses as jsonb. The table currently holds about 100k entries, and I am trying to query the results. Unfortunately, the query is much slower than expected (> 4 seconds). Caching and/or materialized views are not a solution IMO, because users have a bunch of filters (e.g. date/category), so every query will be slightly different.

Here is the query...

SELECT r.key AS key,
       SUM((r.value->>'counter')::int) AS count_key,
       COUNT(*) AS count_documents,
       json_agg(json_build_object(
           'date', data_reportfile.date::date,
           'count_key', (r.value->>'counter')::int )
       ) AS dates
  FROM data_reportfile,
       jsonb_each(analysis_result->'results') r
 WHERE data_reportfile.date >= '1960-1-1'
   AND data_reportfile.analysis_done IS TRUE
   AND r.value->>'category' = 'general'
 GROUP BY r.key
 ORDER BY count_documents DESC
 LIMIT 20;

And here is the output of EXPLAIN ANALYZE...

  Limit  (cost=42442.85..42442.86 rows=1 width=80) (actual time=4338.407..4343.240 rows=20 loops=1)
   ->  Sort  (cost=42442.85..42442.86 rows=1 width=80) (actual time=4338.406..4338.413 rows=20 loops=1)
         Sort Key: (count(*)) DESC
         Sort Method: top-N heapsort  Memory: 10002kB
         ->  HashAggregate  (cost=42442.83..42442.84 rows=1 width=80) (actual time=4324.406..4332.789 rows=2704 loops=1)
               Group Key: r.key
               ->  Gather  (cost=1000.00..41071.38 rows=49871 width=68) (actual time=0.699..759.060 rows=911509 loops=1)
                     Workers Planned: 3
                     Workers Launched: 3
                     ->  Nested Loop  (cost=0.01..35084.28 rows=16087 width=68) (actual time=0.421..2317.619 rows=227877 loops=4)
                           ->  Parallel Seq Scan on data_reportfile  (cost=0.00..10792.90 rows=16087 width=291) (actual time=0.016..25.494 rows=12461 loops=4)
                                 Filter: ((analysis_done IS TRUE) AND (date >= '1960-01-01'::date))
                                 Rows Removed by Filter: 3901
                           ->  Function Scan on jsonb_each r  (cost=0.01..1.50 rows=1 width=64) (actual time=0.171..0.179 rows=18 loops=49843)
                                 Filter: ((value ->> 'category'::text) = 'general'::text)
                                 Rows Removed by Filter: 8
 Planning time: 0.239 ms
 Execution time: 4353.262 ms

I am not quite sure, but it seems the grouping is very expensive. I suspect an index would not help much here (an index on date is already defined). I have already tweaked some Postgres settings (work_mem, shared_buffers) without any noticeable effect.
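
One quick sanity check (a minimal sketch; '256MB' is an arbitrary example value, not a recommendation): work_mem can be raised for a single session, and EXPLAIN (ANALYZE, BUFFERS) will show whether the aggregate or sort still spills to temporary files:

-- Session-local override; reverts when the connection closes.
SET work_mem = '256MB';

-- Then re-run the statement above as:
--   EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;
-- "Sort Method: external merge" or temp block counts in the output
-- mean the setting is still too low for this query.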

Any idea what I could try now? Or do I simply have to live with the slow query, given what I am trying to achieve?

Update 1 (sample data):

Here is an example of the analysis_result column.

{
  "country": "FRA",
  "docInfo": {},
  "results": {
    "FRA": {
      "counter": 6,
      "category": "geographic",
      "positions": [
        "26, 21, 65, 71",
        "28, 23, 58, 64",
        "93, 68, 9, 15",
        "106, 79, 160, 166",
        "158, 117, 10, 16",
        "158, 117, 47, 53"
      ],
      "sentences": [],
      "sentiment": [
        "0.0, 0.902, 0.098, 0.4404",
        "0.0, 1.0, 0.0, 0.0",
        "0.041, 0.959, 0.0, -0.128",
        "0.047, 0.799, 0.154, 0.5563",
        "0.0, 1.0, 0.0, 0.0"
      ]
    },
    "Debt": {
      "counter": 2,
      "category": "general",
      "positions": [
        "161, 119, 15, 19",
        "166, 121, 15, 19"
      ],
      "sentences": [],
      "sentiment": [
        "0.172, 0.828, 0.0, -0.3612",
        "0.179, 0.619, 0.203, 0.1779"
      ]
    }
  },
  "docPages": 12,
  "language": "en",
  "counter_words": 1382,
  "counter_tokens": 3591,
  "counter_sentences_final": 123,
  "counter_sentences_total": 169
}
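
This also explains the fan-out in the plan above (49843 scanned table rows becoming 911509): jsonb_each emits one row per key under results. A minimal, self-contained illustration against a trimmed copy of this sample:

-- Each key under 'results' becomes its own row.
SELECT r.key,
       r.value->>'category'       AS category,
       (r.value->>'counter')::int AS counter
  FROM jsonb_each(
         '{"FRA":  {"counter": 6, "category": "geographic"},
           "Debt": {"counter": 2, "category": "general"}}'::jsonb
       ) AS r;

--  key  |  category  | counter
-- ------+------------+---------
--  FRA  | geographic |       6
--  Debt | general    |       2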

Update 2 (result):

This is what I am trying to get... it is basically a list of keywords with a counter per date (e.g. per year).

[
  {
    "key": "Risk",
    "count_key": 283522,
    "count_documents": 22298,
    "dates": [
      {
          "date": "2021",
          "count_key": 228615
      },
      {
          "date": "2020",
          "count_key": 4691
      }
    ]
  },
  {
    "key": "Debt",
    "count_key": 283522,
    "count_documents": 22298,
    "dates": [
      {
          "date": "2021",
          "count_key": 228615
      },
      {
          "date": "2020",
          "count_key": 4691
      }
    ]
  }
]

Yes, computing a giant aggregate keyed on 2704 values (from 911509 rows), only to throw away all but 20 of them, takes a lot of work. Indexes won't help, because you are processing almost all of the data anyway.

You say every query will be different because of "a bunch of filters (e.g. date/category)", but there seems to be only one meaningful category in your sample data, and what would you be doing with the dates that requires the json_agg to be recomputed for each different query?
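
If that is the case, one direction consistent with this answer is to flatten the jsonb once, so the interactive filters only ever touch plain, indexable columns. A sketch under assumptions (the name report_keys is made up, and the view would need a REFRESH MATERIALIZED VIEW whenever new analyses arrive):

-- One row per (document, keyword); jsonb is parsed only at refresh time.
CREATE MATERIALIZED VIEW report_keys AS
SELECT data_reportfile.date::date    AS date,
       r.key                         AS key,
       r.value->>'category'          AS category,
       (r.value->>'counter')::int    AS counter
  FROM data_reportfile,
       jsonb_each(analysis_result->'results') r
 WHERE data_reportfile.analysis_done IS TRUE;

CREATE INDEX ON report_keys (category, date);

-- The per-request query then aggregates plain columns:
SELECT key,
       SUM(counter) AS count_key,
       COUNT(*)     AS count_documents,
       json_agg(json_build_object('date', date, 'count_key', counter)) AS dates
  FROM report_keys
 WHERE category = 'general'
   AND date >= '1960-01-01'
 GROUP BY key
 ORDER BY count_documents DESC
 LIMIT 20;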
