Improve speed of complex postgres query in rails app
I have a view in my application that visualizes a lot of data, and in the backend the data is generated using this query:
DataPoint Load (20394.8ms)
SELECT communities.id as com,
consumers.name as con,
array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
INNER JOIN "clusterings" ON "clusterings"."id" = "communities"."clustering_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
AND "data_points"."interval_id" = $3
AND "clusterings"."id" = 1
GROUP BY communities.id, consumers.id
[["timestamp", "2015-11-20 09:23:00"], ["timestamp", "2015-11-27 09:23:00"], ["interval_id", 2]]
The query takes about 20 seconds to execute, which seems excessive.
The query is generated from the following code:
res = {}
DataPoint.joins(consumer: { communities: :clustering })
         .where('clusterings.id': self,
                timestamp: chart_cookies[:start_date]..chart_cookies[:end_date],
                interval_id: chart_cookies[:interval_id])
         .group('communities.id')
         .group('consumers.id')
         .select('communities.id as com, consumers.name as con',
                 'array_agg(timestamp ORDER BY data_points.timestamp asc) as tims',
                 'array_agg(consumption ORDER BY data_points.timestamp ASC) as cons')
         .each do |d|
  res[d.com] ||= {}
  res[d.com][d.con] = d.tims.zip(d.cons)
  res[d.com]["aggregate"] ||= d.tims.map { |t| [t, 0] }
  res[d.com]["aggregate"] = res[d.com]["aggregate"].zip(d.cons).map { |(a, b), c| [a, b + c] }
end
res
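For what it's worth, the in-memory aggregation step can be exercised on its own with plain hashes standing in for the query rows (the keys mirror the `com`/`con`/`tims`/`cons` aliases from the SELECT; this is a sketch, not the actual ActiveRecord rows). The inner block variable is renamed here to avoid shadowing the row variable `d`:

```ruby
# Stand-ins for two rows of one community: each row carries parallel
# timestamp/consumption arrays, as produced by the array_agg calls.
rows = [
  { com: 1, con: "a", tims: ["t1", "t2"], cons: [1.0, 2.0] },
  { com: 1, con: "b", tims: ["t1", "t2"], cons: [3.0, 4.0] },
]

res = {}
rows.each do |d|
  res[d[:com]] ||= {}
  # Per-consumer series: pairs of [timestamp, consumption]
  res[d[:com]][d[:con]] = d[:tims].zip(d[:cons])
  # Running community-wide sum, seeded with zeros on the first row
  res[d[:com]]["aggregate"] ||= d[:tims].map { |t| [t, 0] }
  res[d[:com]]["aggregate"] =
    res[d[:com]]["aggregate"].zip(d[:cons]).map { |(t, sum), c| [t, sum + c] }
end
# res[1]["aggregate"] is now [["t1", 4.0], ["t2", 6.0]]
```

One caveat the original code inherits: the zip assumes every consumer in a community yields the same timestamps in the same order, which holds only because both array_agg calls share the same ORDER BY.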
The relevant database models are the following:
create_table "data_points", force: :cascade do |t|
t.bigint "consumer_id"
t.bigint "interval_id"
t.datetime "timestamp"
t.float "consumption"
t.float "flexibility"
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.index ["consumer_id"], name: "index_data_points_on_consumer_id"
t.index ["interval_id"], name: "index_data_points_on_interval_id"
t.index ["timestamp", "consumer_id", "interval_id"], name: "index_data_points_on_timestamp_and_consumer_id_and_interval_id", unique: true
t.index ["timestamp"], name: "index_data_points_on_timestamp"
end
create_table "consumers", force: :cascade do |t|
t.string "name"
t.string "location"
t.string "edms_id"
t.bigint "building_type_id"
t.bigint "connection_type_id"
t.float "location_x"
t.float "location_y"
t.string "feeder_id"
t.bigint "consumer_category_id"
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.index ["building_type_id"], name: "index_consumers_on_building_type_id"
t.index ["connection_type_id"], name: "index_consumers_on_connection_type_id"
t.index ["consumer_category_id"], name: "index_consumers_on_consumer_category_id"
end
create_table "communities_consumers", id: false, force: :cascade do |t|
t.bigint "consumer_id", null: false
t.bigint "community_id", null: false
t.index ["community_id", "consumer_id"], name: "index_communities_consumers_on_community_id_and_consumer_id"
t.index ["consumer_id", "community_id"], name: "index_communities_consumers_on_consumer_id_and_community_id"
end
create_table "communities", force: :cascade do |t|
t.string "name"
t.text "description"
t.bigint "clustering_id"
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.index ["clustering_id"], name: "index_communities_on_clustering_id"
end
create_table "clusterings", force: :cascade do |t|
t.string "name"
t.text "description"
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
end
How can I make the query execute faster? Is it possible to refactor the query to simplify it, or to add some extra index to the database schema, so that it takes less time?
Interestingly, a slightly simplified version of the query, which I use in another view, runs much faster: only 1161.4ms for the first request, and 41.6ms for the following requests:
DataPoint Load (1161.4ms)
SELECT consumers.name as con,
array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
AND "data_points"."interval_id" = $3
AND "communities"."id" = 100 GROUP BY communities.id, consumers.name
[["timestamp", "2015-11-20 09:23:00"], ["timestamp", "2015-11-27 09:23:00"], ["interval_id", 2]]
Using the command EXPLAIN (ANALYZE, BUFFERS) with the query in dbconsole, I get the following output:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=12.31..7440.69 rows=246 width=57) (actual time=44.139..20474.015 rows=296 loops=1)
Group Key: communities.id, consumers.id
Buffers: shared hit=159692 read=6148105 written=209
-> Nested Loop (cost=12.31..7434.54 rows=246 width=57) (actual time=20.944..20436.806 rows=49728 loops=1)
Buffers: shared hit=159685 read=6148105 written=209
-> Nested Loop (cost=11.88..49.30 rows=1 width=49) (actual time=0.102..6.374 rows=296 loops=1)
Buffers: shared hit=988 read=208
-> Nested Loop (cost=11.73..41.12 rows=1 width=57) (actual time=0.084..4.443 rows=296 loops=1)
Buffers: shared hit=396 read=208
-> Merge Join (cost=11.58..40.78 rows=1 width=24) (actual time=0.075..1.365 rows=296 loops=1)
Merge Cond: (communities_consumers.community_id = communities.id)
Buffers: shared hit=5 read=7
-> Index Only Scan using index_communities_consumers_on_community_id_and_consumer_id on communities_consumers (cost=0.27..28.71 rows=296 width=16) (actual time=0.039..0.446 rows=296 loops=1)
Heap Fetches: 4
Buffers: shared hit=1 read=6
-> Sort (cost=11.31..11.31 rows=3 width=16) (actual time=0.034..0.213 rows=247 loops=1)
Sort Key: communities.id
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=4 read=1
-> Bitmap Heap Scan on communities (cost=4.17..11.28 rows=3 width=16) (actual time=0.026..0.027 rows=6 loops=1)
Recheck Cond: (clustering_id = 1)
Heap Blocks: exact=1
Buffers: shared hit=4 read=1
-> Bitmap Index Scan on index_communities_on_clustering_id (cost=0.00..4.17 rows=3 width=0) (actual time=0.020..0.020 rows=8 loops=1)
Index Cond: (clustering_id = 1)
Buffers: shared hit=3 read=1
-> Index Scan using consumers_pkey on consumers (cost=0.15..0.33 rows=1 width=33) (actual time=0.007..0.008 rows=1 loops=296)
Index Cond: (id = communities_consumers.consumer_id)
Buffers: shared hit=391 read=201
-> Index Only Scan using clusterings_pkey on clusterings (cost=0.15..8.17 rows=1 width=8) (actual time=0.004..0.005 rows=1 loops=296)
Index Cond: (id = 1)
Heap Fetches: 296
Buffers: shared hit=592
-> Index Scan using index_data_points_on_consumer_id on data_points (cost=0.44..7383.44 rows=180 width=24) (actual time=56.128..68.995 rows=168 loops=296)
Index Cond: (consumer_id = consumers.id)
Filter: (("timestamp" >= '2015-11-20 09:23:00'::timestamp without time zone) AND ("timestamp" <= '2015-11-27 09:23:00'::timestamp without time zone) AND (interval_id = 2))
Rows Removed by Filter: 76610
Buffers: shared hit=158697 read=6147897 written=209
Planning time: 1.811 ms
Execution time: 20474.330 ms
(40 rows)
The bullet gem returns the following warnings:
USE eager loading detected
Community => [:communities_consumers]
Add to your finder: :includes => [:communities_consumers]
USE eager loading detected
Community => [:consumers]
Add to your finder: :includes => [:consumers]
After removing the join with the clusterings table, the new query plan is the following:
EXPLAIN for: SELECT communities.id as com, consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 AND "communities"."clustering_id" = 1 GROUP BY communities.id, consumers.id [["timestamp", "2015-11-29 20:52:30.926247"], ["timestamp", "2015-12-06 20:52:30.926468"], ["interval_id", 2]]
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=10839.79..10846.42 rows=241 width=57)
-> Sort (cost=10839.79..10840.39 rows=241 width=57)
Sort Key: communities.id, consumers.id
-> Nested Loop (cost=7643.11..10830.26 rows=241 width=57)
-> Nested Loop (cost=11.47..22.79 rows=1 width=49)
-> Hash Join (cost=11.32..17.40 rows=1 width=16)
Hash Cond: (communities_consumers.community_id = communities.id)
-> Seq Scan on communities_consumers (cost=0.00..4.96 rows=296 width=16)
-> Hash (cost=11.28..11.28 rows=3 width=8)
-> Bitmap Heap Scan on communities (cost=4.17..11.28 rows=3 width=8)
Recheck Cond: (clustering_id = 1)
-> Bitmap Index Scan on index_communities_on_clustering_id (cost=0.00..4.17 rows=3 width=0)
Index Cond: (clustering_id = 1)
-> Index Scan using consumers_pkey on consumers (cost=0.15..5.38 rows=1 width=33)
Index Cond: (id = communities_consumers.consumer_id)
-> Bitmap Heap Scan on data_points (cost=7631.64..10805.72 rows=174 width=24)
Recheck Cond: ((consumer_id = consumers.id) AND ("timestamp" >= '2015-11-29 20:52:30.926247'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926468'::timestamp without time zone))
Filter: (interval_id = 2::bigint)
-> BitmapAnd (cost=7631.64..7631.64 rows=861 width=0)
-> Bitmap Index Scan on index_data_points_on_consumer_id (cost=0.00..1589.92 rows=76778 width=0)
Index Cond: (consumer_id = consumers.id)
-> Bitmap Index Scan on index_data_points_on_timestamp (cost=0.00..6028.58 rows=254814 width=0)
Index Cond: (("timestamp" >= '2015-11-29 20:52:30.926247'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926468'::timestamp without time zone))
(23 rows)
As requested in the comments, these are the query plans for the simplified query, with and without the restriction on communities.id:
DataPoint Load (1563.3ms) SELECT consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 GROUP BY communities.id, consumers.name [["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]
EXPLAIN for: SELECT consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 GROUP BY communities.id, consumers.name [["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=140992.34..142405.51 rows=51388 width=49)
-> Sort (cost=140992.34..141120.81 rows=51388 width=49)
Sort Key: communities.id, consumers.name
-> Hash Join (cost=10135.44..135214.45 rows=51388 width=49)
Hash Cond: (data_points.consumer_id = consumers.id)
-> Bitmap Heap Scan on data_points (cost=10082.58..134455.00 rows=51388 width=24)
Recheck Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
-> Bitmap Index Scan on index_data_points_on_timestamp_and_consumer_id_and_interval_id (cost=0.00..10069.74 rows=51388 width=0)
Index Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
-> Hash (cost=49.16..49.16 rows=296 width=49)
-> Hash Join (cost=33.06..49.16 rows=296 width=49)
Hash Cond: (communities_consumers.community_id = communities.id)
-> Hash Join (cost=8.66..20.69 rows=296 width=49)
Hash Cond: (consumers.id = communities_consumers.consumer_id)
-> Seq Scan on consumers (cost=0.00..7.96 rows=296 width=33)
-> Hash (cost=4.96..4.96 rows=296 width=16)
-> Seq Scan on communities_consumers (cost=0.00..4.96 rows=296 width=16)
-> Hash (cost=16.40..16.40 rows=640 width=8)
-> Seq Scan on communities (cost=0.00..16.40 rows=640 width=8)
(19 rows)
and
DataPoint Load (1479.0ms) SELECT consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 GROUP BY communities.id, consumers.name [["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]
EXPLAIN for: SELECT consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 GROUP BY communities.id, consumers.name [["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=140992.34..142405.51 rows=51388 width=49)
-> Sort (cost=140992.34..141120.81 rows=51388 width=49)
Sort Key: communities.id, consumers.name
-> Hash Join (cost=10135.44..135214.45 rows=51388 width=49)
Hash Cond: (data_points.consumer_id = consumers.id)
-> Bitmap Heap Scan on data_points (cost=10082.58..134455.00 rows=51388 width=24)
Recheck Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
-> Bitmap Index Scan on index_data_points_on_timestamp_and_consumer_id_and_interval_id (cost=0.00..10069.74 rows=51388 width=0)
Index Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
-> Hash (cost=49.16..49.16 rows=296 width=49)
-> Hash Join (cost=33.06..49.16 rows=296 width=49)
Hash Cond: (communities_consumers.community_id = communities.id)
-> Hash Join (cost=8.66..20.69 rows=296 width=49)
Hash Cond: (consumers.id = communities_consumers.consumer_id)
-> Seq Scan on consumers (cost=0.00..7.96 rows=296 width=33)
-> Hash (cost=4.96..4.96 rows=296 width=16)
-> Seq Scan on communities_consumers (cost=0.00..4.96 rows=296 width=16)
-> Hash (cost=16.40..16.40 rows=640 width=8)
-> Seq Scan on communities (cost=0.00..16.40 rows=640 width=8)
(19 rows)
Have you tried adding an index on "data_points"."timestamp" + "data_points"."consumer_id"? Or one on "data_points"."consumer_id" only?
What version of Postgres are you using? Postgres 10 introduced native table partitioning. If your "data_points" table is very large, that could speed up your query significantly, since you are filtering on a time range:
WHERE (data_points.TIMESTAMP BETWEEN $1 AND $2)
One strategy you could look into is adding partitions on the DATE value of the "timestamp" field, and then modifying your query to include an extra filter so that partition pruning kicks in:
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
AND (CAST("data_points"."timestamp" AS DATE) BETWEEN CAST($1 AS DATE) AND CAST($2 AS DATE))
AND "data_points"."interval_id" = $3
AND "communities"."clustering_id" = 1
If your "data_points" table is very large and the "timestamp" filter range is small, this should help, because it quickly eliminates whole chunks of rows that never need to be processed.
I have not done this in Postgres myself, so I am not sure how feasible or helpful it is, but it is something worth researching :)
https://www.postgresql.org/docs/10/static/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE
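A minimal sketch of the declarative range partitioning suggested above (Postgres 10+ syntax; the table name, columns shown, and partition bounds are illustrative, not taken from the original schema):

```sql
-- Parent table partitioned by range on "timestamp" (illustrative columns only)
CREATE TABLE data_points_partitioned (
  id           bigserial,
  consumer_id  bigint,
  interval_id  bigint,
  "timestamp"  timestamp,
  consumption  float
) PARTITION BY RANGE ("timestamp");

-- One partition per month; a query constrained to one week then only
-- touches one or two partitions instead of the whole table.
CREATE TABLE data_points_2015_11 PARTITION OF data_points_partitioned
  FOR VALUES FROM ('2015-11-01') TO ('2015-12-01');
CREATE TABLE data_points_2015_12 PARTITION OF data_points_partitioned
  FOR VALUES FROM ('2015-12-01') TO ('2016-01-01');
```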
Do you have a foreign key on clusterings_id? Also, try changing your condition to:
SELECT communities.id as com,
consumers.name as con,
array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
AND "data_points"."interval_id" = $3
AND "communities"."clustering_id" = 1
GROUP BY communities.id, consumers.id
You don't need the join with clusterings, so try removing it from the query and replacing it with communities.clustering_id = 1 instead. That should remove three steps from your query plan. It may also save you the most time, since the plan runs index scans on it inside three nested loops.
You could also try changing the way you aggregate the timestamp values. I assume you don't need them at one-second granularity?
I would also drop the "index_data_points_on_timestamp" index, since you already have the composite index. The single-column one is close to useless, and dropping it should improve your write performance.
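A SQL sketch of these suggestions (the new index name is made up for illustration; verify the effect with EXPLAIN (ANALYZE, BUFFERS) before and after):

```sql
-- Composite index matching the WHERE clause of the slow query:
-- equality columns (consumer_id, interval_id) first, range column last.
CREATE INDEX index_data_points_on_consumer_interval_timestamp
  ON data_points (consumer_id, interval_id, "timestamp");

-- The single-column timestamp index then becomes redundant for this workload.
DROP INDEX index_data_points_on_timestamp;
```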
The index on data_points.timestamp is not being used, possibly because of the ::timestamp cast.
I wonder whether changing the column's data type, or creating a functional index, would help.
Edit - on reflection, the datetime in your CREATE TABLE is probably just how Rails chooses to display the Postgres timestamp data type, so there may be no cast happening at all.
Still, the index on timestamp is not used; depending on your data distribution, though, that may be a perfectly sensible choice by the optimizer.
So here we have Postgres 9.3 and a slow query. Before tuning the query itself, make sure the database has settings appropriate for your read/write mix and your disk type (SSD or spinning disk), that autovacuum is not switched off, that your tables and indexes are checked for bloat, and that your indexes have good selectivity, so the planner can build an optimal plan.
Also check the column types and row sizes: changing a column's type can shrink the table and reduce query time.
Once all of that is in order, think about how Postgres executes the query and how to reduce its work. An ORM is fine for simple queries, but for complex ones you should drop down to raw SQL (for example with find_by_sql) and keep it in Query Service Objects.
Write the SQL as simply as possible; Postgres also spends time parsing the query.
Check the indexes on all joined columns, and use explain analyze to verify that you are getting the best scan methods.
Next point: you are doing 4 joins! Postgres has to search for the best plan among 4! (4 factorial) join orderings. Consider using a subquery, or a predefined table, for this selection.
1) Use a separate query or function for the 4 joins (try a subquery):
SELECT *
FROM "data_points" as predefined
INNER JOIN "consumers"
ON "consumers"."id" ="data_points"."consumer_id"
INNER JOIN "communities_consumers"
ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities"
ON "communities"."id" = "communities_consumers"."community_id"
INNER JOIN "clusterings"
ON "clusterings"."id" = "communities"."clustering_id"
WHERE "data_points"."interval_id" = 2
AND "clusterings"."id" = 1
2) Next step (don't use variables, just pass the values through):
SELECT *
FROM predefined
WHERE predefined."timestamp"
BETWEEN '2015-11-20 09:23:00'
AND '2015-11-27 09:23:00'
3) You reference data_points three times in the query; you need fewer:
array_agg(timestamp ORDER BY data_points.timestamp asc) as tims
array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
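One way to reduce the repeated references to data_points."timestamp" is to filter and sort the rows in a subquery first, then aggregate without a per-aggregate ORDER BY. A sketch only: array_agg preserving the subquery's row order works in practice in Postgres but is not guaranteed by the SQL standard, so measure and verify before relying on it:

```sql
SELECT communities.id AS com,
       consumers.name AS con,
       array_agg(dp."timestamp") AS tims,
       array_agg(dp.consumption) AS cons
FROM (SELECT consumer_id, "timestamp", consumption
      FROM data_points
      WHERE "timestamp" BETWEEN $1 AND $2
        AND interval_id = $3
      ORDER BY "timestamp") dp
INNER JOIN consumers ON consumers.id = dp.consumer_id
INNER JOIN communities_consumers ON communities_consumers.consumer_id = consumers.id
INNER JOIN communities ON communities.id = communities_consumers.community_id
WHERE communities.clustering_id = 1
GROUP BY communities.id, consumers.id
```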
Remember: with a slow query it is never just about the query itself; it is also about your settings, your ORM usage, the SQL you write, and how Postgres executes it.