
How can I optimise this complex PostgreSQL query

I have been given a complex PostgreSQL query and asked to optimise it. I've made some progress in establishing where the problem might be; however, I'm running out of ideas for how to approach optimising it. I'm hoping for some pointers/suggestions.

This query is part of an "AI Chatbot" system; it returns the unique conversations had with the chatbot, with a few filters applied. The data returned looks like:

conversationId: Id of the conversation
userId: Id of the user
userName: Name of the user
channel: Channel of the user (from dim_user)
eventCount: Number of events that occurred during the conversation
operator: If true, the conversation was with an "admin" or in-house person
firstMessageDate: Date of the first message in the conversation
lastMessageDate: Date of the last message in the conversation
totalUniqueUsers: Total number of unique user conversations

The problem with the query seems to be related to a specific inner join clause, joining on a large table (10,540,420 rows).

This is the query with the inner join commented out; in this form it runs in around 2 seconds.

select 
  "conversation_table"."conversationid" as "conversationId", 
  "du"."userid" as "userId", 
  "du"."username" as "userName", 
  "du"."channel" as "channel", 
  COALESCE(user_events.count, 0):: integer as "eventCount", 
  BOOL_OR(dc.operator_messages > 0) as "operator", 
  DATE_PART(
    'epoch', dc.first_message_timestamp
  )* 1000 as "firstMessageDate", 
  DATE_PART(
    'epoch', dc.last_message_timestamp
  )* 1000 as "lastMessageDate", 
  COUNT(*) OVER():: integer as "totalUniqueUsers" 
from 
  "fact_conversation_data" as "conversation_table" 
  inner join "dim_user" as "du" on "conversation_table"."dim_user_user_id" = "du"."user_id" 
  inner join "dim_time" as "dt" on "conversation_table"."dim_time_time_id" = "dt"."time_id" 
  inner join "dim_conversation" as "dc" on "conversation_table"."conversationid" = "dc"."conversationid" 
--   inner join (
--     select 
--       "conversationid" 
--     from 
--       "fact_milestone_event" as fme
--     where 
--       fme.dim_segment_segment_id in ('20736b82-4515-411f-9bc8-cf4d84ad69ac')
--       group by "conversationid"
--   ) as "fme" on "fme"."conversationid" = "conversation_table"."conversationid" 
  left join (
    select 
      "fme"."conversationid", 
      count("fme"."event_id") 
    from 
      "fact_milestone_event" as "fme" 
    where 
      "fme"."timestamp" >= '2022-12-18 11:00:00.000' 
      and "fme"."timestamp" <= '2023-01-18 10:59:59.999' 
      and "fme"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9' 
    group by 
      "fme"."conversationid"
  ) as "user_events" on "conversation_table"."conversationid" = "user_events"."conversationid" 
where 
  "conversation_table"."timestamp" >= '2022-12-18 11:00:00.000' 
  and "conversation_table"."timestamp" <= '2023-01-18 10:59:59.999' 
  and "dt"."bot_zone" = 'Pacific/Auckland' 
  and "du"."is_platform_user" <> true 
  and "conversation_table"."conversationid" in (
    select 
      "notEmptyConservation_table"."conversationid" 
    from 
      (
        select 
          sum("user_messages") as "total_user_message_count", 
          "conversationid" 
        from 
          "fact_conversation_data" 
        where 
          "fact_conversation_data"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9' 
          and "fact_conversation_data"."timestamp" >= '2022-12-18 11:00:00.000' 
          and "fact_conversation_data"."timestamp" <= '2023-01-18 10:59:59.999' 
        group by 
          "conversationid"
      ) as "notEmptyConservation_table" 
    where 
      "notEmptyConservation_table"."total_user_message_count" > 1
  ) 
  and "conversation_table"."dim_tenant_tenant_id" = '4621ed8f-d8a4-46e2-a8de-5710751b16b9' 
  and (
    "conversation_table"."user_messages" > 0
  ) 
group by 
  "user_events"."count", 
  "conversation_table"."conversationid", 
  "du"."userid", 
  "du"."username", 
  "du"."channel", 
  "dc"."first_message_timestamp", 
  "dc"."last_message_timestamp" 
order by 
  "lastMessageDate" asc 
limit 
  20

However, if I uncomment the inner join, I get a runtime of about 9 minutes.

Here's a link to the SLOW query and execution plan in explain.depesz.com.

Here's a link to the faster version of the query, without the problematic inner join.

Here are the indexes for the two main tables involved, fact_milestone_event and fact_conversation_data.

fact_conversation_data:
  CREATE INDEX fact_conversation_data_conversationid_idx ON public.fact_conversation_data USING btree (conversationid)
  CREATE INDEX fact_conversation_data_tenant_id_idx ON public.fact_conversation_data USING btree (dim_tenant_tenant_id)
  CREATE INDEX fact_conversation_data_timestamp_idx ON public.fact_conversation_data USING btree (timestamp)
  CREATE UNIQUE INDEX conversationid_timeid_unique_idx ON public.fact_conversation_data USING btree (conversationid, dim_time_time_id)
  CREATE UNIQUE INDEX fact_conversation_data_pk ON public.fact_conversation_data USING btree (conversation_data_id)

fact_milestone_event:
  CREATE INDEX fact_milestone_event_dim_segment_segment_id_idx ON public.fact_milestone_event USING btree (dim_segment_segment_id)
  CREATE INDEX fact_milestone_event_time_id_idx ON public.fact_milestone_event USING btree (dim_time_time_id)
  CREATE INDEX fact_milestone_event_dim_milestone_milestone_id_idx ON public.fact_milestone_event USING btree (dim_milestone_milestone_id)
  CREATE INDEX fact_milestone_event_conversationid_idx ON public.fact_milestone_event USING btree (conversationid)
  CREATE INDEX fact_milestone_event_timestamp_idx ON public.fact_milestone_event USING btree (timestamp)
  CREATE INDEX fact_milestone_event_dim_tenant_tenant_id_idx ON public.fact_milestone_event USING btree (dim_tenant_tenant_id)
  CREATE INDEX fact_milestone_event_dim_user_user_id_idx ON public.fact_milestone_event USING btree (dim_user_user_id)
  CREATE UNIQUE INDEX fact_milestone_event_pk ON public.fact_milestone_event USING btree (event_id)

I've tried replacing the inner join with a WHERE clause containing a nested select (a sketch of that variant follows the snippet below). I've tried running EXPLAIN against the query, but had trouble understanding the output. I've also tried adding more conditions to the inner join, as below.

inner join (
  select "conversationid" 
  from "fact_milestone_event" as fme
  where 
  fme.dim_segment_segment_id in ('20736b82-4515-411f-9bc8-cf4d84ad69ac') AND
  fme.timestamp >= '2022-12-18 11:00:00.000' AND
  fme.timestamp <= '2023-01-18 10:59:59.999' AND
  fme.dim_tenant_tenant_id = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'
  group by "conversationid"
) as "fme" on "fme"."conversationid" = "conversation_table"."conversationid" 

I think the sub-selects can be eliminated from the joins and their logic moved into the larger query. Rather than repeating the constraints inside each sub-select, join on equality so the outer WHERE conditions apply to both tables.

inner join fact_milestone_event as fme on
  fme.conversationid = conversation_table.conversationid
  and fme.dim_tenant_tenant_id = conversation_table.dim_tenant_tenant_id
  and fme.timestamp = conversation_table.timestamp
where 
  conversation_table.timestamp >= '2022-12-18 11:00:00.000' 
  and conversation_table.timestamp <= '2023-01-18 10:59:59.999' 
  and conversation_table.dim_tenant_tenant_id = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'

The logic in notEmptyConservation_table can use the existing fme join.
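
Put together as a stand-alone query, the idea looks roughly like this. This is a sketch only, untested; it adds back the segment filter that the snippet above omits, and it assumes the fact_milestone_event and fact_conversation_data timestamps actually line up as the equality condition requires.

-- Direct join instead of the grouped sub-select: the equality conditions let the
-- WHERE constraints on conversation_table also constrain fact_milestone_event.
select distinct conversation_table.conversationid
from fact_conversation_data as conversation_table
inner join fact_milestone_event as fme
  on fme.conversationid = conversation_table.conversationid
  and fme.dim_tenant_tenant_id = conversation_table.dim_tenant_tenant_id
  and fme.timestamp = conversation_table.timestamp
  and fme.dim_segment_segment_id = '20736b82-4515-411f-9bc8-cf4d84ad69ac'
where
  conversation_table.timestamp >= '2022-12-18 11:00:00.000'
  and conversation_table.timestamp <= '2023-01-18 10:59:59.999'
  and conversation_table.dim_tenant_tenant_id = '4621ed8f-d8a4-46e2-a8de-5710751b16b9'

Whether the planner prefers this shape over the sub-select form would need to be checked with EXPLAIN against the real data.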
