SQL with multiple joins causing duplicates

I am trying to write this query with multiple left joins, but it returns duplicate updates and scientists for each charge associated with the project id (e.g. if there are 5 charges, then each update and scientist is returned 5 times). I'm trying to avoid multiple select statements but have been having trouble with this.

SELECT
  projects.*,
  coalesce(json_agg(updates ORDER BY update_date DESC) FILTER (WHERE updates.id IS NOT NULL), '[]') AS updates,
  coalesce(json_agg(scientists) FILTER (WHERE scientists.user_id IS NOT NULL), '[]') AS scientists,
  coalesce(SUM(charges.amount), 0) AS donated,
  coalesce(COUNT(charges), 0) AS num_donations
FROM projects
LEFT JOIN updates
ON updates.project_id = projects.id
LEFT JOIN scientists
ON scientists.project_id = projects.id
LEFT JOIN charges
ON charges.project_id = projects.id
WHERE projects.id = '${id}'
GROUP BY projects.id;

Expected results (changed to only return ids):

                  id                  |                   updates                |             scientists             | donated | num_donations 
--------------------------------------+------------------------------------------+------------------------------------+---------+---------------
 17191850-9a03-482f-9afe-7dc6b69974ea | ["0c29417f-0afb-44df-a8cf-24dc5cc7962c"] | ["auth0|5efcfb5f652e5a0019ce2193"] |     155 |             5

Actual results:

                  id                  |                                                                                                 updates                                                                                                  |                                                                                 scientists                                                                                 | donated | num_donations 
--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+---------------
 17191850-9a03-482f-9afe-7dc6b69974ea | ["0c29417f-0afb-44df-a8cf-24dc5cc7962c", "0c29417f-0afb-44df-a8cf-24dc5cc7962c", "0c29417f-0afb-44df-a8cf-24dc5cc7962c", "0c29417f-0afb-44df-a8cf-24dc5cc7962c", "0c29417f-0afb-44df-a8cf-24dc5cc7962c"] | ["auth0|5efcfb5f652e5a0019ce2193", "auth0|5efcfb5f652e5a0019ce2193", "auth0|5efcfb5f652e5a0019ce2193", "auth0|5efcfb5f652e5a0019ce2193", "auth0|5efcfb5f652e5a0019ce2193"] |     155 |             5

If you have this:

SELECT p.column, s.column, u.column
FROM 
  p 
  JOIN s ON ...
  JOIN u ON ...

And it produces one row:

p1, s1, u1

And then you join another table in:

SELECT p.column, s.column, u.column, c.column
FROM 
  p 
  JOIN s ON ...
  JOIN u ON ...
  JOIN c ON ...

And it suddenly produces 5 rows:

p1, s1, u1, c1
p1, s1, u1, c2
p1, s1, u1, c3
p1, s1, u1, c4
p1, s1, u1, c5   

And you want it to produce one row again, but with another column holding a count of 5:

p1, s1, u1, 5

Then you need to group the repeating data and add a count:

SELECT p.column, s.column, u.column, count(*)
FROM 
  p 
  JOIN s ON ...
  JOIN u ON ...
  JOIN c ON ...
GROUP BY p.column, s.column, u.column

You'll note that the GROUP BY section is just an exact repeat of the SELECT section, minus the count (an aggregate column).

The database will group the data according to the key specified in the GROUP BY. p1, s1, u1 is a unique combination and is associated with 5 different values, c1 .. c5. The aggregation in this case doesn't apply to the cX data (because it's count(*)), but it could, if we were to say:

SELECT p.column, s.column, u.column, min(c.column), max(c.column)

Then the DB builds this data set, with a bucket that contains all the c values:

p1, s1, u1, [c1, c2, c3, c4, c5]

And applies the MIN and MAX functions to the [c1, c2, c3, c4, c5] bucket, pulling c1 and c5 respectively.

In your mind, get used to seeing a grouping operation as preparing the unique combinations of the columns in the GROUP BY, plus putting all the other items of data in a big unordered bucket. The MAX/MIN/AVG etc. functions operate on the bucket contents and pull the relevant data (which could come from any row, and naturally MIN and MAX will probably pull from different rows). Grouping loses the notion of "this input row" because it prepares a new set of rows.
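The bucket idea above can be tried on throwaway data. This is a minimal sketch for PostgreSQL; the tables p, s, u and c and their columns are hypothetical stand-ins built from VALUES lists, not the schema from the question.

```sql
-- Hypothetical stand-in tables; one p, one s, one u, five c rows.
WITH p(id)         AS (VALUES (1)),
     s(p_id, name) AS (VALUES (1, 's1')),
     u(p_id, name) AS (VALUES (1, 'u1')),
     c(p_id, name) AS (VALUES (1, 'c1'), (1, 'c2'), (1, 'c3'), (1, 'c4'), (1, 'c5'))
SELECT p.id,
       s.name      AS s_name,
       u.name      AS u_name,
       count(*)    AS c_count,  -- number of rows that fell into the bucket
       min(c.name) AS first_c,  -- smallest value in the bucket
       max(c.name) AS last_c    -- largest value in the bucket
FROM   p
JOIN   s ON s.p_id = p.id
JOIN   u ON u.p_id = p.id
JOIN   c ON c.p_id = p.id
GROUP  BY p.id, s.name, u.name;
-- On this sample data the result is a single row: 1, s1, u1, 5, c1, c5
```

Removing the GROUP BY and the three aggregates from the sketch shows the five joined rows that feed the bucket.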


In most typical grouping situations in various DBs you can't use SELECT * if you're grouping: you list out every one of the columns in the SELECT and again in the GROUP BY. This might seem redundant (and indeed some databases allow you to skip providing a group by), but in advanced scenarios it's possible to group by different things than you select, so it's only redundant in the simple case.


Now, hopefully you're down with all of the above. Some databases have functions that aren't just MIN/MAX etc. but will concatenate all the results in the bucket. Something like this pseudo-SQL:

SELECT p.column, s.column, u.column, STRING_JOIN(c.column, '|')

Could produce:

p1, s1, u1, c1|c2|c3|c4|c5

The string_join function is designed to concatenate all the things in the bucket, using the pipe character specified as the delimiter.
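STRING_JOIN above is pseudo-SQL. In PostgreSQL, which the question appears to use, the built-in aggregate that plays this role is string_agg. A small sketch on a hypothetical stand-in table c:

```sql
-- Hypothetical stand-in table c(p_id, name).
WITH c(p_id, name) AS (VALUES (1, 'c1'), (1, 'c2'), (1, 'c3'))
SELECT p_id,
       string_agg(name, '|' ORDER BY name) AS c_names  -- concatenate the bucket, pipe-delimited
FROM   c
GROUP  BY p_id;
-- On this sample data: 1, c1|c2|c3
```

The ORDER BY inside the aggregate call is optional; without it the order of the concatenated items is not guaranteed.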

But remember that our original data was:

p1, s1, u1, c1
p1, s1, u1, c2
p1, s1, u1, c3
p1, s1, u1, c4
p1, s1, u1, c5  

If we were to GROUP BY just p.column, the DB would use p1 as the key and make more buckets:

p1, [s1,s1,s1,s1,s1], [u1,u1,u1,u1,u1], [c1,c2,c3,c4,c5]

If you were to STRING_JOIN each of these, you'd end up with exactly what your query asked for, repeats included:

SELECT p.column, STRING_JOIN(s.column, '|'), STRING_JOIN(u.column, '|'), STRING_JOIN(c.column, '|')

p1, s1|s1|s1|s1|s1, u1|u1|u1|u1|u1, c1|c2|c3|c4|c5

There isn't any AI in the DB that will look and say "I'll remove duplicates from the s and u buckets before I join", nor should there be. As I mentioned before, all concept of rows and ordering is lost when data goes into a bucket for aggregation. If your data was:

p1, x1, y1
p1, x2, y2

And you grouped/joined, you could end up with:

p1, x1|x2, y2|y1

Note that the order of elements in the y string is inverted compared to x. Don't rely on "the order of elements in the set" to infer anything about, e.g., the row they originally came from.
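If the element order does matter, PostgreSQL lets you put an ORDER BY inside each aggregate call, so nothing depends on bucket order. A sketch on a hypothetical stand-in table t with the two rows from above:

```sql
-- Hypothetical stand-in table t(p, x, y).
WITH t(p, x, y) AS (VALUES ('p1', 'x1', 'y1'), ('p1', 'x2', 'y2'))
SELECT p,
       string_agg(x, '|' ORDER BY x) AS xs,
       string_agg(y, '|' ORDER BY x) AS ys  -- sorted by the same key as xs, so positions line up
FROM   t
GROUP  BY p;
-- On this sample data: p1, x1|x2, y1|y2
```

Both aggregates sort by x, so the n-th element of ys comes from the same input row as the n-th element of xs.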

So, what's going on with your query? Well, you're grouping by just one column and aggregating the others, like above, so you can see how you get repetitions of the non-grouped columns.
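The repetition can be reproduced on a small scale. The sketch below uses hypothetical stand-in tables (one s row, one u row, three c rows) and PostgreSQL's string_agg. It also shows DISTINCT inside the aggregate, which is a different technique from the grouping described above and a blunt one: it would merge genuinely repeated values too.

```sql
-- Hypothetical stand-in tables; names and columns are illustrative only.
WITH p(id)           AS (VALUES (1)),
     s(p_id, name)   AS (VALUES (1, 's1')),
     u(p_id, name)   AS (VALUES (1, 'u1')),
     c(p_id, amount) AS (VALUES (1, 10), (1, 20), (1, 30))
SELECT p.id,
       string_agg(s.name, '|')          AS s_all,       -- s1|s1|s1, once per charge row
       string_agg(DISTINCT s.name, '|') AS s_distinct,  -- s1
       string_agg(DISTINCT u.name, '|') AS u_distinct,  -- u1
       sum(c.amount)                    AS donated      -- 60
FROM   p
LEFT   JOIN s ON s.p_id = p.id
LEFT   JOIN u ON u.p_id = p.id
LEFT   JOIN c ON c.p_id = p.id
GROUP  BY p.id;
```

The sum is only right here because there is exactly one s row and one u row. With two updates, every charge would be joined twice and donated would double, and DISTINCT cannot repair that.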

If you kept on grouping by all the columns then you'd have your single scientists and updates. If you desperately want them as JSON, then (assuming this really is postgres) you have to_json and row_to_json, which will give a single json value, but that doesn't really add much that individual columns don't already give you. Postgres (if this is postgres) will allow you to GROUP BY s.*, u.* to make the json calls work:

SELECT p.column, row_to_json(s), row_to_json(u), count(*)
...
GROUP BY p.column, s.*, u.*

The presence of s.* and u.* will allow the row_to_json calls to produce the single row of json describing S and U, and the count will count the Cs.

Your joins multiply the rows, since there are multiple matches in several tables, as has been thoroughly explained by Caius Jard.

A typical solution is to pre-aggregate in subqueries. For your use case, where you are filtering on just one project, lateral joins should be the most efficient option:

SELECT p.*, u.*, s.*, c.*
FROM projects p
CROSS JOIN LATERAL (
    -- all updates of this project as one JSON array, newest first
    SELECT coalesce(json_agg(u ORDER BY u.update_date DESC) FILTER (WHERE u.id IS NOT NULL), '[]') AS updates
    FROM updates u
    WHERE u.project_id = p.id
) u
CROSS JOIN LATERAL (
    -- all scientists of this project as one JSON array
    SELECT coalesce(json_agg(s) FILTER (WHERE s.user_id IS NOT NULL), '[]') AS scientists
    FROM scientists s
    WHERE s.project_id = p.id
) s
CROSS JOIN LATERAL (
    -- count(*) never returns NULL, so only the sum needs coalesce
    SELECT coalesce(SUM(c.amount), 0) AS donated, COUNT(*) AS num_donations
    FROM charges c
    WHERE c.project_id = p.id
) c
WHERE p.id = '${id}'

The basic problem is exactly the same as here:

You later commented:

there is only one distinct update and scientist in the DB associated with that project id

If that's guaranteed to be true, all you need is to aggregate rows from table charges before you join:

SELECT p.*
     , COALESCE(to_json(u), '[]') AS updates
     , COALESCE(to_json(s), '[]') AS scientists
     , c.donated
     , c.num_donations
FROM   projects        p
LEFT   JOIN updates    u ON u.project_id = p.id
LEFT   JOIN scientists s ON s.project_id = p.id
CROSS  JOIN (
   SELECT COALESCE(SUM(amount), 0) AS donated
        , COUNT(*)    AS num_donations
   FROM   charges
   WHERE  project_id = '${id}'
   ) c
WHERE  p.id = '${id}'

The subquery on charges can be that simple because the only filter is the same ID as used in the outer query. We also do not need COALESCE() for the count because ...

  1. ... count() never returns NULL anyway.
  2. ... the subquery (with aggregate functions and no GROUP BY) is guaranteed to return exactly one row, aggregating all qualifying rows, even if 0 rows qualify.

If there can be multiple related rows in the tables updates or scientists after all, aggregate in a similar fashion before you CROSS JOIN.
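A sketch of what that could look like, with every child table aggregated in its own subquery before the CROSS JOIN. The CTEs at the top are hypothetical sample data standing in for the real tables, the integer id 1 stands in for '${id}', and the aliases u_agg, s_agg and c_agg are made up for this example.

```sql
-- Hypothetical sample data in place of the real tables.
WITH projects(id) AS (VALUES (1)),
     updates(id, project_id, update_date) AS (
        VALUES (10, 1, DATE '2020-07-01'), (11, 1, DATE '2020-07-02')),
     scientists(user_id, project_id) AS (VALUES ('alice', 1), ('bob', 1)),
     charges(project_id, amount)     AS (VALUES (1, 50), (1, 100), (1, 5))
SELECT p.*, u_agg.updates, s_agg.scientists, c_agg.donated, c_agg.num_donations
FROM   projects p
CROSS  JOIN (   -- exactly one row: all updates as a JSON array, newest first
   SELECT COALESCE(json_agg(u ORDER BY u.update_date DESC), '[]') AS updates
   FROM   updates u
   WHERE  u.project_id = 1
   ) u_agg
CROSS  JOIN (   -- exactly one row: all scientists as a JSON array
   SELECT COALESCE(json_agg(s), '[]') AS scientists
   FROM   scientists s
   WHERE  s.project_id = 1
   ) s_agg
CROSS  JOIN (   -- exactly one row: totals over all charges
   SELECT COALESCE(SUM(amount), 0) AS donated
        , COUNT(*)                 AS num_donations
   FROM   charges
   WHERE  project_id = 1
   ) c_agg
WHERE  p.id = 1;
-- On this sample data: two updates, two scientists, donated = 155, num_donations = 3
```

Each subquery returns exactly one row, so the cross joins cannot multiply anything, and each array holds every related row once.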
