Select first or random row in group by

I have this query using PostgreSQL 9.1 (9.2 as soon as our hosting platform upgrades):

SELECT
    media_files.album,
    media_files.artist,
    ARRAY_AGG(media_files.id) AS media_file_ids
FROM
    media_files
INNER JOIN playlist_media_files ON media_files.id = playlist_media_files.media_file_id
WHERE
    playlist_media_files.playlist_id = 1
GROUP BY
    media_files.album,
    media_files.artist
ORDER BY
    media_files.album ASC

and it's working fine. The goal was to extract album/artist combinations and, for each combination, return an array of its media file ids in the result set.

The problem is that I have another column in media_files, which is artwork.

artwork is unique for each media file (even within the same album), but in the result set I need to return just the first one of each group.

So, for an album that has 10 media files, I also have 10 corresponding artworks, but I would like to return just the first (or a randomly picked one for that collection).

Is it possible to do that with only SQL / window functions (first_value over ..)?

Yes, it's possible. First, let's tweak your query by adding aliases and explicit column qualifiers so it's clear what comes from where - assuming I've guessed correctly, since I can't be sure without table definitions:

SELECT
    mf.album,
    mf.artist,
    ARRAY_AGG (mf.id) AS media_file_ids
FROM
    "media_files" mf
INNER JOIN "playlist_media_files" pmf ON mf.id = pmf.media_file_id
WHERE
    pmf.playlist_id = 1
GROUP BY
    mf.album,
    mf.artist
ORDER BY
    mf.album ASC

Now you can either use a subquery in the SELECT list or maybe use DISTINCT ON, though it looks like any solution based on DISTINCT ON will be so convoluted as not to be worth it.

What you really want is something like a pick_arbitrary_value_agg aggregate that just picks the first value it sees and throws the rest away. There is no such aggregate and it isn't really worth implementing it for the job. You could use min(artwork) or max(artwork), and you may find that this actually performs better than the later solutions.

To use a subquery, leave the ORDER BY as it is and add the following as an extra column in your SELECT list:

(SELECT mf2.artwork 
 FROM media_files mf2 
 WHERE mf2.artist = mf.artist
   AND mf2.album = mf.album
 LIMIT 1) AS picked_artwork

You can, at a performance cost, randomize the selected artwork by adding ORDER BY random() before the LIMIT 1 above.
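Put together, the subquery approach might look like the sketch below. The two leading CTEs are hypothetical sample data standing in for the real tables (their column types and values are assumptions, since the question gives no table definitions), so the statement can be run on its own. Note that the subquery as written considers every media file of the album/artist pair, not only those in the playlist.

```sql
-- Hypothetical sample data; these CTEs shadow the real tables of the same name.
WITH media_files (id, album, artist, artwork) AS (
    VALUES (1, 'Album A', 'Artist X', 'a1.png'),
           (2, 'Album A', 'Artist X', 'a2.png'),
           (3, 'Album B', 'Artist Y', 'b1.png')
), playlist_media_files (playlist_id, media_file_id) AS (
    VALUES (1, 1), (1, 2), (1, 3)
)
SELECT
    mf.album,
    mf.artist,
    ARRAY_AGG(mf.id) AS media_file_ids,
    (SELECT mf2.artwork
     FROM media_files mf2
     WHERE mf2.artist = mf.artist
       AND mf2.album = mf.album
     ORDER BY random()   -- optional: drop this line for an arbitrary (cheaper) pick
     LIMIT 1) AS picked_artwork
FROM media_files mf
INNER JOIN playlist_media_files pmf ON mf.id = pmf.media_file_id
WHERE pmf.playlist_id = 1
GROUP BY mf.album, mf.artist
ORDER BY mf.album ASC;
```

Against the real tables, remove the two sample-data CTEs and keep the rest unchanged.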

Alternatively, here's a quick and dirty way to select a random row in-line:

(array_agg(artwork))[width_bucket(random(),0,1,count(artwork)::integer)] 
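In context, the expression would sit in the SELECT list next to the other aggregates, roughly as sketched here. The sample-data CTEs are hypothetical, and the sketch assumes artwork is never NULL: array_agg() keeps NULLs while count(artwork) skips them, so with NULLs present the pick would no longer be uniform over the whole array.

```sql
-- Hypothetical sample data; these CTEs shadow the real tables of the same name.
WITH media_files (id, album, artist, artwork) AS (
    VALUES (1, 'Album A', 'Artist X', 'a1.png'),
           (2, 'Album A', 'Artist X', 'a2.png'),
           (3, 'Album B', 'Artist Y', 'b1.png')
), playlist_media_files (playlist_id, media_file_id) AS (
    VALUES (1, 1), (1, 2), (1, 3)
)
SELECT
    mf.album,
    mf.artist,
    array_agg(mf.id) AS media_file_ids,
    -- random() is in [0, 1), so width_bucket() returns an index from 1 to count(...)
    (array_agg(mf.artwork))[width_bucket(random(), 0, 1, count(mf.artwork)::integer)]
        AS picked_artwork
FROM media_files mf
INNER JOIN playlist_media_files pmf ON mf.id = pmf.media_file_id
WHERE pmf.playlist_id = 1
GROUP BY mf.album, mf.artist
ORDER BY mf.album ASC;
```

Unlike the correlated subquery, this picks only among the rows that survived the join and the playlist filter.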

Since there's no sample data I can't test these modifications. Let me know if there's an issue.

"First" pick “第一”选择

Wouldn't it be simpler / cheaper to just use min():

SELECT m.album
      ,m.artist
      ,array_agg(m.id) AS media_file_ids
      ,min(m.artwork)  AS artwork
FROM   playlist_media_files p
JOIN   media_files          m ON m.id = p.media_file_id
WHERE  p.playlist_id = 1
GROUP  BY m.album, m.artist
ORDER  BY m.album, m.artist;
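min() returns the smallest artwork value, which is not necessarily the artwork of the "first" media file. If "first" is taken to mean the lowest id (an assumption, the question does not define it), an ordered aggregate is one possible variant; ORDER BY inside an aggregate call is available from PostgreSQL 9.0 on. The sample-data CTEs below are hypothetical and chosen so that the two notions of "first" differ.

```sql
-- Hypothetical sample data; these CTEs shadow the real tables of the same name.
WITH media_files (id, album, artist, artwork) AS (
    VALUES (1, 'Album A', 'Artist X', 'z.png'),
           (2, 'Album A', 'Artist X', 'a.png'),
           (3, 'Album B', 'Artist Y', 'b1.png')
), playlist_media_files (playlist_id, media_file_id) AS (
    VALUES (1, 1), (1, 2), (1, 3)
)
SELECT m.album
      ,m.artist
      ,array_agg(m.id ORDER BY m.id)            AS media_file_ids
      ,(array_agg(m.artwork ORDER BY m.id))[1]  AS artwork  -- artwork of the lowest id
FROM   playlist_media_files p
JOIN   media_files          m ON m.id = p.media_file_id
WHERE  p.playlist_id = 1
GROUP  BY m.album, m.artist
ORDER  BY m.album, m.artist;
-- For 'Album A' this picks 'z.png' (id 1), where min(m.artwork) would give 'a.png'.
```

The extra sort per group makes this somewhat more expensive than plain min().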

Arbitrary / random pick

If you are looking for a random selection, @Craig already provided a solution with truly random picks.

You could also use a CTE to avoid additional scans on the (possibly big) base table and then run two separate (cheap) subqueries on the small result set.

For arbitrary selection (not truly random; the result depends on the physical order of rows in the table and on implementation specifics):

WITH x AS (
   SELECT m.album, m.artist, m.id, m.artwork
   FROM   playlist_media_files p
   JOIN   media_files          m ON m.id = p.media_file_id
   WHERE  p.playlist_id = 1          -- same playlist filter as in the question
   )
SELECT a.album, a.artist, a.media_file_ids, b.artwork
FROM  (
   SELECT album, artist, array_agg(id) AS media_file_ids
   FROM   x
   GROUP  BY album, artist           -- one array per album/artist pair
   ) a
JOIN  (
   SELECT DISTINCT ON (1, 2)  album, artist, artwork   -- one arbitrary row per pair
   FROM   x
   ) b USING (album, artist);

For truly random results, you can add an ORDER BY .. random() like this to subquery b:

JOIN  (
   SELECT DISTINCT ON (1, 2)  album, artist, artwork
   FROM   x
   ORDER  BY 1, 2, random()
   ) b USING (album, artist);
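Assembled into one statement, the random variant could look like the following sketch. It assumes the CTE carries the playlist filter from the question and that subquery a groups by album and artist; the first two CTEs are hypothetical sample data so the statement runs on its own.

```sql
-- Hypothetical sample data; these CTEs shadow the real tables of the same name.
WITH media_files (id, album, artist, artwork) AS (
    VALUES (1, 'Album A', 'Artist X', 'a1.png'),
           (2, 'Album A', 'Artist X', 'a2.png'),
           (3, 'Album B', 'Artist Y', 'b1.png')
), playlist_media_files (playlist_id, media_file_id) AS (
    VALUES (1, 1), (1, 2), (1, 3)
), x AS (
   -- The base tables are scanned once; both subqueries below read this result.
   SELECT m.album, m.artist, m.id, m.artwork
   FROM   playlist_media_files p
   JOIN   media_files          m ON m.id = p.media_file_id
   WHERE  p.playlist_id = 1
   )
SELECT a.album, a.artist, a.media_file_ids, b.artwork
FROM  (
   SELECT album, artist, array_agg(id) AS media_file_ids
   FROM   x
   GROUP  BY album, artist
   ) a
JOIN  (
   -- DISTINCT ON keeps the first row per (album, artist); random() shuffles within each pair.
   SELECT DISTINCT ON (1, 2)  album, artist, artwork
   FROM   x
   ORDER  BY 1, 2, random()
   ) b USING (album, artist)
ORDER  BY a.album, a.artist;
```

Running it repeatedly should return either 'a1.png' or 'a2.png' for 'Album A', while the id arrays stay the same.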
