
Best way to select random rows PostgreSQL

I want a random selection of rows in PostgreSQL. I tried this:

select * from table where random() < 0.01;

But some others recommend this:

select * from table order by random() limit 1000;

I have a very large table with 500 million rows, and I want it to be fast.

Which approach is better? What are the differences? What is the best way to select random rows?

Fast ways

Given your specifications (plus additional info in the comments):

  • You have a numeric ID column (integer numbers) with only a few (or moderately few) gaps.
  • Obviously no or few write operations.
  • Your ID column has to be indexed! A primary key serves nicely. (A minimal schema sketch follows this list.)
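
For concreteness, a minimal sketch of the kind of table assumed in the queries below; the table name big and the payload column are placeholders, not part of the original question:

CREATE TABLE big (
   id      integer PRIMARY KEY  -- the indexed numeric ID column
 , payload text                 -- stands in for your actual columns
);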

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate to replace the full count will do just fine, available at almost no cost:

SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
FROM   pg_class
WHERE  oid = 'big'::regclass;  -- your table name


As long as ct isn't much smaller than id_span, the query will outperform other approaches.

WITH params AS (
   SELECT 1       AS min_id           -- minimum id <= current min id
        , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
   SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
   FROM   params p
        , generate_series(1, 1100) g  -- 1000 + buffer
   GROUP  BY 1                        -- trim duplicates
) r
JOIN   big USING (id)
LIMIT  1000;                          -- trim surplus

  • Generate random numbers in the id space. You have "few gaps", so add 10% (enough to easily cover the blanks) to the number of rows to retrieve.

  • Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).

  • Join the ids to the big table. This should be very fast with the index in place.

  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
TABLE  random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps, so we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space, or the recursion may run dry before the limit is reached - or we have to start with a large enough buffer, which defies the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows, and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap into function

For repeated use with the same table with varying parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (           -- get current estimate from system
      SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
      FROM   pg_class
      WHERE  oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus           -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  _limit             -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses
   )
   TABLE  random_pick
   LIMIT  _limit;
END
$func$;

Call:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

Generic function

We can make this generic to work for any table with a unique integer column (typically the PK): pass the table as a polymorphic type and (optionally) the name of the PK column, and use EXECUTE:

CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement
                                         , _id text = 'id'
                                         , _limit int = 1000
                                         , _gaps real = 1.03)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   -- safe syntax with schema & quotes where needed
   _tbl text := pg_typeof(_tbl_type)::text;
   _estimate int := (SELECT (reltuples / relpages
                          * (pg_relation_size(oid) / 8192))::bigint
                     FROM   pg_class  -- get current estimate from system
                     WHERE  oid = _tbl::regclass);
BEGIN
   RETURN QUERY EXECUTE format(
   $$
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   generate_series(1, $2) g
         LIMIT  $2                 -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)     -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  $3                 -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)     -- eliminate misses
   )
   TABLE  random_pick
   LIMIT  $3;
   $$
 , _tbl, _id
   )
   USING _estimate              -- $1
       , (_limit * _gaps)::int  -- $2 ("surplus")
       , _limit                 -- $3
   ;
END
$func$;

Call with defaults (important!):

SELECT * FROM f_random_sample(null::big);  --!

Or more specifically:

SELECT * FROM f_random_sample(null::"my_TABLE", 'oDD ID', 666, 1.15);

About the same performance as the static version.


This is safe against SQL injection.

Possible alternative

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), consider a MATERIALIZED VIEW. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
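
A minimal sketch of that idea, reusing the simplified query from above; the view name is just illustrative:

CREATE MATERIALIZED VIEW big_random_1000 AS
SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;

-- Re-draw the sample at whatever interval or event suits you:
REFRESH MATERIALIZED VIEW big_random_1000;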

Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

Where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

Bold emphasis mine. It's very fast, but the result is not exactly random. The manual again:

The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
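
For comparison, a sketch with BERNOULLI, which samples at the row level instead of the block level and is therefore closer to random, at the cost of speed:

SELECT * FROM big TABLESAMPLE BERNOULLI ((1000 * 100) / 5100000.0);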


Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:

SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);

See Evan's answer for details.

But that's still not exactly random.

You can examine and compare the execution plans of both by using

EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;

A quick test on a large table [1] shows that the ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.1 only scans the complete table once.

For large tables this might not be what you want, as even one complete table scan might take too long.

A third proposal would be

select * from table where random() < 0.01 limit 1000;

This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course this bogs down the randomness a bit, but perhaps this is good enough in your case.

Edit: Besides these considerations, you might check out the questions already asked about this. Using the search query [postgresql] random returns quite a few hits.

And a linked article by depesz outlines several more approaches.


1 "large" as in "the complete table will not fit into the memory". 1 “大”,如“完整的表不适合内存”。

postgresql order by random(), select rows in random order:

This is slow because it orders the whole table to guarantee that every row gets an exactly equal chance of being chosen. A full table scan is unavoidable for perfect randomness.

select your_columns from your_table ORDER BY random()

postgresql order by random() with a distinct:

select * from 
  (select distinct your_columns from your_table) table_alias
ORDER BY random()

postgresql order by random() limit one row:

This is also slow, because it has to table scan to make sure every row that might be chosen has an equal chance of being chosen, right this instant:

select your_columns from your_table ORDER BY random() limit 1

Constant Time Select Random N rows with periodic table scan:

If your table is huge then the above table-scans are a show stopper, taking up to 5 minutes to finish.

To go faster you can schedule a behind-the-scenes nightly table-scan reindexing which will guarantee a perfectly random selection in O(1) constant time, except during the nightly reindexing table-scan, where it must wait for maintenance to finish before you may receive another random row.

--Create a demo table with lots of random nonuniform data, big_data 
--is your huge table you want to get random rows from in constant time. 
drop table if exists big_data;  
CREATE TABLE big_data (id serial unique, some_data text );  
CREATE INDEX ON big_data (id);  
--Fill it with ten million rows which simulates your beautiful data:  
INSERT INTO big_data (some_data) SELECT md5(random()::text) AS some_data
FROM generate_series(1,10000000);
 
--This delete statement puts holes in your index
--making it NONuniformly distributed  
DELETE FROM big_data WHERE id IN (2, 4, 6, 7, 8); 
 
 
--Do the nightly maintenance task on a schedule at 1AM.
drop table if exists big_data_mapper; 
CREATE TABLE big_data_mapper (id serial, big_data_id int); 
CREATE INDEX ON big_data_mapper (id); 
CREATE INDEX ON big_data_mapper (big_data_id); 
INSERT INTO big_data_mapper(big_data_id) SELECT id FROM big_data ORDER BY id;
 
--We have to use a function because the big_data_mapper might be out-of-date
--in between nightly tasks, so to solve the problem of a missing row, 
--you try again until you succeed.  In the event the big_data_mapper 
--is broken, it tries 25 times then gives up and returns -1. 
CREATE or replace FUNCTION get_random_big_data_id()  
RETURNS int language plpgsql AS $$ 
declare  
    response int; 
BEGIN
    --Loop is required because big_data_mapper could be old
    --Keep rolling the dice until you find one that hits.
    for counter in 1..25 loop
        SELECT big_data_id 
        FROM big_data_mapper OFFSET floor(random() * ( 
            select max(id) biggest_value from big_data_mapper 
            )
        ) LIMIT 1 into response;
        if response is not null then
            return response;
        end if;
    end loop;
    return -1;
END;  
$$; 
 
--get a random big_data id in constant time: 
select get_random_big_data_id(); 
 
--Get 1 random row from big_data table in constant time: 
select * from big_data where id in ( 
    select get_random_big_data_id() from big_data limit 1 
); 
┌─────────┬──────────────────────────────────┐ 
│   id    │            some_data             │ 
├─────────┼──────────────────────────────────┤ 
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │ 
└─────────┴──────────────────────────────────┘ 

--Get 3 random rows from big_data in constant time: 
select * from big_data where id in ( 
    select get_random_big_data_id() from big_data limit 3 
);
┌─────────┬──────────────────────────────────┐ 
│   id    │            some_data             │ 
├─────────┼──────────────────────────────────┤ 
│ 2722848 │ fab6a7d76d9637af89b155f2e614fc96 │ 
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │ 
│ 9475611 │ 36ac3eeb6b3e171cacd475e7f9dade56 │ 
└─────────┴──────────────────────────────────┘ 

--Test what happens when big_data_mapper stops receiving 
--nightly reindexing.
delete from big_data_mapper where 1=1; 
select get_random_big_data_id();   --It tries 25 times, and returns -1
                                   --which means wait N minutes and try again.

Adapted from: https://www.gab.lc/articles/bigdata_postgresql_order_by_random

Alternatively, if all of the above is too much work:

A simpler, good-enough solution for constant-time random row selection is to add a new column to your big table big_data, called mapper_int; make it not null with a unique index. Every night, reset the column with a unique integer between 1 and max(n). To get a random row, you "choose a random integer between 0 and max(id)" and return the row whose mapper_int is that value. If there's no row with that id, because the row has changed since the re-index, choose another random row. If a row is added to big_data, then populate its mapper_int with max(id) + 1.
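
A minimal sketch of that scheme, assuming the big_data table from above; the constraint name and the deferrable trick are my additions so the nightly renumbering can reshuffle values inside one transaction:

-- One-time setup: the mapping column plus a unique (deferrable) constraint
ALTER TABLE big_data ADD COLUMN mapper_int int;
ALTER TABLE big_data ADD CONSTRAINT big_data_mapper_int_uni
      UNIQUE (mapper_int) DEFERRABLE INITIALLY DEFERRED;

-- Nightly maintenance: renumber all rows densely from 1 .. count(*)
BEGIN;
UPDATE big_data b
SET    mapper_int = m.rn
FROM  (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM big_data) m
WHERE  b.id = m.id;
COMMIT;

-- Lookup: pick one random integer in range (evaluated once in the subquery);
-- on a miss (row changed since the renumbering), simply try again.
SELECT *
FROM   big_data
WHERE  mapper_int = (SELECT 1 + floor(random() * max(mapper_int))::int FROM big_data);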

Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:

SELECT * FROM mytable TABLESAMPLE SYSTEM (5);

This example will give you 5% of elements from mytable.

See more explanation in the documentation: http://www.postgresql.org/docs/current/static/sql-select.html

The one with the ORDER BY is going to be the slower one.

select * from table where random() < 0.01; goes record by record and decides whether to randomly filter it or not. This is going to be O(N) because it only needs to check each record once.

select * from table order by random() limit 1000; is going to sort the entire table, then pick the first 1000. Aside from any voodoo magic behind the scenes, the order by is O(N * log N).

The downside to the random() < 0.01 one is that you'll get a variable number of output records.


Note, there is a better way to shuffle a set of data than sorting by random: the Fisher-Yates Shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite the challenge, though.
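
For what it's worth, a Fisher-Yates pass is manageable in PL/pgSQL once the values sit in an array; a minimal sketch (the function name and the int[] signature are purely illustrative):

CREATE OR REPLACE FUNCTION fisher_yates_shuffle(_ids int[])
  RETURNS int[]
  LANGUAGE plpgsql VOLATILE AS
$func$
DECLARE
   j   int;
   tmp int;
BEGIN
   -- Walk backwards, swapping each element with a random earlier (or same) slot
   FOR i IN REVERSE coalesce(array_length(_ids, 1), 0) .. 2 LOOP
      j := 1 + trunc(random() * i)::int;   -- random position in 1..i
      tmp     := _ids[i];
      _ids[i] := _ids[j];
      _ids[j] := tmp;
   END LOOP;
   RETURN _ids;
END
$func$;

Of course, this only helps for sets small enough to aggregate into an array first, so it does not replace the table-level approaches above.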

select * from table order by random() limit 1000;

If you know how many rows you want, check out tsm_system_rows.

tsm_system_rows

The module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command.

This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.

First install the extension:

CREATE EXTENSION tsm_system_rows;

Then your query:

SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);

Here is a solution that works for me. I guess it's very simple to understand and execute.

SELECT 
  field_1, 
  field_2, 
  field_3, 
  random() as ordering
FROM 
  big_table
WHERE 
  some_conditions
ORDER BY
  ordering 
LIMIT 1000;

If you want just one row, you can use a calculated offset derived from count.

select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));

One lesson from my experience:

offset floor(random() * N) limit 1 is not faster than order by random() limit 1.

I thought the offset approach would be faster because it would save the time of sorting in Postgres. It turns out it wasn't.

A variation of the materialized view "Possible alternative" outlined by Erwin Brandstetter is possible.

Say, for example, that you don't want duplicates in the randomized values that are returned. So you will need to set a boolean value on the primary table containing your (non-randomized) set of values.

Assuming this is the input table:

id_values  id  |   used
           ----+--------
           1   |   FALSE
           2   |   FALSE
           3   |   FALSE
           4   |   FALSE
           5   |   FALSE
           ...
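
A matching definition for that input table might look like this (a sketch; adjust the id type to your data):

CREATE TABLE id_values (
   id   int     PRIMARY KEY
 , used boolean NOT NULL DEFAULT FALSE
);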

Populate the ID_VALUES table as needed. Then, as described by Erwin, create a materialized view that randomizes the ID_VALUES table once:

CREATE MATERIALIZED VIEW id_values_randomized AS
  SELECT id
  FROM id_values
  ORDER BY random();

Note that the materialized view does not contain the used column, because this will quickly become out-of-date. Nor does the view need to contain other columns that may be in the id_values table.

In order to obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting id_values from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:

UPDATE id_values
SET used = TRUE
WHERE id_values.id IN 
  (SELECT i.id
    FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
    WHERE (NOT i.used)
    LIMIT 5)
RETURNING id;

Change LIMIT as necessary -- if you only need one random value at a time, change LIMIT to 1.

With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new ids are added to the id_values table.
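
The refresh itself is a single statement, so it is easy to schedule for off-peak hours:

REFRESH MATERIALIZED VIEW id_values_randomized;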

Add a column called r with type serial. Index r.
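
As a sketch, that setup step could look like this (YOUR_TABLE as in the query below; adding a serial column backfills existing rows automatically):

ALTER TABLE YOUR_TABLE ADD COLUMN r serial;
CREATE INDEX ON YOUR_TABLE (r);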

Assume we have 200,000 rows; we are going to generate a random number n, where 0 < n <= 200,000.

Select rows with r > n, sort them ASC, and select the smallest one.

Code:

select * from YOUR_TABLE 
where r > (
    select (
        select reltuples::bigint AS estimate
        from   pg_class
        where  oid = 'public.YOUR_TABLE'::regclass) * random()
    )
order by r asc limit(1);

The code is self-explanatory. The subquery in the middle is used to quickly estimate the table row count, from https://stackoverflow.com/a/7945274/1271094.

At the application level, you need to execute the statement again if n > the number of rows or if you need to select multiple rows.

I know I'm a little late to the party, but I just found this awesome tool called pg_sample:

pg_sample - extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity.

I tried this with a database of 350M rows and it was really fast; I don't know about the randomness.

./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db

I think the best way is:

SELECT * FROM tableName ORDER BY random() LIMIT 1
