How to get a repeatable sample using Presto SQL?

I am trying to get a sample of data from a large table and want to make sure the same sample can be reproduced later on. Other SQL dialects allow repeatable sampling, either by setting a seed with set.seed(integer) or with a repeatable (integer) clause. However, neither works for me in Presto. Is such a command not available yet? Thanks.

One solution is to simulate the sampling: add a column (or create a view) holding a random value such as a UUID, then select rows by filtering on that column (for example, UUIDs ending in '1'). You can tune the condition to get the sample size you need.

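By design, the result is random and also repeatable across multiple runs, provided the random values are stored with the rows rather than regenerated at query time (a view that calls a function such as uuid() would return a different sample on each run).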
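A minimal sketch of this approach in Presto SQL. The names big_table, big_table_keyed and sample_key are hypothetical, and the sketch assumes a connector that supports CREATE TABLE AS. The random key is written once, so later queries filter on the same stored values.

```sql
-- Hypothetical names: big_table, big_table_keyed, sample_key.
-- Run once: store one random key per row alongside the data.
CREATE TABLE big_table_keyed AS
SELECT
    t.*,
    cast(uuid() AS varchar) AS sample_key
FROM big_table t;

-- Keep rows whose key ends in '1': one of 16 hex digits,
-- so roughly 1/16 (about 6%) of the rows.
SELECT *
FROM big_table_keyed
WHERE sample_key LIKE '%1';

-- Widen the condition for a larger sample, here roughly 25% of the rows.
SELECT *
FROM big_table_keyed
WHERE substr(sample_key, -1) IN ('0', '1', '2', '3');
```

Each extra hex digit accepted by the filter adds about 1/16 of the rows, which is one way to tune the sample size.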

You can create a simple intermediate table containing the sampled ids:

CREATE TABLE IF NOT EXISTS <temp1>
AS
SELECT <id_column>
FROM <tablename> TABLESAMPLE SYSTEM (10);

The table contains only the sampled ids and is ready to use downstream in your analysis by doing a JOIN with the data of interest. Because the table is created once and then reused, the sample stays fixed across later queries.
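The downstream JOIN could look like the sketch below. The names sampled_ids (the intermediate table), events and id are hypothetical placeholders, and the sketch assumes id is unique in sampled_ids.

```sql
-- Hypothetical names: sampled_ids (intermediate table), events, id.
-- Restrict the data of interest to the rows whose id was sampled.
SELECT e.*
FROM events e
JOIN sampled_ids s
    ON e.id = s.id;

-- If id may repeat in sampled_ids, a semi-join avoids duplicated rows.
SELECT e.*
FROM events e
WHERE e.id IN (SELECT id FROM sampled_ids);
```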

If you are using Presto 0.263 or higher, you can use key_sampling_percent to reproducibly generate a double between 0.0 and 1.0 from a varchar.

For example, to reproducibly sample 20% of the records in table using the id column:

select
    id
from table
where key_sampling_percent(id) < 0.2
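To check how close the sample comes to the target fraction, a query along these lines may help. It is only a sketch: my_table is a hypothetical table name, and id is assumed to be a varchar column.

```sql
-- Hypothetical table name: my_table, with a varchar id column.
-- Compare the number of sampled rows with the total row count.
SELECT
    count(*) AS total_rows,
    count_if(key_sampling_percent(id) < 0.2) AS sampled_rows
FROM my_table;
```

Because the value depends only on id, lowering the threshold (say to 0.1) selects a subset of the rows chosen at 0.2, so samples of different sizes stay nested.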

If you are using an older version of Presto (e.g. AWS Athena), you can use what's in the source code for key_sampling_percent:

select
    id
from table
where (abs(from_ieee754_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2

I have found that you have to use from_big_endian_64 instead of from_ieee754_64 to get reliable results in Athena. Otherwise I got too many numbers close to zero because of the negative exponent.

select
    id
from table
where (abs(from_big_endian_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
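One way to see whether the hash values spread evenly is to count rows per bucket. The sketch below assumes a hypothetical table my_table with a varchar id column; with from_big_endian_64 the 100 buckets should come out roughly equal in size.

```sql
-- Hypothetical table name: my_table, with a varchar id column.
-- Each id maps to a bucket from 0 to 99; a filter of "bucket < 20"
-- is equivalent to the "/ 100. < 0.2" condition used above.
SELECT
    abs(from_big_endian_64(xxhash64(cast(id AS varbinary)))) % 100 AS bucket,
    count(*) AS rows_in_bucket
FROM my_table
GROUP BY 1
ORDER BY 1;
```

Running the same query with from_ieee754_64 in place of from_big_endian_64 shows the skew directly: the result is then a double rather than an integer bucket, and many values land near zero.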

