
How do I select a subset of records in BigQuery SQL?

I have a set of records in BigQuery with a column (CPIRating) that I would like to use to select a subset.

CPIRating is a numeric value ranging from 0.1 to 250. I have over 10,000 records. What I am trying to create is a single subset/dataset built as follows:

  1. It selects all records that have a CPIRating of 3.0 or greater.
  2. It counts the number of records that have a CPIRating of 3.0 or greater.
  3. It takes 4x that count from the records with a CPIRating below 3.0, starting from the lowest CPIRating value, and adds those records to the dataset.

As an example, if the dataset has 1000 records with a CPIRating of 3.0 or greater, the query finds those and also adds a further 4000 records (4x) that are below 3.0. Those 4000 records start with the lowest CPIRating value (closest to 0.0) and are added in ascending order until 4000 is reached.

Any ideas on how to structure that query in BigQuery?

First we generate some dummy data in the table demo_tbl. Since CPIRating is uniformly distributed in this example (it comes from rand() * 3.2), we choose 3.2 as the maximum, so values fall between zero and 3.2 and only about 6% of the rows (0.2 / 3.2) reach 3.0 or more.

In the CTE help we count the rows that have a CPIRating of 3 or higher. from demo_tbl, help joins both tables together (a cross join against the single-row help), which gives us an additional column CPIRating_count. We number the rows by ascending CPIRating with row_number(), producing row_id. Since row_id comes from a window function (one with an over clause), it cannot be filtered in where; a qualify clause is needed to filter the rows. In this filter the CPIRating<3.0 condition is not needed, but I find it easier to read.

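-- Dummy data: 101 rows with CPIRating drawn uniformly from [0, 3.2)
with demo_tbl as (
  select *, rand() * 3.2 as CPIRating
  from unnest(generate_array(0, 1 * 100)) id
),
-- Single-row helper: number of rows with CPIRating >= 3.0
help as (
  select count(1) as CPIRating_count from demo_tbl where CPIRating >= 3.0
)
select *,
  row_number() over (order by CPIRating) as row_id
from demo_tbl, help
-- row_id starts at 1, so <= keeps exactly 4x the count from the low end
qualify (row_id <= 4 * help.CPIRating_count and CPIRating < 3.0) or CPIRating >= 3.0
order by row_id desc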
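
To run the same selection against a real table instead of the dummy data, something like the following sketch should work. The table name `my_project.my_dataset.cpi_records` is a placeholder I made up, and I am assuming the column is called CPIRating as in the question. It uses row_id <= 4 * CPIRating_count so that exactly four times the count is kept from the low end (row numbers start at 1), and it leaves out the redundant CPIRating < 3.0 condition.

```sql
-- Sketch only: `my_project.my_dataset.cpi_records` is a hypothetical table name.
with help as (
  -- Single row holding the number of records rated 3.0 or higher
  select count(1) as CPIRating_count
  from `my_project.my_dataset.cpi_records`
  where CPIRating >= 3.0
)
select t.*,
  -- Rank every record from the lowest CPIRating upwards
  row_number() over (order by t.CPIRating) as row_id
from `my_project.my_dataset.cpi_records` as t, help
-- Keep the 4x lowest-rated records plus every record rated 3.0 or higher
qualify row_id <= 4 * help.CPIRating_count or t.CPIRating >= 3.0
order by row_id
```

If fewer than 4x records exist below 3.0, the condition matches every row and the query simply returns the whole table.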

The column CPIRating_count can also be generated by a window function instead of a join.
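
A minimal sketch of that window-function variant, reusing the same dummy demo_tbl: countif(...) over () attaches the number of rows with CPIRating >= 3.0 to every row, so the help CTE and the join are no longer needed. This is my own rewrite of the query, so treat it as a sketch and not as the only way to do it.

```sql
-- Dummy data: 101 rows with CPIRating drawn uniformly from [0, 3.2)
with demo_tbl as (
  select *, rand() * 3.2 as CPIRating
  from unnest(generate_array(0, 1 * 100)) id
)
select *,
  -- Empty over (): the count is taken across the whole result set
  countif(CPIRating >= 3.0) over () as CPIRating_count,
  -- Rank every row from the lowest CPIRating upwards
  row_number() over (order by CPIRating) as row_id
from demo_tbl
-- Keep the 4x lowest-rated rows plus every row rated 3.0 or higher
qualify row_id <= 4 * CPIRating_count or CPIRating >= 3.0
order by row_id desc
```

This version reads demo_tbl only once, so the count and the row numbers are computed over the same set of rows.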
