Randomly select rows in BigQuery table for each ID
I have a BigQuery table consisting of multiple entries for each ID for each day. Basically, the IDs are stores with a list of products, for which 2 columns represent properties.
Store Product Date property1 property2
0 ID1 A1 202212-01 1 5
1 ID1 A1 202212-02 2 6
2 ID1 A1 202212-03 3 7
3 ID1 A1 202212-04 4 8
4 ID1 A1 202212-05 5 9
5 ID1 A1 202212-06 6 10
6 ID1 A1 202212-07 7 11
7 ID1 A1 202212-08 8 12
8 ID1 A1 202212-09 9 13
9 ID1 A1 202212-10 10 14
10 ID1 A2 202212-01 11 15
11 ID1 A2 202212-02 12 16
12 ID1 A2 202212-03 13 17
13 ID1 A2 202212-04 14 18
14 ID1 A2 202212-05 15 19
15 ID1 A2 202212-06 16 20
16 ID1 A2 202212-07 17 21
17 ID1 A2 202212-08 18 22
18 ID1 A2 202212-09 19 23
19 ID1 A2 202212-10 20 24
20 ID2 B1 202212-01 21 25
21 ID2 B1 202212-02 22 26
22 ID2 B1 202212-03 23 27
23 ID2 B1 202212-04 24 28
24 ID2 B1 202212-05 25 29
25 ID2 B1 202212-06 26 30
26 ID2 B1 202212-07 27 31
27 ID2 B1 202212-08 28 32
28 ID2 B1 202212-09 29 33
29 ID2 B1 202212-10 30 34
30 ID2 B2 202212-01 31 35
31 ID2 B2 202212-02 32 36
32 ID2 B2 202212-03 33 37
33 ID2 B2 202212-04 34 38
34 ID2 B2 202212-05 35 39
35 ID2 B2 202212-06 36 40
36 ID2 B2 202212-07 37 41
37 ID2 B2 202212-08 38 42
38 ID2 B2 202212-09 39 43
39 ID2 B2 202212-10 40 44
Now, the real table consists of more than a billion rows, so I want to take a random sample of products for the last day of entry, but it needs to cover ALL stores.
I tried the following approach:
Since I want the last date of entry, I use a with clause to limit the rows to the last date ( max(DATE(product_timestamp)) ) and list all the stores with another with clause on stores. I then take the random sample:
from google.cloud import bigquery

# Assumes default credentials and project are configured in the environment.
bigquery_client = bigquery.Client()

query_random_sample = """
with maxdate as (
    select max(DATE(product_timestamp)) as maxdate
    from `MyProject.DataSet1.product_timeline`
),
stores as (
    select store from `MyProject.DataSet1.stores`
)
select t.*,
    t2.ProductDescription,
    t2.ProductName,
    t2.CreatedDate,
from (`MyProject.DataSet1.product_timeline` as t
    join `MyProject.DataSet2.LableStore` as t2
        on t.store = t2.store
        and t.barcode = t2.barcode
    join maxdate
        on maxdate.maxdate = DATE(t.product_timestamp)
)
join stores
    on stores.store = t.store
where rand() < 0.01
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[]
)
sampled_labels = bigquery_client.query(query_random_sample, job_config=job_config).to_dataframe()
The problem is that it samples over stores as well, but I want the sample to be taken over products within each store.
I work in Python, and an alternative would be to run the query once per store, but the cost of doing that would be huge (over 1200 stores).
How can I solve this in a cost-efficient way?
If I'm right to assume you want a random sample specific to each store, then I think your best bet is to do the random selection with a window function, using a window partitioned by Store:
SELECT
  Store,
  Product,
  Date,
  property1,
  property2,
FROM
  `MyProject.DataSet1.product_timeline`
QUALIFY
  PERCENT_RANK() OVER(all_stores_rand) < 0.01
WINDOW
  all_stores_rand AS (
    PARTITION BY Store
    ORDER BY RAND()
  )
To explain that, we are partitioning the table into one group per value of Store (analogous to what we'd do for a GROUP BY), then calculating PERCENT_RANK over a set of random numbers separately for each store (generating these numbers using RAND()).
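As a rough illustration of what that computes, here is a small pure-Python sketch (not BigQuery code) that mimics the partition, the random ordering, and the percent rank on toy data. The function name percent_rank_by_store and the toy rows are made up for this example.

```python
import random
from collections import defaultdict


def percent_rank_by_store(rows, seed=0):
    """Mimic PERCENT_RANK() OVER (PARTITION BY Store ORDER BY RAND())."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable

    # PARTITION BY Store: one bucket of rows per store.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["Store"]].append(row)

    ranked = []
    for store_rows in partitions.values():
        # ORDER BY RAND(): put this store's rows in a random order.
        shuffled = list(store_rows)
        rng.shuffle(shuffled)
        n = len(shuffled)
        for position, row in enumerate(shuffled):
            # PERCENT_RANK is (rank - 1) / (partition size - 1),
            # and 0 when the partition holds a single row.
            pct = position / (n - 1) if n > 1 else 0.0
            ranked.append({**row, "percent_rank": pct})
    return ranked


# Toy data: store ID1 has 5 rows, store ID2 has 3 rows.
rows = [
    {"Store": store, "Product": f"{store}-P{i}"}
    for store, count in [("ID1", 5), ("ID2", 3)]
    for i in range(count)
]
ranked = percent_rank_by_store(rows)
for store in ("ID1", "ID2"):
    print(store, sorted(r["percent_rank"] for r in ranked if r["Store"] == store))
```

Whatever the shuffle produces, the sorted ranks for ID1 are 0.0, 0.25, 0.5, 0.75, 1.0 and for ID2 are 0.0, 0.5, 1.0. Only which product lands on which rank is random.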
Since the part of the table corresponding to each Store must then yield a set of values evenly spanning 0 to 1, we can put this into a QUALIFY clause (BigQuery's filter clause for window expressions) to grab just 1% of the rows for each Store.
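The query above samples from the whole table. If the sample should also be limited to the last day of entry, as the question asks, one possible way to combine the two is sketched below. It assumes the column names store and product_timestamp from the question's query; build_sample_query is a hypothetical helper, and the SQL has not been run against the real table.

```python
def build_sample_query(table: str, fraction: float) -> str:
    """Build a per-store random-sample query restricted to the latest date.

    `table` and the column names `store` / `product_timestamp` are assumptions
    taken from the question; adjust them to the real schema.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return f"""
SELECT t.*
FROM `{table}` AS t
-- Keep only rows from the most recent date in the table.
WHERE DATE(t.product_timestamp) = (
  SELECT MAX(DATE(product_timestamp)) FROM `{table}`
)
-- Within each store, rank the remaining rows in random order and keep the first slice.
QUALIFY PERCENT_RANK() OVER (PARTITION BY t.store ORDER BY RAND()) < {fraction}
"""


query_random_sample = build_sample_query("MyProject.DataSet1.product_timeline", 0.01)
print(query_random_sample)
```

The resulting string can be passed to bigquery_client.query(...) in the same way as query_random_sample in the question. Because the WHERE filter is applied before the window function, the ranking only covers last-day rows, and since PERCENT_RANK is 0 for the first row of every partition, each store with at least one row on that day keeps at least one row.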