Randomly select rows in BigQuery table for each ID

I have a BigQuery table with multiple entries per ID per day. The IDs are stores, each with a list of products, and two columns represent properties of each product.

   Store Product       Date  property1  property2
0    ID1      A1  202212-01          1          5
1    ID1      A1  202212-02          2          6
2    ID1      A1  202212-03          3          7
3    ID1      A1  202212-04          4          8
4    ID1      A1  202212-05          5          9
5    ID1      A1  202212-06          6         10
6    ID1      A1  202212-07          7         11
7    ID1      A1  202212-08          8         12
8    ID1      A1  202212-09          9         13
9    ID1      A1  202212-10         10         14
10   ID1      A2  202212-01         11         15
11   ID1      A2  202212-02         12         16
12   ID1      A2  202212-03         13         17
13   ID1      A2  202212-04         14         18
14   ID1      A2  202212-05         15         19
15   ID1      A2  202212-06         16         20
16   ID1      A2  202212-07         17         21
17   ID1      A2  202212-08         18         22
18   ID1      A2  202212-09         19         23
19   ID1      A2  202212-10         20         24
20   ID2      B1  202212-01         21         25
21   ID2      B1  202212-02         22         26
22   ID2      B1  202212-03         23         27
23   ID2      B1  202212-04         24         28
24   ID2      B1  202212-05         25         29
25   ID2      B1  202212-06         26         30
26   ID2      B1  202212-07         27         31
27   ID2      B1  202212-08         28         32
28   ID2      B1  202212-09         29         33
29   ID2      B1  202212-10         30         34
30   ID2      B2  202212-01         31         35
31   ID2      B2  202212-02         32         36
32   ID2      B2  202212-03         33         37
33   ID2      B2  202212-04         34         38
34   ID2      B2  202212-05         35         39
35   ID2      B2  202212-06         36         40
36   ID2      B2  202212-07         37         41
37   ID2      B2  202212-08         38         42
38   ID2      B2  202212-09         39         43
39   ID2      B2  202212-10         40         44

The real table has more than a billion rows, so I want to take a random sample of products for the last day of entry, and the sample needs to cover ALL stores.

I tried the following approach:

Since I want the last date of entry, I use a with clause to restrict to the last date ( max(DATE(product_timestamp)) ) and list all the stores with another with clause on stores. I then take the random sample:

from google.cloud import bigquery

# Assumes default application credentials are configured for the client.
bigquery_client = bigquery.Client()

query_random_sample = """
with maxdate as (
    select max(DATE(product_timestamp)) as maxdate
    from `MyProject.DataSet1.product_timeline`
),
stores as (
    select store from `MyProject.DataSet1.stores`
)
select
    t.*,
    t2.ProductDescription,
    t2.ProductName,
    t2.CreatedDate
from `MyProject.DataSet1.product_timeline` as t
join `MyProject.DataSet2.LableStore` as t2
    on t.store = t2.store
    and t.barcode = t2.barcode
join maxdate
    on maxdate.maxdate = DATE(t.product_timestamp)
join stores
    on stores.store = t.store
-- rand() is evaluated per joined row, so this keeps about 1% of all rows overall
where rand() < 0.01
"""

sampled_labels = bigquery_client.query(query_random_sample).to_dataframe()

The problem is that this also samples across store, but I want the sample to be taken over product within each store.

I work in Python, and an alternative would be to run the query once per store, but the cost would be huge (over 1200 stores).

How can I solve this in a cost-efficient way?

If I'm right to assume you want a random sample specific to each store, then I think your best bet is to do the random selection with a window function, using a window partitioned by Store:

    SELECT
        Store,
        Product,
        Date,
        property1,
        property2
    FROM
        `MyProject.DataSet1.product_timeline`
    QUALIFY
        PERCENT_RANK() OVER (all_stores_rand) < 0.01
    WINDOW
        all_stores_rand AS (
            PARTITION BY Store
            ORDER BY RAND()
        )

To explain: we partition the table into one group per value of Store (analogous to what we'd do for a GROUP BY), then calculate PERCENT_RANK over a set of random numbers separately for each store (generating those numbers with RAND()).
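
To make the per-store ranking concrete, here is a minimal pure-Python sketch that mimics the idea locally. It is not BigQuery code: the helper name percent_rank_sample and the toy rows are made up for illustration, and shuffling each group stands in for ORDER BY RAND().

```python
import random
from collections import defaultdict


def percent_rank_sample(rows, key, fraction, seed=None):
    """Keep rows whose per-group PERCENT_RANK over a random order is < fraction."""
    rng = random.Random(seed)

    # Partition the rows by the grouping key (the PARTITION BY step).
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sampled = []
    for group_rows in groups.values():
        # A random shuffle plays the role of ORDER BY RAND() within the partition.
        shuffled = group_rows[:]
        rng.shuffle(shuffled)
        n = len(shuffled)
        for rank, row in enumerate(shuffled, start=1):
            # PERCENT_RANK is (rank - 1) / (n - 1), and 0 for a single-row partition.
            pct = 0.0 if n == 1 else (rank - 1) / (n - 1)
            if pct < fraction:
                sampled.append(row)
    return sampled


# Hypothetical toy data: 3 stores with 1000 products each.
rows = [
    {"Store": f"ID{s}", "Product": f"P{p}"}
    for s in range(1, 4)
    for p in range(1000)
]
sample = percent_rank_sample(rows, key="Store", fraction=0.01, seed=0)

counts = defaultdict(int)
for row in sample:
    counts[row["Store"]] += 1
print(dict(counts))  # {'ID1': 10, 'ID2': 10, 'ID3': 10}
```

Because the first row of every partition has a PERCENT_RANK of 0, each store contributes at least one row, however small the store is.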

Since the part of the table corresponding to each Store then yields a set of values evenly spanning 0 to 1, we can put this into a QUALIFY (BigQuery's filter clause for window expressions) to grab just 1% of the rows for each Store.
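
The query above does not restrict rows to the last date, which the question also asks for. A possible way to combine the two from Python is sketched below. It assumes the column names used in the question's query (store, product_timestamp); the function names build_per_store_sample_query and run_per_store_sample and the @fraction parameter are hypothetical, and the sketch has not been run against the real table.

```python
def build_per_store_sample_query(table: str) -> str:
    """Build a query that samples a fraction of the last day's rows per store.

    Assumes `table` has `store` and `product_timestamp` columns, as in the
    question's own query.
    """
    return f"""
    SELECT t.*
    FROM `{table}` AS t
    WHERE DATE(t.product_timestamp) = (
        SELECT MAX(DATE(product_timestamp)) FROM `{table}`
    )
    QUALIFY PERCENT_RANK() OVER (PARTITION BY t.store ORDER BY RAND()) < @fraction
    """


def run_per_store_sample(table: str, fraction: float = 0.01):
    """Run the sampling query once for all stores and return a DataFrame."""
    # Imported here so that building the SQL string needs no extra dependency.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default credentials are configured
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("fraction", "FLOAT64", fraction),
        ]
    )
    sql = build_per_store_sample_query(table)
    return client.query(sql, job_config=job_config).to_dataframe()


sql_text = build_per_store_sample_query("MyProject.DataSet1.product_timeline")
```

Calling run_per_store_sample("MyProject.DataSet1.product_timeline") would issue a single query covering every store, rather than one query per store. The join to the label table from the question could be added on top of the sampled rows.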
