
How to select all records for only the first 50 distinct values in a column

I am trying to create a classifier model for a dataset, but I have too many distinct values for my target variable. If I run something like this:

Create or replace model `model_name`
options (model_type="AUTOML_CLASSIFIER", input_label_cols=["ORIGIN_AIRPORT"]) as
select DAY_OF_WEEK, ARRIVAL_TIME, ARRIVAL_DELAY, ORIGIN_AIRPORT
from `table_name`
limit 1000

I end up getting

Error running query
Classification model currently only supports classification with up to 50 unique labels and the label column had 111 unique labels.

So how can I select, for example, all rows that have one of the first 50 values of ORIGIN_AIRPORT?

Given a table of values (val) with unique identifiers (id), find the minimum id (mid) for each distinct val.

Then return all rows whose val is among the first 3 vals, densely ranked by that minimum id (mid).

The test data:

+------+----+
| val  | id |
+------+----+
|    1 |  1 |
|    1 |  2 |
|    8 |  3 |
|    8 |  4 |
|    8 |  5 |
|    7 |  6 |
|    7 |  7 |
|    6 |  8 |
|    5 |  9 |
|    4 | 10 |
|    3 | 11 |
|    3 | 12 |
|    7 | 13 |
|    7 | 14 |
|    1 | 15 |
|    8 | 16 |
|    3 | 17 |
|    1 | 18 |
+------+----+

The solution:

WITH min_ids (val, id, mid) AS (
        SELECT val
             , id
             , MIN(id) OVER (PARTITION BY val) AS mid   -- min id per val
          FROM vals
     )
   , ranks (val, id, mid, r) AS (
        SELECT val
             , id
             , mid
             , DENSE_RANK()  OVER (ORDER BY mid)      AS r   -- Densely ranked minimum ids
          FROM min_ids
     )
SELECT *
  FROM ranks
 WHERE r <= 3    -- Return rows matching r <= 3  (vals = 1, 8, and 7)
 ORDER BY r, id
;

The final result:

+------+----+------+---+
| val  | id | mid  | r |
+------+----+------+---+
|    1 |  1 |    1 | 1 |
|    1 |  2 |    1 | 1 |
|    1 | 15 |    1 | 1 |
|    1 | 18 |    1 | 1 |
|    8 |  3 |    3 | 2 |
|    8 |  4 |    3 | 2 |
|    8 |  5 |    3 | 2 |
|    8 | 16 |    3 | 2 |
|    7 |  6 |    6 | 3 |
|    7 |  7 |    6 | 3 |
|    7 | 13 |    6 | 3 |
|    7 | 14 |    6 | 3 |
+------+----+------+---+

There are a number of solutions. We could instead have obtained a distinct list of vals, limited the number of rows returned with LIMIT, and then joined the table with that list of vals.
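The join-based alternative could be sketched as follows. This is only an illustration under stated assumptions: it runs the query in an in-memory SQLite database through Python's sqlite3 module rather than in BigQuery, the table name vals and the test rows come from above, and the helper name first_vals_rows and the alias f are made up for the example.

```python
import sqlite3

# Test data from the answer above: (val, id) pairs.
ROWS = [
    (1, 1), (1, 2), (8, 3), (8, 4), (8, 5), (7, 6),
    (7, 7), (6, 8), (5, 9), (4, 10), (3, 11), (3, 12),
    (7, 13), (7, 14), (1, 15), (8, 16), (3, 17), (1, 18),
]

# Alternative to the DENSE_RANK query: build the list of the first 3 vals
# (ordered by their minimum id) with GROUP BY + LIMIT, then join back.
JOIN_QUERY = """
SELECT v.val, v.id, f.mid
  FROM vals AS v
  JOIN (
        SELECT val, MIN(id) AS mid   -- min id per val
          FROM vals
         GROUP BY val
         ORDER BY mid
         LIMIT 3                     -- keep only the first 3 vals
       ) AS f
    ON f.val = v.val
 ORDER BY f.mid, v.id
"""


def first_vals_rows():
    """Hypothetical helper: load the test data and run the join query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vals (val INTEGER, id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO vals (val, id) VALUES (?, ?)", ROWS)
    rows = conn.execute(JOIN_QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in first_vals_rows():
        print(row)
```

Because id is unique, no two vals share the same minimum id, so LIMIT 3 selects the same vals (1, 8, and 7) as r <= 3 does in the ranked query.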

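Since your original question asked for rows matching the first N values, I used that stricter logic. For your table, rank the ORIGIN_AIRPORT values instead of val, use whichever column defines "first" in place of id, and filter on r <= 50.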

