简体   繁体   中英

How do I count the frequency of items for each column with bigquery?

I have a column from a Google Bigquery Table which has certain results (na, pass, warn, and error) based on criteria (c1, c2, c3, ...), for example.

 data    c1      c2      c3      ...
------- ------- ------- ------- -------
 data1   pass    na      warn    ...
 data2   pass    na      pass    ...
 data3   error   pass    error   ...

Now I want to get a histogram for each column of the input table, and I want to create the summarized table as follows.

 criteria   na      pass    warn    error
---------- ------- ------- ------- -------
 c1         0       2       0       1
 c2         2       1       0       0
 c3         0       1       1       1
 ...        ...     ...     ...     ...

Actually, there are many criteria, so I want to create the summarized table without enumerating the criteria. How can I get the output table?

You may consider below,

CREATE TEMP FUNCTION json_keys(input STRING) RETURNS Array<String> 
LANGUAGE js AS """
  return Object.keys(JSON.parse(input));
""";

CREATE TEMP FUNCTION json_values(input STRING) RETURNS Array<String> 
LANGUAGE js AS """
  return Object.values(JSON.parse(input));
""";

SELECT * FROM (
  SELECT criteria, result
    FROM (SELECT * EXCEPT(data) FROM sample_table) t,
  UNNEST (json_keys(TO_JSON_STRING(t))) criteria WITH OFFSET JOIN
  UNNEST (json_values(TO_JSON_STRING(t))) result WITH OFFSET USING (offset)
) PIVOT (COUNT(result) FOR result IN ('na', 'pass', 'warn', 'error'));

+----------+----+------+------+-------+
| criteria | na | pass | warn | error |
+----------+----+------+------+-------+
| c1       |  0 |    2 |    0 |     1 |
| c2       |  2 |    1 |    0 |     0 |
| c3       |  0 |    1 |    1 |     1 |
+----------+----+------+------+-------+

You can unpivot then pivot again to get the table you want as such:

input as (
  select "data1" as data, "pass" as c1,"warn" as c2,"na" as c3, "pass" as c4 union all 
  select "data2" as data, "na" as c1,"pass" as c2,"pass" as c3, "error" as c4 union all 
  select "data3" as data, "warn" as c1,"pass" as c2,"na" as c3, "warn" as c4 union all 
  select "data4" as data, "pass" as c1,"error" as c2,"pass" as c3, "error" as c4 union all 
  select "data5" as data, "pass" as c1,"na" as c2,"na" as c3, "error" as c4 union all 
  select "data6" as data, "pass" as c1,"warn" as c2,"na" as c3, "na" as c4 
),

step_unpivot as (
select * from input unpivot(results for criterias in (c1,c2,c3,c4))
)

select *  from step_unpivot pivot(count(data) for results in ('pass', 'warn','na','error'))

And since you want dynamic column selection for the unpivot(), you can probably use the answer to this question for that part: Dynamic UnPivot in Bigquery

Edit: Below a version with the dynamic unpivot


DECLARE myunpivot STRING;
SET myunpivot = (
  SELECT CONCAT('(', STRING_AGG( column_name, ','), ')'),
From(
SELECT column_name FROM dataset_name.INFORMATION_SCHEMA.COLUMNS
where table_name ="unpivot_pivot" 
and column_name not in("data")));

EXECUTE IMMEDIATE format("""
with
step_unpivot as (
select * from dataset_name.table_name
unpivot(results for criterias in %s)
)

select *  from step_unpivot pivot(count(data) for results in ('pass', 'warn','na','error'))
""", myunpivot);


The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM