BigQuery parallel processing based on column value
I have a table student_record with two columns.
I need to do analysis on each subject separately, and there are over 100 subjects. Right now I just loop through all the subjects s, save the output of the following query to a dataframe, and do the analysis:
SELECT studentId, res.subject, res.score
FROM student_record, UNNEST(result) res
WHERE res.subject = s
This query could take a long time to finish (100+ subjects, 100 million students), and it needs to be run once for each subject.
I am wondering if there is a better way to perform such a task with parallel processing in BQ (e.g. run a single query and save the results into local files indexed by subject?).
This query is very straightforward and should be pretty quick. If you are writing millions of rows to a dataframe, that is probably your bottleneck. I would consider one of the following approaches:
Do the analysis in BigQuery itself, so that only one summary row per subject comes back instead of every raw row:
with data as (
  select studentId, res.subject, res.score
  from student_record, unnest(result) res
)
select
  subject,
  count(distinct studentId) as student_count,
  avg(score) as avg_score,
  max(score) as max_score,
  min(score) as min_score,
  variance(score) as var_score,
  stddev(score) as std_dev_score,
  -- etc.
from data
group by subject;
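To sanity-check the aggregate query on a small sample, the same per-subject statistics can be reproduced locally. The following is only a sketch in plain Python: `summarize_by_subject` and `sample_rows` are hypothetical names made up for illustration, and the rows are assumed to have the unnested shape (studentId, subject, score). In BigQuery, `variance` and `stddev` are the sample versions (aliases of `VAR_SAMP` and `STDDEV_SAMP`), which is what `statistics.variance` and `statistics.stdev` compute as well.

```python
from collections import defaultdict
from statistics import mean, stdev, variance


def summarize_by_subject(rows):
    """Group (student_id, subject, score) tuples by subject and summarize them."""
    scores = defaultdict(list)
    students = defaultdict(set)
    for student_id, subject, score in rows:
        scores[subject].append(score)
        students[subject].add(student_id)

    summary = {}
    for subject, values in scores.items():
        enough = len(values) > 1  # sample variance needs at least two values
        summary[subject] = {
            "student_count": len(students[subject]),
            "avg_score": mean(values),
            "max_score": max(values),
            "min_score": min(values),
            "var_score": variance(values) if enough else None,
            "std_dev_score": stdev(values) if enough else None,
        }
    return summary


# Made-up sample data, only for illustration.
sample_rows = [
    ("s1", "math", 80),
    ("s2", "math", 90),
    ("s3", "math", 100),
    ("s1", "art", 70),
]
summary = summarize_by_subject(sample_rows)
print(summary["math"])
```

This is only practical for a small sample; for 100 million students the aggregation should stay in BigQuery.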
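Create a table clustered by subject, so that a query filtering on one subject can skip the blocks that do not contain it instead of unnesting the whole table every time:
create table dataset.student_record_clustered_by_subject
(
  studentId string, -- or int, depending on the makeup of your column
  subject string,
  score int -- or decimal if you have decimal places
)
cluster by subject
as (
  select studentId, res.subject, res.score
  from student_record, unnest(result) res
);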
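If the per-subject rows really are needed in dataframes, the loop over subjects can be run concurrently against the clustered table. The following is a rough sketch, not a definitive implementation. It assumes the google-cloud-bigquery client library (plus pandas for `to_dataframe`) and the table name from the statement above. `build_subject_query`, `fetch_subject` and `analyze_all` are hypothetical helper names. Threads are used because each worker mostly waits on BigQuery.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: the clustered table created by the statement above.
CLUSTERED_TABLE = "dataset.student_record_clustered_by_subject"


def build_subject_query(table=CLUSTERED_TABLE):
    # @subject is a named query parameter that is bound separately for each call.
    return (
        f"SELECT studentId, subject, score FROM `{table}` "
        "WHERE subject = @subject"
    )


def fetch_subject(client, subject, table=CLUSTERED_TABLE):
    """Run one per-subject query and return its rows as a dataframe."""
    from google.cloud import bigquery  # imported lazily; only needed here

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("subject", "STRING", subject)
        ]
    )
    job = client.query(build_subject_query(table), job_config=job_config)
    return job.to_dataframe()


def analyze_all(subjects, fetch, analyze, max_workers=8):
    """Fetch and analyze every subject in a thread pool; returns {subject: result}."""
    def work(subject):
        return subject, analyze(fetch(subject))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(work, subjects))
```

With a real client this would be called as something like `analyze_all(subjects, lambda s: fetch_subject(client, s), my_analysis)`, where `client` is a `bigquery.Client()` and `my_analysis` is whatever function processes one subject's dataframe. Passing `fetch` in as an argument keeps the threading logic testable without a BigQuery connection.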