BigQuery parallel processing based on column value

Question

I have a table student_record with two columns:

studentId: int
result: array of tuple (int, int): (subjectId, score)

I need to do analysis on each subject separately, and there are over 100 subjects. Right now I just loop through all the subjects and, for each subject s, save the output of the following query to a dataframe and do the analysis:

SELECT studentId, res.subject, res.score 
FROM student_record, UNNEST(result) res 
WHERE res.subject = s

This query could take a long time to finish (100+ subjects, 100 million students), and it needs to be run once for each subject.

I am wondering if there is a better way to perform such a task with parallel processing in BQ (e.g. run a single query and save the results into local files indexed by subject?).

Answer 1

This query is very straightforward and should be pretty quick. If you are writing millions of rows to a dataframe, that is probably your bottleneck. I would consider one of the following approaches:

Try to do your analysis in BQ rather than in a script. Whether this works depends on the analysis you are doing, but BQ has basic statistical functions.

with data as (
      select studentId, res.subject, res.score 
      from student_record, unnest(result) res 
)
select
    subject,
    count(distinct studentID) as student_count,
    avg(score) as avg_score,
    max(score) as max_score,
    min(score) as min_score,
    variance(score) as var_score,
    stddev(score) as std_dev_score,
    --- etc etc
from data
group by subject
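To sanity-check the aggregate query on a small sample before running it over the full table, the same per-subject numbers can be reproduced locally. The sketch below is only an illustration: the function name `summarize_by_subject` and the sample rows are made up, and it assumes that BigQuery's `variance` and `stddev` are the sample versions (aliases of `VAR_SAMP` and `STDDEV_SAMP`), which yield NULL when a subject has fewer than two scores.

```python
from collections import defaultdict
from statistics import mean, stdev, variance


def summarize_by_subject(rows):
    """Aggregate (student_id, subject, score) rows per subject.

    The rows have the shape produced by the unnested `data` CTE.
    Hypothetical helper for local cross-checks only.
    """
    scores = defaultdict(list)
    students = defaultdict(set)
    for student_id, subject, score in rows:
        scores[subject].append(score)
        students[subject].add(student_id)

    summary = {}
    for subject, values in scores.items():
        # Sample variance needs at least two values; mirror SQL NULL with None.
        enough = len(values) > 1
        summary[subject] = {
            "student_count": len(students[subject]),
            "avg_score": mean(values),
            "max_score": max(values),
            "min_score": min(values),
            "var_score": variance(values) if enough else None,
            "std_dev_score": stdev(values) if enough else None,
        }
    return summary


# Made-up sample data, not taken from the original table.
sample_rows = [(1, "math", 80), (2, "math", 90), (3, "math", 100), (1, "art", 70)]
summary = summarize_by_subject(sample_rows)
```

If the local numbers and the BigQuery output disagree on the same sample, the difference between sample and population variance is the first thing to check.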

If you do need to write every studentID and score to a dataframe for each subject, I suggest materializing your query to a table and clustering it by subject. Your subsequent queries (when filtered by subject) will be more efficient (and cheaper).

create table dataset.student_record_clustered_by_subject
(  
  studentId string, -- or int depending on makeup of your column
  subject string, 
  score int -- or decimal if you have decimal places
)
cluster by subject
as (
      select studentId, res.subject, res.score 
      from student_record, unnest(result) res 
);
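If every row has to end up on the local machine anyway, the idea from the question (one query, results saved to local files indexed by subject) can be sketched on the client side. This is a minimal sketch under assumptions, not a definitive implementation: `write_rows_by_subject` and the `subject_<value>.csv` naming are hypothetical, the rows are assumed to arrive as (studentId, subject, score) tuples, and subject values are assumed to be safe to use in file names.

```python
import csv
from pathlib import Path


def write_rows_by_subject(rows, out_dir):
    """Split (student_id, subject, score) rows into one CSV file per subject.

    Rows are consumed as a stream, so the full result never has to fit in
    memory. Returns a dict mapping each subject to its number of rows.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    handles = {}
    writers = {}
    counts = {}
    try:
        for student_id, subject, score in rows:
            if subject not in writers:
                # First row for this subject: open its file and write a header.
                handle = open(out_path / f"subject_{subject}.csv", "w", newline="")
                handles[subject] = handle
                writers[subject] = csv.writer(handle)
                writers[subject].writerow(["studentId", "score"])
                counts[subject] = 0
            writers[subject].writerow([student_id, score])
            counts[subject] += 1
    finally:
        # One handle per subject stays open until the stream is exhausted.
        for handle in handles.values():
            handle.close()
    return counts
```

With the google-cloud-bigquery client, the rows could be fed in as a generator such as `((r["studentId"], r["subject"], r["score"]) for r in client.query(sql).result())`. Keeping one open file per subject is fine for roughly 100 subjects, but it can hit the operating system's open-file limit if the number of subjects grows much larger.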


Answer 1 (accepted, 2022-12-14 02:19:28)
