
BigQuery parallel processing based on column value

I have a table student_record with two columns:

  1. studentId: int
  2. result: array of tuple (int, int): (subjectId, score)

I need to analyze each subject separately, and there are over 100 subjects. Right now I loop through all the subjects and, for each subject s, save the output of the following query to a dataframe and run the analysis:

SELECT studentId, res.subject, res.score 
FROM student_record, UNNEST(result) res 
WHERE res.subject = s

This query can take a long time to finish (100+ subjects, 100 million students), and it has to be run once per subject.

I am wondering if there is a better way to perform such a task with parallel processing in BQ (e.g. run a single query and save the results into local files indexed by subject?).

This query is very straightforward and should be pretty quick. If you are writing millions of rows to a dataframe, that is probably your bottleneck. I would consider one of the following approaches:

  1. Try to do your analysis in BQ rather than in a script. This depends on the analysis you are doing, but BQ has basic statistical functions:
with data as (
      select studentId, res.subject, res.score 
      from student_record, unnest(result) res 
)
select
    subject,
    count(distinct studentID) as student_count,
    avg(score) as avg_score,
    max(score) as max_score,
    min(score) as min_score,
    variance(score) as var_score,
    stddev(score) as std_dev_score,
    --- etc etc
from data
group by subject
  2. If you do need to write every studentID and score to a dataframe for each subject, I suggest materializing your query to a table clustered by subject. Your subsequent queries (when filtered by subject) will be more efficient (and cheaper).
create table dataset.student_record_clustered_by_subject
(  
  studentId string, -- or int depending on makeup of your column
  subject string, 
  score int -- or decimal if you have decimal places
)
cluster by subject
as (
      select studentId, res.subject, res.score 
      from student_record, unnest(result) res 
);
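
The question also asks about running a single query and saving the results into local files indexed by subject. On the client side, that could look roughly like the sketch below. It is only an illustration under assumptions: `split_rows_by_subject` and the `subject_<id>.csv` naming are made up for this example, and `rows` is taken to be any iterable of (studentId, subject, score) tuples. How you stream those rows out of BigQuery is left open here.

```python
import csv
import tempfile
from pathlib import Path


def split_rows_by_subject(rows, out_dir):
    """Write (studentId, subject, score) rows to one CSV file per subject.

    Makes a single pass over `rows`, so the source query only has to run once.
    Returns a dict mapping each subject to the number of rows written for it.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    handles = {}  # subject -> open file handle
    writers = {}  # subject -> csv writer bound to that handle
    counts = {}   # subject -> number of data rows written
    try:
        for student_id, subject, score in rows:
            if subject not in writers:
                # Hypothetical naming scheme: subject_<id>.csv
                fh = open(out_path / f"subject_{subject}.csv", "w", newline="")
                handles[subject] = fh
                writers[subject] = csv.writer(fh)
                writers[subject].writerow(["studentId", "subject", "score"])
                counts[subject] = 0
            writers[subject].writerow([student_id, subject, score])
            counts[subject] += 1
    finally:
        # Close every file even if iteration fails part-way through.
        for fh in handles.values():
            fh.close()
    return counts


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the unnested query result.
    sample = [(1, 10, 88), (1, 11, 72), (2, 10, 95)]
    with tempfile.TemporaryDirectory() as tmp:
        print(split_rows_by_subject(sample, tmp))
```

This keeps one file open per subject, which should be fine for roughly 100 subjects but would need rethinking for many thousands. Each per-subject file can then be loaded and analyzed independently, including in parallel worker processes.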
