
BigQuery parallel processing based on column value

I have a table student_record with two columns

  1. studentId: int
  2. result: array of tuple (int, int): (subjectId, score)

I need to analyze each subject separately, and there are over 100 subjects. Right now I loop over every subject s, save the output of the following query to a dataframe, and run the analysis on it:

SELECT studentId, res.subject, res.score 
FROM student_record, UNNEST(result) res 
WHERE res.subject = s

This query can take a long time to finish (100+ subjects, 100 million students), and it has to be run once per subject.

I am wondering if there is a better way to perform such a task with parallel processing in BQ (e.g., run a single query and save the results into local files indexed by subject).

This query is very straightforward and should be pretty quick. If you are writing millions of rows to a dataframe, that is probably your bottleneck. I would consider one of the following approaches:

  1. Try to do your analysis in BQ rather than in a script. Whether this works depends on the analysis you are doing, but BQ has basic statistical functions:
with data as (
      select studentId, res.subject, res.score 
      from student_record, unnest(result) res 
)
select
    subject,
    count(distinct studentID) as student_count,
    avg(score) as avg_score,
    max(score) as max_score,
    min(score) as min_score,
    variance(score) as var_score,
    stddev(score) as std_dev_score,
    --- etc etc
from data
group by subject
  2. If you do need to write every studentId and score to a dataframe for each subject, I suggest materializing your query to a table clustered by subject. Your subsequent queries, when filtered by subject, will be more efficient and cheaper:
create table dataset.student_record_clustered_by_subject
(
  studentId int64, -- int in the question; use string if your IDs are strings
  subject int64,   -- the subject ID is an int in the question
  score int64      -- or numeric if scores have decimal places
)
cluster by subject
as (
      select studentId, res.subject, res.score 
      from student_record, unnest(result) res 
);
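
If you still want local files indexed by subject, as the question suggests, one option is to read the clustered table once and split the rows on the client. The following is only a minimal Python sketch under some assumptions: split_rows_by_subject and fetch_rows are hypothetical helper names, the subject_<id>.csv file naming is made up for illustration, and fetch_rows assumes the google-cloud-bigquery package and working credentials. Streaming every row through one local process will still be slow at this scale, so treat it as an illustration of the idea rather than a tuned solution.

```python
import csv
from contextlib import ExitStack
from pathlib import Path
from typing import Dict, Iterable, Tuple

Row = Tuple[int, int, int]  # (studentId, subject, score)


def split_rows_by_subject(rows: Iterable[Row], out_dir: str) -> Dict[int, int]:
    """Append each row to <out_dir>/subject_<subject>.csv.

    Returns the number of rows written per subject.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    writers = {}
    counts: Dict[int, int] = {}
    # ExitStack closes every per-subject file when the block exits.
    with ExitStack() as stack:
        for student_id, subject, score in rows:
            if subject not in writers:
                # First row for this subject: open its file and write a header.
                f = stack.enter_context(
                    open(out / f"subject_{subject}.csv", "w", newline="")
                )
                writers[subject] = csv.writer(f)
                writers[subject].writerow(["studentId", "subject", "score"])
                counts[subject] = 0
            writers[subject].writerow([student_id, subject, score])
            counts[subject] += 1
    return counts


def fetch_rows(table: str):
    """Hypothetical helper: yield (studentId, subject, score) from one query.

    Assumes google-cloud-bigquery is installed and credentials are configured.
    """
    # Imported lazily so the splitting logic works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = f"select studentId, subject, score from `{table}`"
    for row in client.query(sql).result():
        yield row["studentId"], row["subject"], row["score"]
```

It could be used as split_rows_by_subject(fetch_rows("dataset.student_record_clustered_by_subject"), "out"). One file stays open per subject, which is fine for roughly 100 subjects but would need rethinking for many thousands.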


 