
Best way to get BigQuery temp table created by Job to read large data faster

I am trying to execute a query over a table in BigQuery using its Java client libraries. I create a Job and then get the result of the Job using the job.getQueryResults().iterateAll() method.

This works, but for large results (around 600k rows) it takes about 80-120 seconds. I see that BigQuery fetches the data in batches of 40-45k rows, and each batch takes around 5-7 seconds.

I want to get the results faster. I found on the internet that if we can get the temporary table created by BigQuery for the Job, and read the data from that table in Avro or some other format, it will be really fast, but in the BigQuery API (version 1.124.7) I don't see a way to do that. Does anyone know how to do it in Java, or how to get data faster for a large number of records? Any help is appreciated.

Code to Read Table (Takes 20 sec)

import com.google.cloud.RetryOption;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import org.threeten.bp.Duration;

Table table = bigQueryHelper.getBigQueryClient().getTable(TableId.of("project", "dataset", "table"));
String format = "CSV";
String gcsUrl = "gs://name/test.csv";
Job job = table.extract(format, gcsUrl);
// Wait for the extract job to complete
try {
  Job completedJob = job.waitFor(
      RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
      RetryOption.totalTimeout(Duration.ofMinutes(3)));
  if (completedJob != null && completedJob.getStatus().getError() == null) {
    // Job completed successfully
    log.info("job done");
  } else {
    // Handle error case
    log.info("job has error");
  }
} catch (InterruptedException e) {
  // Handle interrupted wait
  Thread.currentThread().interrupt();
}

Code to read the same table using a query (Takes 90 sec)

Job job = bigQueryHelper.getBigQueryClient().getJob(JobId.of(jobId));
for (FieldValueList row : job.getQueryResults().iterateAll()) {
  System.out.println(row);
}

I tried several approaches and, based on that, found the best way of doing it. I thought I would post it here to help someone in the future.

1: If we use job.getQueryResults().iterateAll() on the job, or directly on the table, it takes the same time. If we don't specify a page size, BigQuery fetches the data in batches of around 35-45k rows. So for 600k rows (~180 MB) it takes 70-100 sec.
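As a pseudocode-style sketch of what approach 1 looks like with an explicit page size (jobId and bigQueryHelper are assumed from the snippets above; in my understanding this only changes the paging granularity, not the fundamental fetch speed):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;

Job job = bigQueryHelper.getBigQueryClient().getJob(JobId.of(jobId));
// Request larger pages explicitly instead of the ~35-45k default batches
for (FieldValueList row :
    job.getQueryResults(BigQuery.QueryResultsOption.pageSize(100_000)).iterateAll()) {
  System.out.println(row);
}
```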

2: We can take the temporary (destination) table details from the created job and use the extract-job feature of the table to write the result to GCS. This is faster and takes around 30-35 sec. This approach does not download the data locally; for that we would again need to use iterateAll() on the temp table, which takes the same time as in 1.

Example pseudo code:

try {
  Job job = getBigQueryClient().getJob(JobId.of(jobId));
  long start = System.currentTimeMillis();

  Job completedJob =
      job.waitFor(
          RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
          RetryOption.totalTimeout(Duration.ofMinutes(3)));
  if (completedJob != null && completedJob.getStatus().getError() == null) {
    log.info("query job done");
    String gcsUrl = "gs://bucketname/test";
    // Get the temp (destination) table information of the query job
    TableId destinationTableInfo =
        ((QueryJobConfiguration) job.getConfiguration()).getDestinationTable();
    log.info("Total time taken in getting schema ::{}", (System.currentTimeMillis() - start));
    Table table = bigQueryHelper.getBigQueryClient().getTable(destinationTableInfo);
    System.out.println("DestinationInfo::" + destinationTableInfo);
    // Use an extract job to write the data to GCS
    Job extractJob =
        table.extract(
            CsvOptions.newBuilder().setFieldDelimiter("\t").build().toString(), gcsUrl);
    Job completedExtractJob =
        extractJob.waitFor(
            RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
            RetryOption.totalTimeout(Duration.ofMinutes(3)));
    if (completedExtractJob != null && completedExtractJob.getStatus().getError() == null) {
      log.info("extract job done");
    } else {
      log.info("extract job has error");
    }
  } else {
    log.info("query job has error");
  }
} catch (InterruptedException e) {
  Thread.currentThread().interrupt();
}

3: This is the best way, and the one I wanted. It downloads/writes the result to a local file fastest, in around 20 sec, using the BigQuery Storage Read API. This is the newer way BigQuery provides and can be checked using the link below:

https://cloud.google.com/bigquery/docs/reference/storage/libraries#client-libraries-install-java
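For completeness, here is a pseudocode-style sketch of approach 3, following the pattern in the linked client-library docs. It reads a table through the Storage Read API; the project/dataset/table strings are placeholders (you would point them at the temp destination table obtained in approach 2), and the Avro decoding step is only indicated in a comment:

```java
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;

try (BigQueryReadClient client = BigQueryReadClient.create()) {
  // Point the read session at the query job's temp (destination) table,
  // e.g. the TableId obtained via getDestinationTable() in approach 2.
  String table = String.format(
      "projects/%s/datasets/%s/tables/%s", "project", "tempDataset", "tempTable");
  ReadSession session = client.createReadSession(
      CreateReadSessionRequest.newBuilder()
          .setParent("projects/project")
          .setReadSession(ReadSession.newBuilder()
              .setTable(table)
              .setDataFormat(DataFormat.AVRO))
          .setMaxStreamCount(1) // more streams allow reading in parallel
          .build());
  // Rows come back as streamed Avro blocks instead of paged REST responses
  ReadRowsRequest request = ReadRowsRequest.newBuilder()
      .setReadStream(session.getStreams(0).getName())
      .build();
  for (ReadRowsResponse response : client.readRowsCallable().call(request)) {
    // Decode response.getAvroRows() using session.getAvroSchema()
    // (e.g. with an org.apache.avro GenericDatumReader) and write locally.
  }
}
```

This requires the separate google-cloud-bigquerystorage dependency and GCP credentials, so it is a sketch of the pattern rather than a drop-in snippet.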
