
BigQuery Storage API: Is it possible to stream / save AVRO files directly to Google Cloud Storage?

I would like to export a 90 TB BigQuery table to Google Cloud Storage. According to the documentation, the BigQuery Storage API (beta) should be the way to go, due to the export size quotas (e.g., ExtractBytesPerDay) associated with other methods.

The table is date-partitioned, with each partition occupying ~300 GB. I have a Python AI Notebook running on GCP, which runs the partitions (in parallel) through this script, adapted from the docs.

from google.cloud import bigquery_storage_v1

client = bigquery_storage_v1.BigQueryReadClient()

table = "projects/{}/datasets/{}/tables/{}".format(
    "bigquery-public-data", "usa_names", "usa_1910_current"
) # I am using my private table instead of this one.

requested_session = bigquery_storage_v1.types.ReadSession()
requested_session.table = table
requested_session.data_format = bigquery_storage_v1.enums.DataFormat.AVRO

parent = "projects/{}".format(project_id)
session = client.create_read_session(
    parent,
    requested_session,
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)

# The read stream contains blocks of Avro-encoded bytes. The rows() method
# uses the fastavro library to parse these blocks as an iterable of Python
# dictionaries.

rows = reader.rows(session)

Is it possible to save data from the stream directly to Google Cloud Storage?

I tried saving tables as AVRO files to my AI instance using fastavro and later uploading them to GCS using Blob.upload_from_filename(), but this process is very slow. I was hoping it would be possible to point the stream at my GCS bucket. I experimented with Blob.upload_from_file, but couldn't figure it out.

I cannot decode the whole stream into memory and use Blob.upload_from_string, because I don't have the ~300+ GB of RAM that would require.
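For reference, this is roughly the kind of direct-to-GCS streaming I was hoping to get working, using Blob.open() instead of Blob.upload_from_file (an untested sketch that continues from the reader/session objects above; it assumes a google-cloud-storage version that provides Blob.open(), and the bucket/object names are just placeholders):

import json

import fastavro
from google.cloud import storage

storage_client = storage.Client()
# Placeholder bucket / object names.
blob = storage_client.bucket("my-bucket").blob("export/partition.avro")

# The Storage API exposes the session's Avro schema as a JSON string.
schema = fastavro.parse_schema(json.loads(session.avro_schema.schema))

# Blob.open("wb") gives a file-like writer backed by a resumable upload, so
# rows go to GCS in chunks without being buffered on local disk or in RAM.
with blob.open("wb") as gcs_file:
    fastavro.writer(gcs_file, schema, reader.rows(session))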

I spent the last two days parsing GCP documentation, but couldn't find anything, so I would appreciate your help, preferably with a code snippet, if at all possible. (If working with another file format is easier, I am all for it.)

Thank you!

Is it possible to save data from the stream directly to Google Cloud Storage?

By itself, the BigQuery Storage API is not capable of writing directly to GCS; you'll need to pair the API with code to parse the data, write it to local storage, and subsequently upload to GCS. This could be code that you write manually, or code from a framework of some kind.

It looks like the code snippet that you've shared processes each partition in a single-threaded fashion, which caps your throughput at the throughput of a single read stream. The storage API is designed to achieve high throughput through parallelism, so it's meant to be used with a parallel processing framework such as Google Cloud Dataflow or Apache Spark. If you'd like to use Dataflow, there's a Google-provided template you can start from; for Spark, you can use the code snippets that David has already shared.
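If you do want to stay with hand-written code, the key change is to request more than one stream and consume them in parallel. A rough sketch using plain threads (untested; it uses the same client-library version as your snippet, and the table/parent values and the per-stream processing are placeholders):

import concurrent.futures

from google.cloud import bigquery_storage_v1

client = bigquery_storage_v1.BigQueryReadClient()

table = "projects/{}/datasets/{}/tables/{}".format(
    "your-project", "your_dataset", "your_table"
)
parent = "projects/{}".format("your-project")

requested_session = bigquery_storage_v1.types.ReadSession()
requested_session.table = table
requested_session.data_format = bigquery_storage_v1.enums.DataFormat.AVRO

# Ask for several streams; the server decides how many it actually grants,
# up to max_stream_count.
session = client.create_read_session(
    parent,
    requested_session,
    max_stream_count=8,
)

def consume(stream):
    # Placeholder: parse the rows of one stream and write/upload them.
    reader = client.read_rows(stream.name)
    for row in reader.rows(session):
        pass  # e.g. hand off to fastavro + a GCS upload

with concurrent.futures.ThreadPoolExecutor() as pool:
    list(pool.map(consume, session.streams))

Each stream can be consumed independently, which is exactly the parallelism that Dataflow or Spark manage for you at larger scale.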

An easy way to do that would be to use Spark with the spark-bigquery-connector. It uses the BigQuery Storage API to read the table directly into a Spark DataFrame. You can create a Spark cluster on Dataproc, which is located in the same data centers as BigQuery and GCS, making the read and write speeds much faster.

A code example would look like this:

df = spark.read.format("bigquery") \
  .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
  .load()

df.write.format("avro").save("gs://bucket/path")

You can also filter the data and work on each partition separately:

df = spark.read.format("bigquery") \
  .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
  .option("filter", "the_date='2020-05-12'") \
  .load()

# Or, if you prefer not to apply the partition filter at load time

df = spark.read.format("bigquery") \
  .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
  .load()

df.where("the_date='2020-05-12'").write....

Please note that in order to read large amounts of data you would need a sufficiently large cluster.
