Loading Data from Google BigQuery into Spark (on Databricks)

I want to load data into Spark (on Databricks) from Google BigQuery. I notice that Databricks offers a lot of support for Amazon S3 but not for Google.

What is the best way to load data into Spark (on Databricks) from Google BigQuery? Would the BigQuery connector allow me to do this, or is it only valid for files hosted on Google Cloud Storage?

The BigQuery Connector is a client-side library that uses the public BigQuery API: it runs BigQuery export jobs to Google Cloud Storage, and takes advantage of file creation ordering to start Hadoop processing early and increase overall throughput.

This code should work wherever you happen to locate your Hadoop cluster.

That said, if you are running over large data, you might find network bandwidth to be a bottleneck (how good is your network connection to Google?), and since you are reading data out of Google's network, GCS network egress charges will apply.
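For illustration, here is a rough PySpark sketch of reading a table through that connector's Hadoop InputFormat, based on the connector's published examples; the class name, the mapred.bq.* property names, and all values in angle brackets are assumptions to adapt to your environment (sc is the active SparkContext):

# Read a BigQuery table via the Hadoop BigQuery connector.
# The connector stages an export under mapred.bq.temp.gcs.path first,
# then reads the exported files as the job's input.
conf = {
    "mapred.bq.project.id": "<billing-project-id>",
    "mapred.bq.gcs.bucket": "<staging-bucket>",
    "mapred.bq.temp.gcs.path": "gs://<staging-bucket>/tmp/bigquery",
    "mapred.bq.input.project.id": "<source-project-id>",
    "mapred.bq.input.dataset.id": "<dataset>",
    "mapred.bq.input.table.id": "<table>",
}

# Each record comes back as (record number, JSON text of the row).
table_data = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)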

Databricks has now documented how to use Google BigQuery via Spark here

Set the following Spark config in the cluster settings:

credentials <base64-keys>

spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
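The credentials value is the service account JSON key encoded as base64. As a minimal sketch, assuming the key was downloaded to keyfile.json (a hypothetical path), it can be produced like this:

import base64

# Base64-encode the downloaded service account key; paste the output
# into the `credentials` Spark config above.
with open("keyfile.json", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
print(encoded)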

Then, in PySpark, use:

# Placeholders in angle brackets must be replaced with your own values
table = "<dataset>.<table>"

df = spark.read.format("bigquery") \
  .option("table", table) \
  .option("project", "<project-id>") \
  .option("parentProject", "<parent-project-id>") \
  .load()
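The result is an ordinary Spark DataFrame, so a quick sanity check might look like this (the view name bq_table is just an example):

df.printSchema()
df.createOrReplaceTempView("bq_table")
spark.sql("SELECT COUNT(*) FROM bq_table").show()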
