
How to call or import a table from Google Cloud SQL into a Spark dataframe?

I have created an instance in Google Dataproc and I am running PySpark on it. I am trying to import data from a table into PySpark, so I created a table in Google Cloud SQL. But I don't know how to call or import this table from PySpark; I don't have any URL or similar to point to this table. Could you please help in this regard?

Normally, you could use spark.read.jdbc(); see How to work with MySQL and Apache Spark?

The challenge with Cloud SQL is networking: figuring out how to connect to the instance. There are two main ways to do this:

1) Install the Cloud SQL proxy

You can use this initialization action to do that for you. Follow the instructions under "without configuring Hive metastore", since you don't need to do that:

gcloud dataproc clusters create <CLUSTER_NAME> \
    --scopes sql-admin \
    --initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
    --metadata "enable-cloud-sql-hive-metastore=false"

The proxy is a local daemon that listens on localhost:3306 and proxies connections to the Cloud SQL instance. You'd need to use localhost:3306 in the JDBC connection URI you pass to spark.read.jdbc(), as in the sketch below.
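
For illustration, a read through the proxy from PySpark might look like the following sketch. This assumes the proxy is running on the cluster nodes and the MySQL JDBC driver is available on the classpath; the database, table, user, and password values are placeholders, not real names.

# Minimal sketch: read a Cloud SQL (MySQL) table via the local Cloud SQL proxy.
# <DATABASE>, <TABLE>, <USER>, and <PASSWORD> are placeholders for your own values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloud-sql-read").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/<DATABASE>")
    .option("dbtable", "<TABLE>")
    .option("user", "<USER>")
    .option("password", "<PASSWORD>")
    .option("driver", "com.mysql.jdbc.Driver")
    .load()
)

df.show()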

2) If you're instead willing to add to your driver classpath, you can consider installing the Cloud SQL Socket Factory.

There's some discussion about how to do this here: https://groups.google.com/forum/#!topic/cloud-dataproc-discuss/Ns6umF_FX9g and here: Spark - Adding JDBC Driver JAR to Google Dataproc.

It sounds like you can either package it into a shaded application jar in pom.xml, or just provide it at runtime by adding it via --jars, as sketched below.
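
As a rough sketch of the --jars approach (the bucket path and jar name below are placeholders; use wherever you have staged the socket factory jar, and check its README for the exact artifact name and JDBC URL format):

gcloud dataproc jobs submit pyspark my_job.py \
    --cluster <CLUSTER_NAME> \
    --jars gs://<YOUR_BUCKET>/mysql-socket-factory-jar-with-dependencies.jar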
