
Apache Spark on Google Cloud, Cluster or no Cluster

I want to use Apache Spark to manipulate a huge volume of data in Google Cloud.

I followed the documentation to spin up a Cloud Dataproc cluster with 5 nodes. Everything works flawlessly.

But my data is on Google Cloud Storage, and I have learned that Spark can query it directly, which is what Google recommends.

In that case, is it necessary to spin up an entire cluster? Is Spark as efficient on Google Cloud Storage as on HDFS?

If not, it would be easier to just spin up one big VM with Jupyter and Spark and use it to run jobs on the data stored in GCS.

On a Dataproc cluster, Spark can process data stored in HDFS and in GCS (Google Cloud Storage) with comparable efficiency, because the Cloud Storage connector is preinstalled on every Dataproc node. The size of your cluster should be decided by the computation your Spark job performs, not by where the data lives. When weighing one big VM against multiple smaller VMs, there are several tradeoffs to consider; the main one is that vertical scaling on a single VM hits a hard upper limit, while a cluster can keep scaling horizontally.
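
To make the "comparable efficiency" point concrete, here is a minimal PySpark sketch you could run in a notebook on a Dataproc cluster. The bucket, paths, and column name are hypothetical; only the URI scheme changes between the two sources:

    from pyspark.sql import SparkSession

    # On Dataproc the GCS connector is preinstalled, so gs:// paths
    # work out of the box, with the same DataFrame API as hdfs:// paths.
    spark = SparkSession.builder.appName("gcs-vs-hdfs").getOrCreate()

    gcs_df = spark.read.parquet("gs://your-bucket/events/")    # hypothetical bucket
    hdfs_df = spark.read.parquet("hdfs:///data/events/")       # hypothetical HDFS path

    # Identical transformations regardless of where the data lives.
    gcs_df.groupBy("event_type").count().show()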

If you only need to analyze data on Google Cloud Storage, I would recommend creating a Dataproc cluster only when you need it. But it still depends on how long the job takes and how frequently you run it.

For example, if you have a scheduled hourly ETL job, you can create a new cluster every hour and delete it when the job is done. This is very cost-effective; a sketch of the pattern follows.
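
Here is a minimal sketch of that create-run-delete pattern, assuming the google-cloud-dataproc Python client library; the project ID, region, machine types, and GCS script path are all placeholders:

    from google.cloud import dataproc_v1

    PROJECT = "your-project-id"   # placeholder project
    REGION = "us-central1"        # placeholder region
    CLUSTER = "ephemeral-etl"
    ENDPOINT = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}

    cluster_client = dataproc_v1.ClusterControllerClient(client_options=ENDPOINT)
    job_client = dataproc_v1.JobControllerClient(client_options=ENDPOINT)

    # 1. Create a small cluster just for this run and wait until it is ready.
    cluster = {
        "project_id": PROJECT,
        "cluster_name": CLUSTER,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    }
    cluster_client.create_cluster(
        request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
    ).result()

    # 2. Submit the PySpark ETL script (stored on GCS) and block until it finishes.
    job = {
        "placement": {"cluster_name": CLUSTER},
        "pyspark_job": {"main_python_file_uri": "gs://your-bucket/etl.py"},
    }
    job_client.submit_job_as_operation(
        request={"project_id": PROJECT, "region": REGION, "job": job}
    ).result()

    # 3. Tear the cluster down so you stop paying for idle VMs.
    cluster_client.delete_cluster(
        request={"project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER}
    ).result()

Dataproc workflow templates, or plain gcloud commands driven by a scheduler, achieve the same ephemeral-cluster pattern with less code.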
