
Apache Spark on Google Cloud, Cluster or no Cluster

I want to use Apache Spark to manipulate a huge volume of data in Google Cloud.

I followed the documentation to spin up a Cloud Dataproc cluster with 5 nodes. Everything works flawlessly.

But my data is on Google Cloud Storage, and I have learned that Spark can query it directly, which is what Google recommends.

In that case, is it necessary to spin up an entire cluster? Is Spark as efficient on Google Cloud Storage as on HDFS?

If not, it would be easier to just spin up one big VM with Jupyter and Spark and use it to run jobs on the data stored in GCS.

On a Dataproc cluster, Spark can process data stored in HDFS and in GCS (Google Cloud Storage) with comparable efficiency, because the Cloud Storage connector is preinstalled on every Dataproc node. The size of your cluster should be decided by the computation your Spark job performs, not by where the data lives. When weighing one big VM against multiple smaller VMs, there are several tradeoffs to consider; the main one is that vertical scaling on a single VM hits a hard upper limit, while a cluster can keep scaling horizontally.
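
To make the "comparable efficiency" point concrete, here is a minimal PySpark sketch you could run in a notebook on a Dataproc cluster. The bucket, paths, and column name are hypothetical; only the URI scheme changes between the two sources:

    from pyspark.sql import SparkSession

    # On Dataproc the GCS connector is preinstalled, so gs:// paths
    # work out of the box, with the same DataFrame API as hdfs:// paths.
    spark = SparkSession.builder.appName("gcs-vs-hdfs").getOrCreate()

    gcs_df = spark.read.parquet("gs://your-bucket/events/")    # hypothetical bucket
    hdfs_df = spark.read.parquet("hdfs:///data/events/")       # hypothetical HDFS path

    # Identical transformations regardless of where the data lives.
    gcs_df.groupBy("event_type").count().show()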

If you only need to analyze data on Google Cloud Storage, I would recommend creating a Dataproc cluster only when you need it. But it still depends on how long the job takes and how frequently you run it.

For example, if you have a scheduled hourly ETL job, you can create a new cluster every hour and delete it when the job is done. This is very cost-effective; a sketch of the pattern follows.
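
Here is a minimal sketch of that create-run-delete pattern, assuming the google-cloud-dataproc Python client library; the project ID, region, machine types, and GCS script path are all placeholders:

    from google.cloud import dataproc_v1

    PROJECT = "your-project-id"   # placeholder project
    REGION = "us-central1"        # placeholder region
    CLUSTER = "ephemeral-etl"
    ENDPOINT = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}

    cluster_client = dataproc_v1.ClusterControllerClient(client_options=ENDPOINT)
    job_client = dataproc_v1.JobControllerClient(client_options=ENDPOINT)

    # 1. Create a small cluster just for this run and wait until it is ready.
    cluster = {
        "project_id": PROJECT,
        "cluster_name": CLUSTER,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    }
    cluster_client.create_cluster(
        request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
    ).result()

    # 2. Submit the PySpark ETL script (stored on GCS) and block until it finishes.
    job = {
        "placement": {"cluster_name": CLUSTER},
        "pyspark_job": {"main_python_file_uri": "gs://your-bucket/etl.py"},
    }
    job_client.submit_job_as_operation(
        request={"project_id": PROJECT, "region": REGION, "job": job}
    ).result()

    # 3. Tear the cluster down so you stop paying for idle VMs.
    cluster_client.delete_cluster(
        request={"project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER}
    ).result()

Dataproc workflow templates, or plain gcloud commands driven by a scheduler, achieve the same ephemeral-cluster pattern with less code.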
