
Is Spark a good fit for automatically running a statistical analysis script on many nodes for a speedup?

I have a Python script that runs statistical analysis and trained deep learning models on input data. The data is fairly small (~5 MB), but the script is slow because of the complexity of the analysis. I wonder whether I could use Spark to run my script on different nodes of a cluster to gain a speedup. Basically, I want to divide the input data into many subsets and run the analysis script on them in parallel. Is Spark a good tool for this purpose? Thank you in advance!

As long as you integrate your deep learning model into your PySpark pipeline and use partitioning, you can expect a speedup in the runtime. Without code, it's hard to make specific recommendations, but this article is a good place to start.
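As a rough illustration of the divide-and-run-in-parallel idea: since the data is small, you can split it into subsets on the driver, distribute the subsets across the cluster, and apply the analysis to each one as an independent task. This is only a minimal sketch; `run_analysis` and the input file name are hypothetical stand-ins for your actual analysis script and data.

```python
# Minimal sketch of running one analysis function over many data subsets
# in parallel with Spark. Assumes each subset can be analyzed independently.
from pyspark.sql import SparkSession
import pandas as pd

def run_analysis(subset: pd.DataFrame) -> dict:
    # Hypothetical placeholder for your statistical analysis / model inference.
    return {"n_rows": len(subset)}

spark = SparkSession.builder.appName("parallel-analysis").getOrCreate()

# Load the (small) input data on the driver and split it into N subsets.
df = pd.read_csv("input.csv")  # hypothetical input file
n_subsets = 8
subsets = [chunk for _, chunk in df.groupby(df.index % n_subsets)]

# Distribute the subsets; each Spark task runs the analysis on one subset,
# and the per-subset results are collected back on the driver.
rdd = spark.sparkContext.parallelize(subsets, numSlices=n_subsets)
results = rdd.map(run_analysis).collect()
print(results)
```

Note that the speedup is bounded by the number of subsets and by how evenly the work divides among them, so choose the split to match your cluster's available cores.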

