
PySpark too slow in Google Cloud Dataproc

I deployed a PySpark ML model to a Google Cloud Dataproc cluster and it has been running for over an hour, even though my data is only about 800 MB.

Do I need to declare anything as the master on my SparkSession? I set it to the option 'local'.
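For reference, the session setup in question presumably looks something like the sketch below; the app name is an assumption, only the 'local' master is stated above:

```python
from pyspark.sql import SparkSession

# Hypothetical reconstruction of the setup described in the question:
# hard-coding master("local") forces the job to run on a single VM.
spark = (
    SparkSession.builder
    .appName("ml-model")   # illustrative name, not from the question
    .master("local")       # the setting the question refers to
    .getOrCreate()
)
```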

When you pass the 'local' master option to SparkContext, your application executes locally on a single VM. To avoid this, do not pass any master option in the SparkContext or SparkSession constructor: the application will then pick up the properties pre-configured by Dataproc and run on YARN, utilizing all cluster resources/nodes.
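A minimal sketch of the fix, assuming the session is built with the standard SparkSession.builder API: simply omit .master(...) so the cluster's pre-configured spark.master (YARN on Dataproc) takes effect.

```python
from pyspark.sql import SparkSession

# No .master(...) here: on Dataproc, spark-defaults.conf already sets
# spark.master=yarn, so the job is distributed across the worker nodes
# instead of running inside a single VM.
spark = (
    SparkSession.builder
    .appName("ml-model")  # illustrative name
    .getOrCreate()
)
```

Submitting the script with `gcloud dataproc jobs submit pyspark your_script.py --cluster=your-cluster` (script and cluster names are placeholders) then runs it on YARN with the cluster's resources.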
