
Apache Spark optimization

I'm using Spark MLlib with PySpark for my assignment and need to prove that it is better than traditional machine learning methods. I have a dataset on which I'm running logistic regression, and I'm computing metrics like accuracy, precision, recall, etc.

While running the code in PySpark and in a normal Python script, I realized that the normal Python script finished faster, which should not have been the case, as there is a lot of data in the dataset. I dug deeper and realized that Spark was running with just 1 worker that was assigned just one core. Hence, I have made the following changes in spark-defaults.conf, as I have a VM with 8 vCPUs and 16 GB of RAM.

spark.driver.memory 8g
spark.driver.cores 8
spark.executor.instances 8

Now the time taken by Spark to run the ML code on the data has reduced significantly. Are there any further optimizations I should look at? I'm running Spark in standalone mode, i.e. my master and worker are the same node.

Remember that Spark is targeted at big-data environments, so it is probably not going to be the fastest solution for small datasets (size < 1 GB), but it becomes a must for very large ones (size > several TB). This is caused by Spark's Java (JVM) overhead, which adds a lot of complexity that is wasted on small computations. In cluster environments (Hadoop), on the other hand, the framework ensures that you will still be able to complete your tasks even if some nodes go down. For smaller datasets, all ML frameworks that use GPUs are competitors to Spark, but Spark gives you a lot more than just ML.

Here are a couple of articles that you may find useful for tuning: https://spark.apache.org/docs/latest/tuning.html https://spark.apache.org/docs/latest/sql-performance-tuning.html
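As a rough illustration of the kind of sizing those guides discuss, here is a small Python sketch that splits the 8 vCPU / 16 GB VM from the question into a driver and a few executors and prints the matching spark-defaults.conf lines. The function names, the 2-core executors, and the cores and memory set aside for the OS and the driver are all assumptions made for this example, not recommendations from the Spark docs. As far as I understand, in standalone mode the number of executors follows from spark.cores.max and spark.executor.cores, which is why the sketch sets those two properties.

```python
def suggest_standalone_conf(total_cores, total_memory_gb, executor_cores=2,
                            driver_memory_gb=2, os_reserved_cores=1,
                            os_reserved_memory_gb=2):
    """Split one machine's resources into a driver plus equal-sized executors.

    Every default here is an illustrative assumption, not an official rule.
    """
    # Keep some cores free for the OS and the driver process.
    usable_cores = total_cores - os_reserved_cores
    executors = usable_cores // executor_cores
    if executors < 1:
        raise ValueError("not enough cores for a single executor")

    # Whatever memory is left after the OS and the driver is shared evenly.
    usable_memory_gb = total_memory_gb - os_reserved_memory_gb - driver_memory_gb
    executor_memory_gb = usable_memory_gb // executors
    if executor_memory_gb < 1:
        raise ValueError("not enough memory for the executors")

    return {
        "spark.driver.memory": f"{driver_memory_gb}g",
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.cores.max": str(executors * executor_cores),
    }


def render_conf(conf):
    """Format the settings the way spark-defaults.conf expects: 'key value'."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


conf = suggest_standalone_conf(total_cores=8, total_memory_gb=16)
print(render_conf(conf))
```

Treat the numbers as a starting point only: measure the job with a few different splits, since the best balance between executor count and executor size depends on the data and the algorithm.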

My advice is to use DataFrames rather than RDDs whenever you can, since the Catalyst optimizer kicks in and speeds your jobs up.
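To make the DataFrame advice concrete, here is a minimal sketch of logistic regression using the DataFrame-based pyspark.ml package (as opposed to the RDD-based pyspark.mllib). The column names (label, x1, x2), the toy rows, the helper names, and the local[*] master are made up for illustration; it assumes PySpark and a Java runtime are installed.

```python
def fit_and_score(train_df, test_df, feature_cols, label_col="label"):
    """Fit logistic regression on Spark DataFrames and return test accuracy."""
    # Imported inside the function so PySpark is only needed when this runs.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    # pyspark.ml estimators expect the features packed into one vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col, maxIter=20)
    model = Pipeline(stages=[assembler, lr]).fit(train_df)

    predictions = model.transform(test_df)
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(predictions)


def main():
    from pyspark.sql import SparkSession

    # local[*] uses every core on the machine; swap in your master URL.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("lr-dataframe-sketch")
        .getOrCreate()
    )
    # Toy rows standing in for the real dataset: (label, x1, x2).
    rows = [
        (0.0, 1.0, 0.1),
        (0.0, 1.5, 0.3),
        (0.0, 0.8, 0.2),
        (1.0, 3.0, 2.5),
        (1.0, 3.5, 2.9),
        (1.0, 2.8, 2.2),
    ]
    df = spark.createDataFrame(rows, ["label", "x1", "x2"])
    # Scoring on the training rows only keeps the sketch short; use a
    # proper train/test split (for example df.randomSplit) on real data.
    accuracy = fit_and_score(df, df, ["x1", "x2"])
    print(f"accuracy on the toy data: {accuracy:.2f}")
    spark.stop()
```

Calling main() runs the whole thing end to end. On real data you would load the dataset with spark.read instead of createDataFrame, and the same evaluator can report other metrics by changing metricName.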


 