
Spark configuration on EMR cluster for MLlib KMeans algorithm

I am trying to perform KMeans using Spark MLlib on a huge matrix with around 3,000,000 rows and 2048 columns. The matrix is around 76 GB in size and is stored as chunks of files on S3.

I am trying to set up Spark on EC2 instances through Amazon EMR. I have experimented with various configurations, but keep running into memory and disk errors when running KMeans on the cluster. Below is the Python script I use to create and configure the cluster.

import boto3

def lambda_handler():
  client = boto3.client('emr', region_name='us-west-1')
  client.run_job_flow(
      Name='kmeans',
      ReleaseLabel='emr-4.6.0',
      Instances={
          'MasterInstanceType': 'c3.8xlarge',
          'SlaveInstanceType': 'c3.8xlarge',
          'InstanceCount': 10,
          'Ec2KeyName': 'spark',
          'KeepJobFlowAliveWhenNoSteps': True,
          'TerminationProtected': True
      },
      Steps=[
          {
              'Name': 'kmeans',
              'ActionOnFailure': 'CANCEL_AND_WAIT',
              'HadoopJarStep': {
                  'Jar': 'command-runner.jar',
                  'Args': [
                      'spark-submit',
                      '--driver-memory','55G',
                      '--executor-memory','18G',
                      '--executor-cores','1',
                      '--num-executors','30',
                      '/home/hadoop/process_data.py'
                  ]
              }
          },
      ],
      BootstrapActions=[
          {
              'Name': 'cluster_setup',
              'ScriptBootstrapAction': {
                  'Path': 's3://../setup.sh',
                  'Args': []
              }
          }
      ],
      Applications=[
          {
              'Name': 'Spark'
          },
      ],
      Configurations=[
          {
              "Classification": "spark-env",
              "Properties": {

              },
              "Configurations": [
                  {
                      "Classification": "export",
                      "Properties": {
                          "PYSPARK_PYTHON": "/usr/bin/python2.7",
                          "PYSPARK_DRIVER_PYTHON": "/usr/bin/python2.7"
                      },
                      "Configurations": [

                      ]
                  }
              ]
          },
          {
              "Classification": "spark-defaults",
              "Properties": {
                  "spark.akka.frameSize": "2047",
                  "spark.driver.maxResultSize": "0"
              }
          }
      ],
      VisibleToAllUsers=True,
      JobFlowRole='EMR_EC2_DefaultRole',
      ServiceRole='EMR_DefaultRole'
  )

if __name__=='__main__':
    lambda_handler()
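
For context, process_data.py simply loads the matrix chunks from S3 and runs MLlib KMeans on them. A simplified sketch of that script (the S3 path, the comma-separated parsing, and the values of k and maxIterations below are placeholders, not the real ones):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName='kmeans')

# Assumption: each chunk file holds rows of 2048 comma-separated feature values.
data = sc.textFile('s3://my-bucket/matrix-chunks/*')  # hypothetical path
vectors = data.map(lambda line: [float(x) for x in line.split(',')])

# k and maxIterations are illustrative only.
model = KMeans.train(vectors, k=100, maxIterations=20)
print(model.clusterCenters[0])

sc.stop()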

I'd appreciate it if someone could give me a hint on how to set the following parameters for KMeans clustering on a dataset of this size:

  1. 'MasterInstanceType'
  2. 'SlaveInstanceType'
  3. 'InstanceCount'
  4. --driver-memory
  5. --executor-memory
  6. --executor-cores
  7. --num-executors
  8. spark.akka.frameSize
  9. spark.driver.maxResultSize

@Quantad were you able to resolve your config issue? Recently, I found myself looking at K-means from Spark ML. I had over 20 GB of data: over 6 million rows and 200 columns. This is the config I used (to make it work from the Zeppelin environment, but something similar could be done from spark-submit):

Master: m4.10xlarge
Worker: c3.8xlarge

(spark.executor.cores,4) #this could be increased to make use of all the cores
(spark.driver.memory,60g)
(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.shuffle.service.enabled,true)
(spark.master,yarn-client)
(spark.hadoop.yarn.timeline-service.enabled,false)
(spark.scheduler.mode,FAIR)
(spark.executor.memory,20g)
(spark.dynamicAllocation.enabled,true) #Spark 2.0.0, this takes care of the number of executors
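
If you are creating the cluster with run_job_flow as in the question, the same settings can also go under the spark-defaults classification instead of the Zeppelin interpreter config; a rough sketch of that mapping (values copied from above, adjust to your own instance types):

spark_defaults = {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.executor.cores": "4",
        "spark.executor.memory": "20g",
        "spark.driver.memory": "60g",
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.enabled": "true",
        "spark.scheduler.mode": "FAIR"
    }
}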

If you are running into GC overhead limit or Java heap space issues, you may also want to adjust

spark.yarn.executor.memoryOverhead (typically about 10% of the executor memory).
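
For example, with a 20g executor that comes out to roughly 2 GB. One way to set it explicitly is an extra pair of spark-submit arguments (the value is illustrative; the property takes megabytes):

'--conf', 'spark.yarn.executor.memoryOverhead=2048',  # ~10% of a 20g executor, in MB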

Other references that I have found helpful in the past: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ and http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
