I am trying to run KMeans from Spark MLlib on a huge matrix with around 3,000,000 rows and 2048 columns, about 76 GB in total. The matrix is stored as chunks of files on S3.
I am setting up Spark on EC2 instances through Amazon EMR. I have tried various configurations, but I keep hitting memory and disk errors when running KMeans on the cluster. The following Python script is what I use to create and configure the EMR cluster.
import boto3

def lambda_handler():
    # Launch an EMR cluster and submit the KMeans job as a Spark step.
    client = boto3.client('emr', region_name='us-west-1')
    client.run_job_flow(
        Name='kmeans',
        ReleaseLabel='emr-4.6.0',
        Instances={
            'MasterInstanceType': 'c3.8xlarge',
            'SlaveInstanceType': 'c3.8xlarge',
            'InstanceCount': 10,
            'Ec2KeyName': 'spark',
            'KeepJobFlowAliveWhenNoSteps': True,
            'TerminationProtected': True
        },
        Steps=[
            {
                'Name': 'kmeans',
                'ActionOnFailure': 'CANCEL_AND_WAIT',
                'HadoopJarStep': {
                    'Jar': 'command-runner.jar',
                    'Args': [
                        'spark-submit',
                        '--driver-memory', '55G',
                        '--executor-memory', '18G',
                        '--executor-cores', '1',
                        '--num-executors', '30',
                        '/home/hadoop/process_data.py'
                    ]
                }
            },
        ],
        BootstrapActions=[
            {
                'Name': 'cluster_setup',
                'ScriptBootstrapAction': {
                    'Path': 's3://../setup.sh',
                    'Args': []
                }
            }
        ],
        Applications=[
            {'Name': 'Spark'},
        ],
        Configurations=[
            {
                # Point PySpark at Python 2.7 on both the driver and the executors.
                "Classification": "spark-env",
                "Properties": {},
                "Configurations": [
                    {
                        "Classification": "export",
                        "Properties": {
                            "PYSPARK_PYTHON": "/usr/bin/python2.7",
                            "PYSPARK_DRIVER_PYTHON": "/usr/bin/python2.7"
                        },
                        "Configurations": []
                    }
                ]
            },
            {
                "Classification": "spark-defaults",
                "Properties": {
                    "spark.akka.frameSize": "2047",
                    "spark.driver.maxResultSize": "0"
                }
            }
        ],
        VisibleToAllUsers=True,
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole'
    )

if __name__ == '__main__':
    lambda_handler()
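For context, /home/hadoop/process_data.py itself is not shown above. Below is a minimal hypothetical sketch of what such a script could look like, assuming each matrix chunk on S3 is a plain-text file with one 2048-column row of comma-separated floats per line; the S3 path and the number of clusters k are placeholders, not values from the original question.

# Hypothetical sketch of process_data.py (the real script is not shown).
# Assumes comma-separated text chunks on S3; path and k are placeholders.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName='kmeans')

# Read every chunk under the prefix; each line becomes one 2048-element row.
lines = sc.textFile('s3://my-bucket/matrix-chunks/*')
rows = lines.map(lambda line: [float(x) for x in line.split(',')]).cache()

# Train MLlib KMeans on the row vectors.
model = KMeans.train(rows, k=10, maxIterations=20)
print(model.clusterCenters)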
I'd appreciate it if someone could give me a hint on how to choose these parameters (driver memory, executor memory, executor cores, number of executors) for KMeans clustering on data of this size.
@Quantad, were you able to resolve your config issue? Recently, I found myself looking at K-means from Spark ML. I had over 20 GB of data: over 6 million rows and 200 columns. This is the config I used to make it work from the Zeppelin environment, but something similar could be done with spark-submit (see the SparkConf sketch after the list):
Master: m4.10xlarge
Worker: c3.8xlarge
(spark.executor.cores, 4)  # this could be increased to make use of all the cores
(spark.driver.memory, 60g)
(spark.executor.extraJavaOptions, -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.shuffle.service.enabled, true)
(spark.master, yarn-client)
(spark.hadoop.yarn.timeline-service.enabled, false)
(spark.scheduler.mode, FAIR)
(spark.executor.memory, 20g)
(spark.dynamicAllocation.enabled, true)  # Spark 2.0.0; this takes care of the number of executors
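For completeness, roughly the same settings can be applied programmatically with SparkConf instead of the Zeppelin interpreter config. Note that spark.driver.memory normally has to be set before the driver JVM starts (e.g. on the spark-submit command line), so it is omitted from this sketch.

# Sketch: applying the settings above through SparkConf rather than Zeppelin.
# spark.driver.memory is left out: it must be set before the driver JVM starts.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('kmeans')
        .set('spark.executor.cores', '4')
        .set('spark.executor.memory', '20g')
        .set('spark.shuffle.service.enabled', 'true')
        .set('spark.dynamicAllocation.enabled', 'true')
        .set('spark.scheduler.mode', 'FAIR'))
sc = SparkContext(conf=conf)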
If you are running into GC limit or Java heap memory issues, you may also want to adjust spark.yarn.executor.memoryOverhead (typically around 10% of the executor memory).
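As an illustration, in the Args list of the question's boto3 step this could be passed with a --conf flag; the 2048 MB value below is a placeholder sized at roughly 10% of an 18-20 GB executor heap, not a value from the original posts.

# Illustrative spark-submit arguments with the YARN memory overhead raised;
# 2048 (MB) is a placeholder at ~10% of the executor memory.
args = [
    'spark-submit',
    '--executor-memory', '18G',
    '--conf', 'spark.yarn.executor.memoryOverhead=2048',
    '/home/hadoop/process_data.py'
]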
Other references that I have found helpful in the past:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/