
Load files from S3 to nodes of an EMR cluster in PySpark

I have a dataframe in an S3 bucket, split into 8 CSV files of 709.7 MB each.

I created an EMR cluster with 8 nodes (r3.4xlarge: 16 vCPUs, 122 GB RAM and 320 GB disk).

My Spark configurations are:

num-executors='23'
executor-memory='34G'
executor-cores='5'
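
For reference, a minimal sketch of setting the same values programmatically through SparkConf (an assumption on my part; the question does not say how these settings were actually applied, e.g. via spark-submit flags or an EMR configuration):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Programmatic equivalent of the settings listed above
conf = (SparkConf()
        .set("spark.executor.instances", "23")
        .set("spark.executor.memory", "34g")
        .set("spark.executor.cores", "5"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)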

I wrote this Python script to load my dataframe:

df = sqlContext.read.load("s3://my-bucket/my-dataframe/*", 
                              format='com.databricks.spark.csv', 
                              header='true',
                              delimiter='\t',
                              inferSchema='true')

The problem: when I look at the stages in the Spark History Server, here is what I see.

[screenshot of the stages page in the Spark History Server]

3 of the CSV files are not loaded correctly. Does anyone have a solution to this problem, or an idea of the cause?

Have a look at the actual output, in case the reporting is confused.
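
For instance, one quick way to check which files were actually read and how many rows came from each (a sketch, assuming the DataFrame from the question is loaded as df):

from pyspark.sql.functions import input_file_name

# Count rows per source file to see whether all 8 CSVs contributed data
(df.withColumn("source_file", input_file_name())
   .groupBy("source_file")
   .count()
   .show(truncate=False))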

BTW, that inferSchema option forces a scan through the entire CSV file just to work out its schema, which here doubles the amount of data read from 700 MB per file to 1400 MB. If you are reading the data long-haul, you are doubling your bill; if it is local, that is still a lot of time frittered away. Work out the schema once and declare it on the DataFrame.
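
For example, a minimal sketch of declaring the schema up front instead of using inferSchema (the column names and types below are hypothetical; substitute the real ones for your data):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema -- replace the field names and types with the real columns
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", DoubleType(), True),
])

df = sqlContext.read.load("s3://my-bucket/my-dataframe/*",
                          format='com.databricks.spark.csv',
                          header='true',
                          delimiter='\t',
                          schema=schema)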
