
Load files from S3 to nodes of an EMR cluster in PySpark

I have a dataframe in an S3 bucket, split into 8 CSV files of 709.7 MB each.

I created an EMR cluster with 8 nodes (r3.4xlarge: 16 vCPUs, 122 GB RAM and 320 GB disk).

My Spark configurations are:

num-executors='23'
executor-memory='34G'
executor-cores='5'
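
For reference, a minimal sketch of setting the same values programmatically through SparkConf (an assumption on my part; the question does not say how these settings were actually applied, e.g. via spark-submit flags or an EMR configuration):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Programmatic equivalent of the settings listed above
conf = (SparkConf()
        .set("spark.executor.instances", "23")
        .set("spark.executor.memory", "34g")
        .set("spark.executor.cores", "5"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)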

I wrote this Python script to load my dataframe:

df = sqlContext.read.load("s3://my-bucket/my-dataframe/*", 
                              format='com.databricks.spark.csv', 
                              header='true',
                              delimiter='\t',
                              inferSchema='true')

The problem: when I look at the stages in the Spark History Server, here is what I see.

[screenshot of the stages page in the Spark History Server]

3 of the CSV files are not loaded correctly. Does anyone have a solution to this problem, or an idea of the cause?

Have a look at the actual output, in case the reporting is confused.
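
For instance, one quick way to check which files were actually read and how many rows came from each (a sketch, assuming the DataFrame from the question is loaded as df):

from pyspark.sql.functions import input_file_name

# Count rows per source file to see whether all 8 CSVs contributed data
(df.withColumn("source_file", input_file_name())
   .groupBy("source_file")
   .count()
   .show(truncate=False))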

BTW, that inferSchema option forces a scan through the entire CSV file just to work out its schema, which here doubles the amount of data read from 700 MB per file to 1400 MB. If you are reading the data long-haul, you are doubling your bill; if it is local, that is still a lot of time frittered away. Work out the schema once and declare it on the DataFrame.
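
For example, a minimal sketch of declaring the schema up front instead of using inferSchema (the column names and types below are hypothetical; substitute the real ones for your data):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema -- replace the field names and types with the real columns
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", DoubleType(), True),
])

df = sqlContext.read.load("s3://my-bucket/my-dataframe/*",
                          format='com.databricks.spark.csv',
                          header='true',
                          delimiter='\t',
                          schema=schema)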
