I have a dataframe in an S3 bucket, split into 8 CSV files of 709.7 MB each.
I created an EMR cluster with 8 nodes (r3.4xlarge: 16 vCPUs, 122 GB RAM, and 320 GB disk).
My Spark configurations are:
num-executors='23'
executor-memory='34G'
executor-cores='5'
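For reference, those three settings are typically passed to spark-submit as flags; a sketch of the invocation (the script name is a placeholder, not from the question):

```shell
# Hypothetical spark-submit call using the settings above;
# load_dataframe.py stands in for your actual script.
spark-submit \
  --num-executors 23 \
  --executor-memory 34G \
  --executor-cores 5 \
  load_dataframe.py
```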
I wrote this Python script to load my dataframe:

df = sqlContext.read.load("s3://my-bucket/my-dataframe/*",
                          format='com.databricks.spark.csv',
                          header='true',
                          delimiter='\t',
                          inferSchema='true')
The problem: when I look at the stages in the Spark History Server, 3 of the CSV files are not loaded correctly. Does anyone have a solution, or an idea of the cause?
Have a look at the actual output, in case the reporting is just confused.
BTW, that inferSchema option forces a scan through each entire CSV file just to work out its schema, which here doubles the amount of data read, from ~700 MB/file to ~1400 MB. If you are reading over a long-haul link, you are doubling your bills; if local, that's still a lot of time frittered away. Work out the schema once and declare it on the DataFrame.