
Spark job (shuffle) taking too long to finish

I'm running a Spark job on EMR, trying to convert a large zipped CSV file (15 GB) to Parquet, but it is taking too long to write to S3.

I'm using R5 instances for the master node (1 instance) and the core nodes (3 instances). Here is my code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

def main():
    spark = SparkSession \
        .builder \
        .appName("csv-to-parquer-convertor") \
        .config("spark.sql.catalogimplementation", "hive") \
        .config("hive.metastore.connect.retries", 3) \
        .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport().getOrCreate()   

    tgt_filename = 'SOME_Prefix'
    src_path = 'SOURCE_S3_PATH'
    tgt_path = 'TARGET_BUCKET' + tgt_filename

    # Read the zipped CSV, repartition, and write Parquet back to S3
    df = spark.read.csv(src_path, header=True)
    partitioned_df = df.repartition(50)
    partitioned_df.write.mode('append').parquet(path=tgt_path)
    spark.stop()

if __name__ == "__main__":
    main()

If you want better performance, stop using S3. I am being serious. This job simply isn't doing enough work for code-level tuning to matter much; it is a straightforward conversion. You will get better performance by changing what takes the longest, which is your read/write speed. That is almost certainly the bottleneck here. To fix it, look at storage that performs better than S3. An HDFS cluster performs better and works with Spark out of the box, so it might be a good first alternative. There are of course other options as well, but it depends on what you're comfortable with.
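
For example, on EMR you could write the Parquet output to the cluster's HDFS first and only move the final result to S3 afterwards. Below is a minimal sketch under that assumption; the hdfs:///tmp/converted/ staging path and the copy step afterwards are illustrative, not taken from the question.

from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder \
        .appName("csv-to-parquet-via-hdfs") \
        .getOrCreate()

    src_path = 'SOURCE_S3_PATH'
    # Hypothetical staging location on the cluster's local HDFS
    hdfs_path = 'hdfs:///tmp/converted/'

    # Same conversion as in the question, but the heavy write goes to HDFS
    df = spark.read.csv(src_path, header=True)
    df.repartition(50).write.mode('append').parquet(hdfs_path)

    spark.stop()

if __name__ == "__main__":
    main()

If the data ultimately has to live in S3, you can then copy it out of HDFS in a separate EMR step, for example with s3-dist-cp --src hdfs:///tmp/converted/ --dest s3://TARGET_BUCKET/SOME_Prefix/ (bucket and prefix are placeholders), so only one bulk copy touches S3.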
