
Spark job (shuffle) taking too long to finish

I'm running a Spark job on EMR, trying to convert a large zipped CSV file (15 GB) to Parquet, but it is taking too long to write to S3.

I'm using R5 instances for the master node (1 instance) and the core nodes (3 instances). Here is my code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

def main():
    spark = SparkSession \
        .builder \
        .appName("csv-to-parquer-convertor") \
        .config("spark.sql.catalogimplementation", "hive") \
        .config("hive.metastore.connect.retries", 3) \
        .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport().getOrCreate()   

    tgt_filename = 'SOME_Prefix'
    src_path = 'SOURCE_S3_PATH'
    tgt_path = 'TARGET_BUCKET' + tgt_filename

    # Read the zipped CSV, repartition, and write Parquet back to S3
    df = spark.read.csv(src_path, header=True)
    partitioned_df = df.repartition(50)
    partitioned_df.write.mode('append').parquet(path=tgt_path)
    spark.stop()

if __name__ == "__main__":
    main()

If you want better performance, stop using S3. I am being serious. This job simply isn't doing enough work for code-level tuning to matter much; it is a straightforward conversion. You will get better performance by changing what takes the longest, which is your read/write speed. That is almost certainly the bottleneck here. To fix it, look at storage that performs better than S3. An HDFS cluster performs better and works with Spark out of the box, so it might be a good first alternative. There are of course other options as well, but it depends on what you're comfortable with.
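
For example, on EMR you could write the Parquet output to the cluster's HDFS first and only move the final result to S3 afterwards. Below is a minimal sketch under that assumption; the hdfs:///tmp/converted/ staging path and the copy step afterwards are illustrative, not taken from the question.

from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder \
        .appName("csv-to-parquet-via-hdfs") \
        .getOrCreate()

    src_path = 'SOURCE_S3_PATH'
    # Hypothetical staging location on the cluster's local HDFS
    hdfs_path = 'hdfs:///tmp/converted/'

    # Same conversion as in the question, but the heavy write goes to HDFS
    df = spark.read.csv(src_path, header=True)
    df.repartition(50).write.mode('append').parquet(hdfs_path)

    spark.stop()

if __name__ == "__main__":
    main()

If the data ultimately has to live in S3, you can then copy it out of HDFS in a separate EMR step, for example with s3-dist-cp --src hdfs:///tmp/converted/ --dest s3://TARGET_BUCKET/SOME_Prefix/ (bucket and prefix are placeholders), so only one bulk copy touches S3.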
