
Create 1GB partitions Spark SQL

I'm trying to split my data into 1GB files when writing to S3 using Spark. The approach I tried was to calculate the size of the DeltaTable in GB (the define_coalesce function), round it, and use that number to write to S3:

# Vacuum to leave 1 week of history
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.vacuum(168)
deltaTable.generate("symlink_format_manifest")

# Reading the delta table and rewriting with coalesce to reach 1GB per file
df = spark.read.format('delta').load(f"s3a://{delta_table}")

# define_coalesce calculates the size of the delta table in GB
coalesce_number = define_coalesce(delta_table)

df.coalesce(coalesce_number).write.format("delta").mode('overwrite').option('overwriteSchema', 'true').save(f"s3a://{delta_table}")

deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.generate("symlink_format_manifest")

I'm trying it this way because we are on open-source Delta Lake and don't have the OPTIMIZE method built in.

I did some searching and found the spark.sql.files.maxPartitionBytes configuration in Spark, but some people said it didn't solve their problem, since that config controls partitioning when reading, not when writing.
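For reference, a minimal sketch of what that setting does (the table path is a made-up example); it only shapes how input files are split into read partitions, so it doesn't cap the size of the files being written:

spark.conf.set("spark.sql.files.maxPartitionBytes", str(1024 * 1024 * 1024))  # 1GB read splits

# Reads are now split into ~1GB input partitions, but the number and size of
# the files written out still depend on the partition count at write time.
df = spark.read.format("delta").load("s3a://my-bucket/my-delta-table")  # hypothetical path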

Any suggestions?

I understand your problem and what you are trying to do, but I am not sure what the output of your current solution is. If the partitions are still not roughly 1 GB, you may try replacing coalesce with repartition. Coalesce does not guarantee that the resulting partitions are equal in size, so your formula may not work. If you know how many partitions you need on output, use repartition(coalesce_number) and it should create equally sized partitions via round-robin distribution.
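A minimal sketch of that suggestion, reusing the df, define_coalesce and delta_table names from the question:

# repartition performs a full shuffle and distributes rows round-robin,
# so the resulting partitions (and output files) end up roughly equal in size,
# unlike coalesce, which only merges existing partitions.
coalesce_number = define_coalesce(delta_table)  # helper from the question

df.repartition(coalesce_number) \
  .write.format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .save(f"s3a://{delta_table}")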

If the problem is with the function that calculates the dataset size (and therefore the number of partitions), I know two solutions (both sketched after the list below):

  1. You can cache the dataset and then take its size from the statistics. Of course this may be problematic and you have to spend some resources to do that. Something similar is done in the first answer here: How spark get the size of a dataframe for broadcast?

  2. You can calculate the count and divide it by the number of records you want in a single partition. The size of a single record depends on your schema, so it may be tricky to estimate, but it is a viable option to try.
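A rough sketch of both options. The paths and the per-row size are illustrative assumptions, and the statistics lookup in option 1 goes through Spark's internal JVM bridge (df._jdf), so it may differ between versions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("s3a://my-bucket/my-delta-table")  # hypothetical path

# Option 1: cache the dataset and read its size from the optimizer statistics.
df.cache().count()  # materialize the cache so the statistics get populated
size_in_bytes = int(df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes().toString())
partitions_from_stats = max(1, round(size_in_bytes / (1024 ** 3)))

# Option 2: count the rows and divide by an estimated number of rows per ~1GB file.
ESTIMATED_ROW_SIZE_BYTES = 500                        # rough guess based on your schema
rows_per_gb = (1024 ** 3) // ESTIMATED_ROW_SIZE_BYTES
partitions_from_count = max(1, round(df.count() / rows_per_gb))

# Either estimate can then drive the repartition before the write.
df.repartition(partitions_from_stats).write.format("delta").mode("overwrite").save("s3a://my-bucket/my-delta-table")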

Finally solved my problem. Since we are using Delta, I had the idea of reading the manifest file to find all the parquet file names. After that, I sum the sizes of the parquet files listed in the manifest, connecting to S3 with boto:

from boto.s3.connection import S3Connection

def define_repartition(delta_table_path):
    # `spark` and `bucket` come from the surrounding job (SparkSession and bucket name)
    conn = S3Connection()
    bk = conn.get_bucket(bucket)

    # the symlink manifest lists the full s3a:// path of every parquet file in the table
    manifest = spark.read.text(f's3a://{delta_table_path}_symlink_format_manifest/manifest')
    parquets = [data[0].replace(f's3a://{bucket}/', '') for data in manifest.select('value').collect()]

    # sum the object sizes of all parquet files and convert bytes to GB
    size = 0
    for parquet in parquets:
        key = bk.lookup(parquet)
        size = size + key.size

    return round(size / 1073741824)
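For anyone on the newer boto3 client rather than the legacy boto S3Connection used above, a minimal sketch of the same size lookup (define_repartition_boto3 is a hypothetical name; spark and the bucket/path values come from your job):

import boto3

def define_repartition_boto3(delta_table_path, bucket):
    s3 = boto3.client('s3')
    manifest = spark.read.text(f's3a://{delta_table_path}_symlink_format_manifest/manifest')
    keys = [row[0].replace(f's3a://{bucket}/', '') for row in manifest.select('value').collect()]

    # head_object returns the object's metadata, including its size in bytes
    size = sum(s3.head_object(Bucket=bucket, Key=key)['ContentLength'] for key in keys)
    return round(size / 1073741824)

The returned number can then feed the repartition before the write, e.g. df.repartition(define_repartition_boto3(delta_table, bucket)).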

Thank you all for the help. Regards from Brazil. :)
