
Is there a way to tell before the write how many files will be created when saving Spark Dataframe as Delta Table in Azure Data Lake Storage Gen1?

I am currently trying to save a Spark Dataframe to Azure Data Lake Storage (ADLS) Gen1. While doing so I receive the following throttling error:

org.apache.spark.SparkException: Job aborted. Caused by: com.microsoft.azure.datalake.store.ADLException: Error creating file /user/DEGI/CLCPM_DATA/fraud_project/policy_risk_motorcar_with_lookups/part-00000-34d88646-3755-488d-af00-ef2e201240c8-c000.snappy.parquet
Operation CREATE failed with HTTP401 : null
Last encountered exception thrown after 2 tries. [HTTP401(null),HTTP401(null)]

I read in the documentation that the throttling occurs due to CREATE limits, which then causes the job to abort. The documentation also gives three reasons why this may happen.

  1. Your application creates a large number of small files.
  2. External applications create a large number of files.
  3. The current limit for the subscription is too low.

While I do not think that my subscription limit is too low, I think it may be the case that my application is creating too many parquet files. Does anyone know how to tell how many files will be created when saving as a table? How can I find out the maximum number of files that I am allowed to create?

The code that I use to create the table looks as follows:

df.write.format("delta").mode("overwrite").saveAsTable("database_name.df", path ='adl://my path to storage')
 

Also, I was able to write a smaller test dataframe without any problems. Plus, the permissions of the folder in ADLS are set correctly.

The error you are getting doesn't look like an issue with the number of files. HTTP 401 means Unauthorized, which points to an authentication or permission problem. Nonetheless:

Spark typically writes one file per non-empty partition, so the number of partitions tells you roughly how many files to expect. What you want to do is repartition your dataframe. There are several repartitioning APIs; to reduce the number of partitions while keeping data redistribution to a minimum, coalesce() is recommended:

df.coalesce(10).write....
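
To answer the "how many files before the write" part, the partition count can be read from the dataframe before writing. The following is a minimal sketch, not a definitive implementation: the helper names `planned_file_count` and `cap_partitions` are made up for illustration, and it assumes a plain write with no `partitionBy` and no `maxRecordsPerFile` setting, either of which can raise the file count. Delta also writes its own transaction log files, which are not counted here.

```python
def planned_file_count(df):
    """Estimate how many data files a plain write of df will create.

    Each write task handles one partition and writes its own part file,
    so the partition count is the number to look at before writing.
    This is an estimate only (see the assumptions in the text above).
    """
    return df.rdd.getNumPartitions()


def cap_partitions(df, max_files):
    """Return df with at most max_files partitions.

    coalesce() merges existing partitions without a full shuffle.
    It can only reduce the partition count, never increase it.
    """
    if planned_file_count(df) > max_files:
        return df.coalesce(max_files)
    return df


# Usage with the dataframe from the question (df is assumed to exist):
# print(planned_file_count(df))
# cap_partitions(df, 10).write.format("delta").mode("overwrite") \
#     .saveAsTable("database_name.df", path='adl://my path to storage')
```

Note that coalescing to very few partitions also reduces write parallelism, so the cap of 10 is only an example value.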

