
AutoLoader with a lot of empty parquet files

I want to process some parquet files (with snappy compression) using AutoLoader in Databricks. A lot of those files are empty or contain just one record. Also, I cannot change how they are created, nor can I compact them.

Here are some of the approaches I tried so far:

  1. I created a Python notebook in Databricks and tried using AutoLoader to load the data. When I run it for a single table/folder, it is processed without a problem. However, when I call that notebook in a for loop for the other tables ( for item in active_tables_metadata: -> dbutils.notebook.run("process_raw", 0, item) , see the sketch after this list), I only get empty folders in the target.
  2. I created a Databricks Workflows Job and called the same notebook for each table/folder (sending the name/path of the table/folder via a parameter). This way, every table/folder was processed.
  3. I used DBX to package the Python scripts into a wheel and used it inside Databricks Workflows Job tasks as entrypoints. This way, I managed to build the same workflow as in point 2 above, but instead of calling a notebook I am calling a Python script (specified in the entrypoint of the task). Unfortunately, this way I only get empty folders in the target.
  4. I copied all the functions used in the DBX Python wheel into a Python notebook in Databricks and ran the notebook for one table/folder. I only got an empty folder in the target.
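
For reference, here is a minimal sketch of the loop from approach 1. The shape of active_tables_metadata and its keys are assumptions, not my exact code; the values are passed to the notebook as its widget arguments.

# Sketch of approach 1: run the ingestion notebook once per table/folder.
# "active_tables_metadata" and its keys are assumptions about the metadata shape.
active_tables_metadata = [
    {"table_name": "table_a", "source_path": "<path-to-source>/table_a"},
    {"table_name": "table_b", "source_path": "<path-to-source>/table_b"},
]

for item in active_tables_metadata:
    # timeout 0 = no timeout; "item" is passed to the notebook as widget arguments
    dbutils.notebook.run("process_raw", 0, item)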

I have set the following AutoLoader configurations:

  • "cloudFiles.tenantId"
  • "cloudFiles.clientId"
  • "cloudFiles.clientSecret"
  • "cloudFiles.resourceGroup"
  • "cloudFiles.subscriptionId"
  • "cloudFiles.format": "parquet"
  • "pathGlobFilter": "*.snappy"
  • "cloudFiles.useNotifications": True
  • "cloudFiles.includeExistingFiles": True
  • "cloudFiles.allowOverwrites": True

I use the following readStream configurations:

df = (
    spark.readStream.format("cloudFiles")
    .options(**CLOUDFILE_CONFIG)
    .option("cloudFiles.format", "parquet")
    .option("pathGlobFilter", "*.snappy")
    .option("recursiveFileLookup", True)
    .schema(schema)
    .option("locale", "de-DE")
    .option("dateFormat", "dd.MM.yyyy")
    .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
    .load(<path-to-source>)
)

And the following writeStream configurations:

(
    df.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", <path_to_checkpoint>)
    .queryName(<processed_table_name>)
    .partitionBy(<partition-key>)
    .option("mergeSchema", True)
    .trigger(once=True)
    .start(<path-to-target>)
)

My preferred solution would be to use DBX, but I don't understand why the job succeeds while I only see empty folders in the target location. This is very strange behavior; my suspicion is that AutoLoader times out at some point after reading only empty files.

PS: the same thing also happens when I use plain parquet Spark streaming instead of AutoLoader.
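
For completeness, the non-AutoLoader variant looks roughly like this (a minimal sketch, assuming the same schema and glob filter; not my exact code):

# Sketch of the plain parquet structured-streaming read mentioned above;
# the option values mirror the AutoLoader version, everything else is an assumption.
df = (
    spark.readStream.format("parquet")
    .schema(schema)  # file-source streams require an explicit schema
    .option("pathGlobFilter", "*.snappy")
    .option("recursiveFileLookup", True)
    .load("<path-to-source>")
)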

Do you know of any reason why this is happening and how I can overcome this issue?

Are you specifying the schema of the streaming read? (Sorry, can't add comments yet)
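
For example, a minimal sketch of an explicitly defined schema passed to .schema(...) (the field names here are made up):

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Example schema for the streaming read; the fields are placeholders, not the real ones.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("created_at", TimestampType(), True),
])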
