
AutoLoader with a lot of empty parquet files

I want to process some parquet files (with snappy compression) using AutoLoader in Databricks. A lot of those files are empty or contain just one record. Also, I cannot change how they are created, nor can I compact them.

Here are some of the approaches I tried so far:

  1. I created a Python notebook in Databricks and tried using AutoLoader to load the data. When I run it for a single table/folder, it is processed without a problem. However, when I call that notebook in a for loop for the other tables ( for item in active_tables_metadata: -> dbutils.notebook.run("process_raw", 0, item) , see the sketch after this list), I only get empty folders in the target.
  2. I created a Databricks Workflows Job and called the same notebook for each table/folder (sending the name/path of the table/folder via a parameter). This way, every table/folder was processed.
  3. I used DBX to package the Python scripts into a wheel and used it inside Databricks Workflows Job tasks as entrypoints. This way, I managed to build the same workflow as in point 2 above, but instead of calling a notebook I am calling a Python script (specified in the entrypoint of the task). Unfortunately, this way I only get empty folders in the target.
  4. I copied all the functions used in the DBX Python wheel into a Python notebook in Databricks and ran the notebook for one table/folder. I only got an empty folder in the target.
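
For reference, here is a minimal sketch of the loop from approach 1. The shape of active_tables_metadata and its keys are assumptions, not my exact code; the values are passed to the notebook as its widget arguments.

# Sketch of approach 1: run the ingestion notebook once per table/folder.
# "active_tables_metadata" and its keys are assumptions about the metadata shape.
active_tables_metadata = [
    {"table_name": "table_a", "source_path": "<path-to-source>/table_a"},
    {"table_name": "table_b", "source_path": "<path-to-source>/table_b"},
]

for item in active_tables_metadata:
    # timeout 0 = no timeout; "item" is passed to the notebook as widget arguments
    dbutils.notebook.run("process_raw", 0, item)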

I have set the following AutoLoader configurations:

  • "cloudFiles.tenantId"
  • "cloudFiles.clientId"
  • "cloudFiles.clientSecret"
  • "cloudFiles.resourceGroup"
  • "cloudFiles.subscriptionId"
  • "cloudFiles.format": "parquet"
  • "pathGlobFilter": "*.snappy"
  • "cloudFiles.useNotifications": True
  • "cloudFiles.includeExistingFiles": True
  • "cloudFiles.allowOverwrites": True

I use the following readStream configurations:

df = (
    spark.readStream.format("cloudFiles")
    .options(**CLOUDFILE_CONFIG)
    .option("cloudFiles.format", "parquet")
    .option("pathGlobFilter", "*.snappy")
    .option("recursiveFileLookup", True)
    .schema(schema)
    .option("locale", "de-DE")
    .option("dateFormat", "dd.MM.yyyy")
    .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
    .load(<path-to-source>)
)

And the following writeStream configurations:

(
    df.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", <path_to_checkpoint>)
    .queryName(<processed_table_name>)
    .partitionBy(<partition-key>)
    .option("mergeSchema", True)
    .trigger(once=True)
    .start(<path-to-target>)
)

My preferred solution would be to use DBX, but I don't understand why the job succeeds while I only see empty folders in the target location. This is very strange behavior; my suspicion is that AutoLoader times out at some point after reading only empty files.

PS: the same thing also happens when I use plain parquet Spark streaming instead of AutoLoader.
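
For completeness, the non-AutoLoader variant looks roughly like this (a minimal sketch, assuming the same schema and glob filter; not my exact code):

# Sketch of the plain parquet structured-streaming read mentioned above;
# the option values mirror the AutoLoader version, everything else is an assumption.
df = (
    spark.readStream.format("parquet")
    .schema(schema)  # file-source streams require an explicit schema
    .option("pathGlobFilter", "*.snappy")
    .option("recursiveFileLookup", True)
    .load("<path-to-source>")
)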

Do you know of any reason why this is happening and how I can overcome this issue?

Are you specifying the schema of the streaming read? (Sorry, can't add comments yet)
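
For example, a minimal sketch of an explicitly defined schema passed to .schema(...) (the field names here are made up):

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Example schema for the streaming read; the fields are placeholders, not the real ones.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("created_at", TimestampType(), True),
])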
