
What is the easiest and best method to unzip files in Azure Data Lake Gen1 without moving them to the Azure Databricks file system?

What is the best method to unzip files in Azure Data Lake Gen1 without moving them to the Azure Databricks file system (DBFS)? Currently we are using Azure Databricks for compute and ADLS for storage, and we have a restriction against moving the data into DBFS.

I have already mounted ADLS in DBFS but am not sure how to proceed.

Unfortunately, zip files are not supported in Databricks, because Hadoop has no compression codec for zip. While text files in GZip, BZip2, and other supported compression formats are automatically decompressed by Spark as long as they have the right file extension, you must perform additional steps to read zip files. The sample in the Databricks documentation performs the unzip on the driver node using the OS-level unzip utility (Ubuntu).
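Since you already have ADLS mounted, a driver-side extraction can also be done with Python's standard `zipfile` module instead of the shell `unzip` utility. Below is a minimal sketch; the mount paths (`/dbfs/mnt/adls/...`) are assumptions you would replace with your own mount point, and note that all extraction happens on the driver node, not in parallel across the cluster:

```python
import zipfile
from pathlib import Path

# Hypothetical mount paths -- replace with your actual ADLS mount in DBFS.
SOURCE_DIR = Path("/dbfs/mnt/adls/raw/zipped")
TARGET_DIR = Path("/dbfs/mnt/adls/raw/unzipped")

def unzip_all(source_dir: Path, target_dir: Path) -> list:
    """Extract every .zip archive under source_dir into target_dir.

    Runs entirely on the driver node; returns the names of the
    extracted members so the caller can verify what was unpacked.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(source_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target_dir)
            extracted.extend(zf.namelist())
    return extracted

# files = unzip_all(SOURCE_DIR, TARGET_DIR)
```

After extraction, the decompressed files can be read back with Spark as usual (e.g. `spark.read.csv("dbfs:/mnt/adls/raw/unzipped/")`). For large archives, keep in mind the driver needs enough local disk and the work is single-node.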

If your data source can't provide the data in a compression codec supported by Spark, the best method is an Azure Data Factory copy activity. Azure Data Factory supports more compression codecs, including zip.

Type property definition for the source would look like this:

"typeProperties": {
    "compression": {
        "type": "ZipDeflate",
        "level": "Optimal"
    }
}
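For context, the fragment above sits inside a dataset definition. A sketch of how such a source dataset might look for ADLS Gen1 is below; the dataset and linked-service names are placeholders, and the exact schema may differ depending on your Data Factory version, so treat this as illustrative rather than a copy-paste definition:

```json
{
    "name": "ZippedSourceDataset",
    "properties": {
        "type": "AzureDataLakeStoreFile",
        "linkedServiceName": {
            "referenceName": "AdlsGen1LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "raw/zipped",
            "format": { "type": "TextFormat" },
            "compression": {
                "type": "ZipDeflate",
                "level": "Optimal"
            }
        }
    }
}
```

Pointing a copy activity at a dataset like this decompresses the zip during the copy, so the files land in the sink already unzipped.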

You can also use Azure Data Factory to orchestrate your Databricks pipelines with the Databricks activities.

