
What is the easiest and best method to unzip files in Azure Data Lake Gen1 without moving the files to the Azure Databricks file system?

What is the best method to unzip files in Azure Data Lake Gen1 without moving them to the Azure Databricks file system? Currently, we use Azure Databricks for compute and ADLS for storage, and we have a restriction against moving the data into DBFS.

We have already mounted ADLS in DBFS and are not sure how to proceed.
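For reference, a minimal sketch of what such a mount might look like from a Databricks notebook, assuming service-principal (OAuth) access to the ADLS Gen1 store; the store name, application id, tenant id, and secret scope are placeholders:

    # Hypothetical ADLS Gen1 mount; all credential values below are placeholders.
    configs = {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": "<application-id>",
        "fs.adl.oauth2.credential": dbutils.secrets.get(scope="adls", key="client-secret"),
        "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="adl://<datalake-store-name>.azuredatalakestore.net/",
        mount_point="/mnt/adls",
        extra_configs=configs,
    )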

Unfortunately, zip files are not supported in Databricks; the reason is that Hadoop does not support zip as a compression codec. While a text file in GZip, BZip2, or another supported compression format can be configured to be decompressed automatically in Spark as long as it has the right file extension, you must perform additional steps to read zip files. The sample in the Databricks documentation does the unzip on the driver node, using unzip at the OS level (Ubuntu).
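As a rough illustration of that driver-node approach (using Python's zipfile module rather than the shell unzip shown in the docs), the following sketch extracts an archive through the mount, so the files stay in ADLS rather than in DBFS proper; the paths and archive name are assumptions:

    # Minimal sketch: the unzip runs only on the driver, so it does not scale to very large archives.
    import zipfile

    zip_path = "/dbfs/mnt/adls/raw/data.zip"         # archive in ADLS Gen1, seen through the local FUSE mount
    extract_dir = "/dbfs/mnt/adls/raw/extracted/"    # extracted files are written back to the same mount

    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(extract_dir)

    # Once extracted, Spark can read the files from the mount as usual:
    df = spark.read.option("header", "true").csv("/mnt/adls/raw/extracted/")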

If your data source can't provide the data in a compression codec supported by Spark, the best method is to use an Azure Data Factory copy activity. Azure Data Factory supports more compression codecs, including zip.

The type property definition for the source would look like this:

"typeProperties": {
        "compression": {
            "type": "ZipDeflate",
            "level": "Optimal"
        },

You can also use Azure Data Factory to orchestrate your Databricks pipelines with the Databricks activities.

