What is the best method to unzip files in Azure Data Lake Gen1 without moving them to the Azure Databricks file system?
Currently, we use Azure Databricks for compute and ADLS for storage. We have a restriction against moving the data into DBFS.
We have already mounted ADLS in DBFS and are not sure how to proceed.
Unfortunately, zip files are not supported in Databricks, because Hadoop has no support for zip as a compression codec. While text files in GZip, BZip2, and other supported compression formats can be configured to be automatically decompressed in Spark as long as they have the right file extension, you must perform additional steps to read zip files. The sample in the Databricks documentation does the unzip on the driver node, using `unzip` at the OS level (Ubuntu).
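Since ADLS is already mounted, the same driver-node approach can also be done in Python with the standard-library `zipfile` module instead of the OS-level `unzip`. The sketch below is a minimal illustration: the mount point `/dbfs/mnt/adls` and the file names mentioned in the comments are assumptions, and the snippet builds a temporary archive so it can run anywhere.

```python
import os
import tempfile
import zipfile

# In Databricks, a mounted ADLS path is visible to the driver under /dbfs/mnt/...
# (e.g. src = "/dbfs/mnt/adls/raw/archive.zip" -- a hypothetical path). Here we
# use a local temporary directory so the sketch is self-contained.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "archive.zip")
out_dir = os.path.join(workdir, "extracted")

# Create a sample zip standing in for the file already sitting in ADLS.
with zipfile.ZipFile(src, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data/part-0000.csv", "id,value\n1,foo\n2,bar\n")

# Unzip on the driver node: list the entries, then extract them.
with zipfile.ZipFile(src) as zf:
    print(zf.namelist())  # entries inside the archive
    zf.extractall(out_dir)

# The extracted text file can now be read normally
# (in Databricks, e.g. with spark.read.csv on the mounted path).
with open(os.path.join(out_dir, "data", "part-0000.csv")) as f:
    print(f.read())
```

Because a mounted ADLS store is exposed to the driver through the `/dbfs/mnt/...` FUSE path, extracting next to the source file should keep the unzipped data in ADLS rather than copying it into DBFS-backed storage; note, however, that `extractall` runs only on the driver, so this does not parallelize across the cluster.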
If your data source can't provide the data in a compression codec supported by Spark, the best method is to use an Azure Data Factory copy activity. Azure Data Factory supports more compression codecs, including zip.
The type property definition for the source would look like this:
"typeProperties": {
    "compression": {
        "type": "ZipDeflate",
        "level": "Optimal"
    }
}
You can also use Azure Data Factory to orchestrate your Databricks pipelines with the Databricks activities.