
Azure Databricks writing a file into Azure Data Lake Gen 2

I have an Azure Data Lake Gen1 and an Azure Data Lake Gen2 (Blob Storage with hierarchical namespace), and I am trying to create a Databricks notebook (Scala) that reads 2 files and writes a new file back into the Data Lake. In both Gen1 and Gen2 I am experiencing the same issue: the file name of the output csv I specified is getting saved as a directory, and inside that directory it writes 4 files: "committed", "started", "_SUCCESS", and "part-00000-tid-

Screenshot of the Databricks output

For the life of me, I can't figure out why it's doing this instead of actually saving the csv to that location. Here's an example of the code I've written. If I do a .show() on the df_join dataframe, it outputs the correct-looking results, but the .write is not working correctly.

val df_names = spark.read.option("header", "true").csv("/mnt/datalake/raw/names.csv")
val df_addresses = spark.read.option("header", "true").csv("/mnt/datalake/raw/addresses.csv")

val df_join = df_names.join(df_addresses, df_names.col("pk") === df_addresses.col("namepk"))


df_join.write
.format("com.databricks.spark.csv")
.option("header", "true")
.mode("overwrite")
.save("/mnt/datalake/reports/testoutput.csv")

The reason it's creating a directory with multiple files is that each partition is saved and written to the data lake individually. To save a single output file you need to repartition your dataframe.

Let's use the dataframe API

confKey = "fs.azure.account.key.srcAcctName.blob.core.windows.net"
secretKey = "==" #your secret key
spark.conf.set(confKey,secretKey)
blobUrl = 'wasbs://MyContainerName@srcAcctName.blob.core.windows.net'

Coalesce your dataframe

# Like the config snippet above, this is PySpark; use the blobUrl variable
# defined there (not the literal string "blobUrl")
(df_join.coalesce(1)
    .write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .mode("overwrite")
    .save(blobUrl + "/reports/"))

Change the file name

files = dbutils.fs.ls(blobUrl + '/reports/')
output_file = [x for x in files if x.name.startswith("part-")]
dbutils.fs.mv(output_file[0].path, "%s/reports/testoutput.csv" % (blobUrl))
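The snippets above are PySpark; since the question is in Scala, a minimal Scala sketch of the same coalesce-then-rename idea might look like the following (the temporary directory name and the /mnt/datalake mount path are assumptions taken from the question):

// Write to a temporary directory with a single partition, then move the part file
val tmpDir = "/mnt/datalake/reports/_tmp_testoutput"

df_join.coalesce(1)              // collapse to a single partition -> single part file
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv(tmpDir)

// Find the part file Spark produced and move it to the final name
val partFile = dbutils.fs.ls(tmpDir).filter(_.name.startsWith("part-")).head.path
dbutils.fs.mv(partFile, "/mnt/datalake/reports/testoutput.csv")
dbutils.fs.rm(tmpDir, true)      // clean up the temporary directory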

If I understand your needs correctly, you just want to write the Spark DataFrame data to a single csv file named testoutput.csv in Azure Data Lake, not a directory named testoutput.csv containing some partition files.

So you cannot realize this directly via Spark functions like DataFrameWriter.save, because the dataframe writer actually writes data to HDFS-style storage backed by Azure Data Lake, and HDFS persists the data as a directory with the name you gave plus some partition files. Please see documents about HDFS such as The Hadoop FileSystem API Definition to learn more.

Then, per my experience, you can try to use the Azure Data Lake SDK for Java within your Scala program to write data from the DataFrame directly to Azure Data Lake as a single file. You can refer to some samples at https://github.com/Azure-Samples?utf8=%E2%9C%93&q=data-lake&type=&language=java ; a rough sketch of the idea follows.
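For example, a very rough Scala sketch using the Azure Data Lake Store (Gen1) Java SDK could look like the following. The token endpoint, client id/secret and account name are placeholders, and collecting the rows on the driver is an assumption that only works for data small enough to fit in driver memory:

import com.microsoft.azure.datalake.store.{ADLStoreClient, IfExists}
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider
import java.io.PrintStream

// Hypothetical service-principal credentials and account name
val tokenProvider = new ClientCredsTokenProvider(
  "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
  "<client-id>",
  "<client-secret>")
val adlsClient = ADLStoreClient.createClient("<account>.azuredatalakestore.net", tokenProvider)

// Render the (small) DataFrame as CSV text on the driver
val header = df_join.columns.mkString(",")
val rows   = df_join.collect().map(_.mkString(","))

// Create (or overwrite) a single file in the lake and write the CSV content
val out = new PrintStream(adlsClient.createFile("/reports/testoutput.csv", IfExists.OVERWRITE))
out.println(header)
rows.foreach(out.println)
out.close()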

Try this:

# convert the Spark DataFrame to pandas on the driver, then write a single csv to the DBFS mount
df_join.toPandas().to_csv('/dbfs/mnt/....../df.csv', sep=',', header=True, index=False)
