Azure Databricks writing a file into Azure Data Lake Gen 2
I have an Azure Data Lake Gen1 and an Azure Data Lake Gen2 (Blob Storage with hierarchical namespace), and I am trying to create a Databricks notebook (Scala) that reads two files and writes a new file back into the Data Lake. In both Gen1 and Gen2 I am experiencing the same issue: the file name of the output csv I specify is getting saved as a directory, and inside that directory Spark writes four files: _committed, _started, _SUCCESS, and part-00000-tid-
For the life of me, I can't figure out why it's doing this instead of actually saving the csv to that location. Here's an example of the code I've written. If I do a .show() on the df_join dataframe, it outputs correct-looking results, but the .write is not working correctly.
val df_names = spark.read.option("header", "true").csv("/mnt/datalake/raw/names.csv")
val df_addresses = spark.read.option("header", "true").csv("/mnt/datalake/raw/addresses.csv")
val df_join = df_names.join(df_addresses, df_names.col("pk") === df_addresses.col("namepk"))
df_join.write
.format("com.databricks.spark.csv")
.option("header", "true")
.mode("overwrite")
.save("/mnt/datalake/reports/testoutput.csv")
The reason it's creating a directory with multiple files is that each partition is saved and written to the data lake individually. To save a single output file you need to repartition (or coalesce) your dataframe.
Let's use the dataframe API.
confKey = "fs.azure.account.key.srcAcctName.blob.core.windows.net"
secretKey = "==" #your secret key
spark.conf.set(confKey,secretKey)
blobUrl = 'wasbs://MyContainerName@srcAcctName.blob.core.windows.net'
Coalesce your dataframe
df_join.coalesce(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.mode("overwrite")
.save(blobUrl + "/reports/")
Change the file name
files = dbutils.fs.ls(blobUrl + '/reports/')
output_file = [x for x in files if x.name.startswith("part-")]
dbutils.fs.mv(output_file[0].path, "%s/reports/testoutput.csv" % (blobUrl))
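The list-and-rename step can be sketched with the local filesystem standing in for DBFS. This is only an illustration: `dbutils` exists only inside Databricks, so `os`/`shutil` play its role here, and the directory contents and file names are made up to mimic what Spark writes.

```python
import os
import shutil
import tempfile

# Simulate the directory layout Spark produces: marker files plus one part file.
reports = tempfile.mkdtemp()
for name in ("_SUCCESS", "_committed_123", "_started_123",
             "part-00000-tid-123.csv"):
    with open(os.path.join(reports, name), "w") as f:
        if name.startswith("part-"):
            f.write("pk,name,namepk,address\n1,Ann,1,Main St\n")

# The equivalent of dbutils.fs.ls + dbutils.fs.mv:
# find the single part file and rename it to the name you actually wanted.
part_files = [f for f in os.listdir(reports) if f.startswith("part-")]
shutil.move(os.path.join(reports, part_files[0]),
            os.path.join(reports, "testoutput.csv"))

print(sorted(os.listdir(reports)))
```

After the move, `testoutput.csv` sits alongside the marker files; you can delete those too if you want a clean directory.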
If I understand your needs correctly, you just want to write the Spark DataFrame data into Azure Data Lake as a single csv file named testoutput.csv, not as a directory named testoutput.csv containing partition files.
So you cannot achieve that directly with Spark functions like DataFrameWriter.save, because the dataframe writer actually writes data through the HDFS API backed by Azure Data Lake, and HDFS persists the data as a directory with the name you gave plus some partition files. Please see documents about HDFS such as The Hadoop FileSystem API Definition to learn more.
Then, in my experience, you can try to use the Azure Data Lake SDK for Java within your Scala program to write data from the DataFrame directly to Azure Data Lake as a single file. You can refer to the samples at https://github.com/Azure-Samples?utf8=%E2%9C%93&q=data-lake&type=&language=java .
Try this (note that to_csv is a pandas method, not part of the Spark DataFrame API, so convert first with toPandas(), which collects all rows to the driver):
df_join.toPandas().to_csv('/dbfs/mnt/....../df.csv', sep=',', header=True, index=False)
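The reason this produces a single file is that pandas' to_csv writes one plain file, unlike Spark's DataFrameWriter, which writes a directory of part files. A minimal local sketch of the same write using plain pandas; the sample data and the df.csv output path are made up for illustration:

```python
import pandas as pd

# Stand-ins for the question's two source files (hypothetical sample data).
df_names = pd.DataFrame({"pk": [1, 2], "name": ["Ann", "Bob"]})
df_addresses = pd.DataFrame({"namepk": [1, 2],
                             "address": ["Main St", "Oak Ave"]})

# pandas merge plays the role of the Spark join on pk == namepk;
# to_csv then writes exactly one file.
df_join = df_names.merge(df_addresses, left_on="pk", right_on="namepk")
df_join.to_csv("df.csv", sep=",", header=True, index=False)
```

Keep in mind this path only works for data small enough to fit in the driver's memory once collected.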