How to create a CSV file from a DataFrame in a Spark application (Scala)?
My question is not new, but I would like to understand how to do this step by step.
In my Spark application I create a DataFrame; let's call it df. Spark version: 2.4.0
val df: DataFrame = Seq(
("Alex", "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT"),
("Bob", "2018-02-01 00:00:00", "2018-02-05 00:00:00", "IN"),
("Mark", "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
("Mark", "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT"),
("Meggy", "2018-02-01 00:00:00", "2018-02-01 00:00:00", "OUT")
).toDF("NAME", "START_DATE", "END_DATE", "STATUS")
How can I create a .csv file from this DataFrame and put the CSV file into a specific folder on the server?
For example, is the following code correct? I have noticed that some people use coalesce or repartition for this purpose, but I don't know which one is better in which situation.
union.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("/home/reports/")
When I try the following code, it throws an ERROR:
org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/home/reports/_temporary/0":hdfs:hdfs:drwxr-xr-x
I run the Spark application as the root user. The reports folder was created by the root user with the following command:
mkdir -m 777 reports
It seems that only the hdfs user is allowed to write there.
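For reference, the inode in the error message (`/home/reports`, owned by `hdfs:hdfs`) lives in HDFS, not on the local disk, so a local `mkdir -m 777 reports` has no effect on it. A hedged shell sketch of granting the root user access on the HDFS side (assumes the `hdfs` CLI and an `hdfs` superuser account exist; it falls back to a message when the CLI is absent):

```shell
# The error's "/home/reports" is an HDFS path, so local-filesystem
# permissions do not apply to it.
if command -v hdfs >/dev/null 2>&1; then
  # Run as the HDFS superuser; adjust path and owner for your cluster.
  sudo -u hdfs hdfs dfs -chown root:root /home/reports
else
  echo "hdfs CLI not available on this machine"
fi
```

The alternative, shown in the answer below, is to bypass HDFS entirely and write to the local file system with a `file://` path.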
I believe you are confused about how Spark behaves; I would recommend you first read the official documentation and/or some tutorials.
Nevertheless, I hope this answers your question.
This code saves the DataFrame as a SINGLE CSV file on the local file system.
It was tested with Spark 2.4.0 and Scala 2.12.8 on an Ubuntu 18.04 laptop.
import org.apache.spark.sql.SparkSession
val spark =
SparkSession
.builder
.master("local[*]")
.appName("CSV Writer Test")
.getOrCreate()
import spark.implicits._
val df =
Seq(
("Alex", "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT"),
("Bob", "2018-02-01 00:00:00", "2018-02-05 00:00:00", "IN"),
("Mark", "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
("Mark", "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT"),
("Meggy", "2018-02-01 00:00:00", "2018-02-01 00:00:00", "OUT")
).toDF("NAME", "START_DATE", "END_DATE", "STATUS")
df.printSchema
// root
// |-- NAME: string (nullable = true)
// |-- START_DATE: string (nullable = true)
// |-- END_DATE: string (nullable = true)
// |-- STATUS: string (nullable = true)
df.coalesce(numPartitions = 1) // coalesce(1) merges the existing partitions without a full shuffle;
                               // repartition(1) would shuffle all the data, so coalesce is usually
                               // cheaper when the goal is a single output file.
  .write
  .option(key = "header", value = "true")
  .option(key = "sep", value = ",")
  .option(key = "encoding", value = "UTF-8")
  .option(key = "compression", value = "none")
  .mode(saveMode = "OVERWRITE")
  .csv(path = "file:///home/balmungsan/dailyReport/") // Change the path. Note there are 3 /, the first two are for the file protocol, the third one is for the root folder.
spark.stop()
Now, let's check the saved files.
balmungsan@BalmungSan:dailyReport $ pwd
/home/balmungsan/dailyReport
balmungsan@BalmungSan:dailyReport $ ls
part-00000-53a11fca-7112-497c-bee4-984d4ea8bbdd-c000.csv _SUCCESS
balmungsan@BalmungSan:dailyReport $ cat part-00000-53a11fca-7112-497c-bee4-984d4ea8bbdd-c000.csv
NAME,START_DATE,END_DATE,STATUS
Alex,2018-01-01 00:00:00,2018-02-01 00:00:00,OUT
Bob,2018-02-01 00:00:00,2018-02-05 00:00:00,IN
Mark,2018-02-01 00:00:00,2018-03-01 00:00:00,IN
Mark,2018-05-01 00:00:00,2018-08-01 00:00:00,OUT
Meggy,2018-02-01 00:00:00,2018-02-01 00:00:00,OUT
The _SUCCESS file exists to signal that the write succeeded.
Note the file:// protocol, used to save to the local file system instead of HDFS.
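If a downstream consumer needs a stable file name, the single part file can simply be renamed after the write. A small shell sketch; the directory and file names below are illustrative stand-ins for the listing above:

```shell
# Simulate the output directory from the example, then rename the part file.
dir=$(mktemp -d)   # stand-in for /home/balmungsan/dailyReport
touch "$dir/part-00000-example-c000.csv" "$dir/_SUCCESS"

# There is exactly one part file because of coalesce(1), so the glob is safe.
mv "$dir"/part-*.csv "$dir/dailyReport.csv"

ls "$dir"   # now contains dailyReport.csv and _SUCCESS
```

This works because coalesce(1) guarantees a single part-* file; with more partitions the glob would match several files and the rename would fail.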