How to save a dataframe into multiple files based on unique columns in spark-scala
I have an inputDf that needs to be split by the columns origin and destination, saving each unique combination to a separate CSV file.
(Using Spark 2.4.4)
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().appName("Test").getOrCreate()
val inputRdd: RDD[(String, String, String, String, String, String)] = spark.sparkContext.parallelize(Seq(
("City1", "City2", "Sedan", "AE1235", "80", "2020-02-01"),
("City2", "City3", "Hatchback", "XY5434", "100", "2020-02-01"),
("City3", "City1", "Sedan", "YU3456", "120", "2020-02-01"),
("City3", "City2", "Sedan", "BV3555", "105", "2020-02-01"),
("City2", "City1", "SUV", "PO1234", "75", "2020-02-01"),
("City1", "City3", "SUV", "TY4123", "125", "2020-02-01"),
("City1", "City2", "Hatchback", "VI3415", "85", "2020-02-01"),
("City1", "City2", "SUV", "VF1244", "84", "2020-02-01"),
("City3", "City1", "Sedan", "EW1248", "124", "2020-02-01"),
("City2", "City1", "Hatchback", "GE576", "82", "2020-02-01"),
("City3", "City2", "Sedan", "PK2144", "104", "2020-02-01"),
("City3", "City1", "Hatchback", "PJ1244", "118", "2020-02-01"),
("City3", "City2", "SUV", "WF0976", "98", "2020-02-01"),
("City1", "City2", "Sedan", "WE876", "78", "2020-02-01"),
("City2", "City1", "Hatchback", "AB5467", "80", "2020-02-01")
))
val inputDf = spark.createDataFrame(inputRdd).toDF("origin", "destination", "vehicleType", "uniqueId", "distanceTravelled", "date")
Sample output:
.csv file 1:
origin,destination,vehicleType,uniqueId,distanceTravelled,date
City1,City2,Sedan,AE1235,80,2020-02-01
City1,City2,Hatchback,VI3415,85,2020-02-01
City1,City2,SUV,VF1244,84,2020-02-01
City1,City2,Sedan,WE876,78,2020-02-01
.csv file 2:
origin,destination,vehicleType,uniqueId,distanceTravelled,date
City3,City1,Sedan,YU3456,120,2020-02-01
City3,City1,Sedan,EW1248,124,2020-02-01
City3,City1,Hatchback,PJ1244,118,2020-02-01
.csv file 3:
origin,destination,vehicleType,uniqueId,distanceTravelled,date
City2,City1,SUV,PO1234,75,2020-02-01
City2,City1,Hatchback,GE576,82,2020-02-01
City2,City1,Hatchback,AB5467,80,2020-02-01
So far, I have tried collecting the unique combinations into tuples, then looping over them with foreach, filtering inputDf for each combination and saving the filtered dataframe to CSV:
val tuple = inputDf.groupBy("origin","destination").count()
.select("origin","destination").rdd.map(r => (r(0),r(1))).collect
tuple.foreach(row => {
val origin = row._1
val destination = row._2
val dataToWrite = inputDf.filter(inputDf.col("origin").equalTo(origin) && inputDf.col("destination").equalTo(destination))
dataToWrite.repartition(1).write.mode("overwrite").format("csv").option("header", "true").save("/path/to/output/folder/" + origin + "-" + destination + ".csv")
})
This approach takes a long time, since it filters inputDf once per combination and the number of unique combinations is very large. What is the best way to do this?
Edit: Each inputDf will only have data for a single date.
The output should contain one file per date, like:
/output/City1-City2/2020-02-01.csv
/output/City1-City2/2020-02-02.csv
/output/City1-City2/2020-02-03.csv
/output/City3-City1/2020-02-01.csv
/output/City3-City1/2020-02-02.csv
... and so on
You can use partitionBy to split the data into separate CSV files based on your combinations. I used coalesce to keep all the data in a single CSV file per partition, which is not recommended if you have a large amount of data. Run the code below; it writes every combination to its own CSV file.
scala> df.show()
+------+-----------+-----------+--------+-----------------+----------+
|origin|destination|vehicleType|uniqueId|distanceTravelled| date|
+------+-----------+-----------+--------+-----------------+----------+
| City1| City2| Sedan| AE1235| 80|2020-02-01|
| City2| City3| Hatchback| XY5434| 100|2020-02-01|
| City3| City1| Sedan| YU3456| 120|2020-02-01|
| City3| City2| Sedan| BV3555| 105|2020-02-01|
| City2| City1| SUV| PO1234| 75|2020-02-01|
| City1| City3| SUV| TY4123| 125|2020-02-01|
| City1| City2| Hatchback| VI3415| 85|2020-02-02|
| City1| City2| SUV| VF1244| 84|2020-02-02|
| City3| City1| Sedan| EW1248| 124|2020-02-02|
| City2| City1| Hatchback| GE576| 82|2020-02-02|
| City3| City2| Sedan| PK2144| 104|2020-02-02|
| City3| City1| Hatchback| PJ1244| 118|2020-02-02|
| City3| City2| SUV| WF0976| 98|2020-02-02|
| City1| City2| Sedan| WE876| 78|2020-02-02|
| City2| City1| Hatchback| AB5467| 80|2020-02-02|
+------+-----------+-----------+--------+-----------------+----------+
scala> val df1 = df.withColumn("combination", concat(col("origin") ,lit("-"), col("destination")))
scala> df1.coalesce(1).write.partitionBy("combination","date").format("csv").option("header", "true").mode("overwrite").save("/stackOut/")
The output will be laid out using Spark's default column=value partition directories, similar to:
/stackOut/combination=City1-City2/date=2020-02-01/part-*.csv
/stackOut/combination=City1-City2/date=2020-02-02/part-*.csv
/stackOut/combination=City2-City1/date=2020-02-01/part-*.csv
...
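Note that partitionBy produces Spark's column=value directory convention rather than the exact /output/City1-City2/2020-02-01.csv paths the question asks for. As a minimal sketch (the helper object and its names are hypothetical, not part of the original answer), a small post-processing step can compute the flat target path for each partition directory pair; the actual file move would then be done with Hadoop's org.apache.hadoop.fs.FileSystem.rename:

```scala
// Hypothetical helper (not from the original answer): maps Spark's default
// "column=value" partition directory names to the flat
// "<origin>-<destination>/<date>.csv" layout requested in the question.
object PartitionPaths {
  // Extracts the value from a "column=value" directory name,
  // e.g. "combination=City1-City2" -> "City1-City2".
  private def partitionValue(dirName: String): String =
    dirName.substring(dirName.indexOf('=') + 1)

  // ("combination=City1-City2", "date=2020-02-01")
  //   -> "City1-City2/2020-02-01.csv"
  def targetPath(combinationDir: String, dateDir: String): String =
    s"${partitionValue(combinationDir)}/${partitionValue(dateDir)}.csv"
}
```

Because coalesce(1) leaves exactly one part file per date directory, you could list the output root after the write, compute targetPath for each nested directory pair, and rename that single part file accordingly.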