
Using Python and Scala to rename output file from Azure/Databricks

I am trying to rename the output file from the Python default of random characters to a more sensible name that includes a date/time stamp, so each filename is unique.

Here is the code I have used. Python sends the file to the shared drive, but with an unusable name. I tried to find some way of renaming the file within the Python code, but failed. I then started looking at Scala, which gets close to what I want but not quite all the way. It appears to run fine, yet no output file is produced (probably an issue with the developer, i.e. me!!).

Any help would be appreciated.

%python
try:
  # Replace with your SQL
  dfsql = spark.sql("select * from dbsmets1mig02_technical_build.tbl_Temp_Output_CS_Notes_Final order by record1")
  if dfsql.count() == 0:
    print("No data rows")
  else:
    # coalesce(1) produces a single part file, but Spark chooses its name
    dfsql.coalesce(1).write.format("com.databricks.spark.csv") \
      .option("quote", "").option("header", "false").option("delimiter", "|") \
      .mode("overwrite") \
      .save("/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/notes/outbound/")
except Exception as e:
  print("Exception occurred:", e)

%scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

def merge(srcPath: String, dstPath: String): Unit =  {
   val hadoopConfig = new Configuration()
   val hdfs = FileSystem.get(hadoopConfig)
   FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null) 
   // the "true" setting deletes the source files once they are merged into the new outputfile
}
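
// Caveat: FileUtil.copyMerge was removed in Hadoop 3.x, so this helper will not
// compile on newer runtimes; renaming the single part file instead (as in the
// answer further down) avoids copyMerge entirely.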


// replace outputfile, filename and fileext values with preferred values
val dfsql = sqlContext.sql("select * from dbsmets1mig02_technical_build.tbl_Temp_Output_CS_Notes_Final order by record1") //SQL here

val outputfile = "/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/notes/outbound"  //PATH names here

var filename = "CS_Notes"  //Filename here
var fileext = ".csv"

//val dateFormat = "yyyyMMdd_HHmm"
val dateFormat = "dd-MM-yyyy_HH-mm-ss"
val dateValue = spark.range(1).select(date_format(current_timestamp, dateFormat)).as[String].first
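// A plain JVM alternative that avoids the Spark round-trip (assuming a driver-side
// timestamp is acceptable):
// val dateValue = java.time.LocalDateTime.now.format(java.time.format.DateTimeFormatter.ofPattern(dateFormat))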

filename = filename + "_" + dateValue
var outputDirectory = outputfile + "/tmp_" + filename       // temporary directory Spark writes its part files into
var mergedFileName  = outputfile + "/" + filename + fileext // final single output file

dfsql.write.format("com.databricks.spark.csv").option("header", "false").option("delimiter", "|").option("quote","\u0000").mode("overwrite").save(outputDirectory)
merge(outputDirectory, mergedFileName)  // collapse the part files into the single named file
dfsql.unpersist()

One option is to write out a single large file to a path and then rename it, although this is not recommended since it works against Hadoop's partitioning concept.

val outputFilePath = "/mnt/data/output"
val outputFileName = "/mnt/data/output/filename.csv"
dfsql.coalesce(1).write.format("com.databricks.spark.csv").option("header", "false").option("delimiter", "|").option("quote","\u0000").mode("overwrite").save(outputFilePath)
// Spark names the single part file itself, e.g. /mnt/data/output/part-00000-....csv
val fs = FileSystem.get(new Configuration())
val partFile = fs.globStatus(new Path(outputFilePath + "/part-*"))(0).getPath
// Rename the generated part file to /mnt/data/output/filename.csv
fs.rename(partFile, new Path(outputFileName))
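
Renaming after the write avoids copyMerge entirely, but coalesce(1) still funnels every row through a single task, which is why this pattern is discouraged for large outputs.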
