How to rename Spark DataFrame output files in AWS using Spark Scala

Question

I am saving my Spark DataFrame output as partitioned CSV files in Scala. This is how I do it in Zeppelin:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// functions._ covers udf, input_file_name, regexp_extract, regexp_replace, when, rank, col and concat_ws used below
import org.apache.spark.sql.functions._

val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))

val rdd = sc.textFile("s3://trfsmallfffile/FinancialLineItem/MAIN")
val header = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)

val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)

val df1resultFinal=data.withColumn("DataPartition", get_cus_val(input_file_name))
val rdd1 = sc.textFile("s3://trfsmallfffile/FinancialLineItem/INCR")
val header1 = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)


import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(LongType).desc) 
val latestForEachKey = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")


val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
      .select($"LineItem_organizationId", $"LineItem_lineItemId",
        when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition").as("DataPartition"),
        when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
        when($"FinancialConceptLocalId_1".isNotNull, $"FinancialConceptLocalId_1").otherwise($"FinancialConceptLocalId").as("FinancialConceptLocalId"),
        when($"FinancialConceptGlobalId_1".isNotNull, $"FinancialConceptGlobalId_1").otherwise($"FinancialConceptGlobalId").as("FinancialConceptGlobalId"),
        when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
        when($"FFAction_1".isNotNull, $"FFAction_1").otherwise($"FFAction|!|").as("FFAction|!|"))
        .filter(!$"FFAction|!|".contains("D|!|"))

val dfMainOutputFinal = dfMainOutput.na.fill("").select($"DataPartition",$"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))

val headerColumn = dataHeader.columns.toSeq

val header = headerColumn.mkString("", "|^|", "|!|").dropRight(3)

// regexp_replace takes a regex, so the literal "|^|" delimiter has to be escaped
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "\\|\\^\\|null", "")).withColumnRenamed("concatenated", header)


dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("delimiter", "\t")
  .option("quote", "\u0000")
  .option("header", "true")
  .option("codec", "gzip")
  .save("s3://trfsmallfffile/FinancialLineItem/output")

  val FFRowCount =dfMainOutputFinalWithoutNull.groupBy("DataPartition","StatementTypeCode").count

  FFRowCount.coalesce(1).write.format("com.databricks.spark.xml")
  .option("rootTag", "FFFileType")
  .option("rowTag", "FFPhysicalFile")
  .save("s3://trfsmallfffile/FinancialLineItem/Descr")

The files are now saved in a partitioned folder structure, as expected.

My requirement is to rename all the part files and save them in one directory. Each file name should be derived from the folder structure the file sits in.

For example, I have one file saved as folder/DataPartition=Japan/PartitionYear=1971/part-00001-87a61115-92c9-4926-a803-b46315e55a08.c000.csv.gz

I want my file names to be:

Japan.1971.1.txt.gz
Japan.1971.2.txt.gz

I have done this before in Java MapReduce: after the job completed, I read the HDFS file system and moved each file to a different location under its new name.

But how do I do this on the AWS S3 file system in Spark Scala?

As far as I have researched, there is no direct way to rename Spark DataFrame output files.

There is an approach that can be implemented in the job itself, using MultipleOutputs with saveAsHadoopFile, but how do I do that?

I am looking for some sample code in Scala.

Essentially, after the job completes we need to read the files from S3, rename them, and move them to some other location.

Answer 1

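import org.apache.spark.sql.SaveMode
import org.apache.hadoop.fs._

// Write everything as a single gzip-compressed part file into a temporary directory
val tempOutPath = "mediamath.dir"
headerDf.union(outDf)
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .format("text")
  .option("codec", "gzip")
  .save(tempOutPath)

val sc = spark.sparkContext
// Default filesystem of the job; fs.rename needs both paths to be on this filesystem
val fs = FileSystem.get(sc.hadoopConfiguration)
// (0) takes the first matching part file; repartition(1) means there is only one
val file = fs.globStatus(new Path(tempOutPath + "/part*.gz"))(0).getPath.getName

// Move the part file to its final name; <aws-s3-path> is a placeholder for the target path
fs.rename(new Path(tempOutPath + "/" + file), new Path(<aws-s3-path>))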

Here is my code snippet; please see if it helps you.

Answer 2

AFAIK, it is not possible to rename a file/object in an S3 bucket directly.

You can achieve a rename as: copy to target + delete source.

First, let's extract the filename from the source:

def prepareNewFilename(oldFilename: String) = {

  val pattern = raw".*/DataPartition=%s/PartitionYear=%s/part-%s.*\.%s"
    .format("([A-Za-z]+)", "([0-9]+)", "([0-9]+)", "([a-z]+)")
    .r

  val pattern(country, year, part, extn) = oldFilename

  "%s.%s.%s.%s.%s".format(country, year, part, "txt", extn)
} 

val oldFilename = "folder/DataPartition=Japan/PartitionYear=1971/part-00001-87a61115-92c9-4926-a803-b46315e55a08.c000.csv.gz"

val newFilename = prepareNewFilename(oldFilename)
//newFilename: String = Japan.1971.00001.txt.gz
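
The question asks for names like Japan.1971.1.txt.gz, while the function above yields Japan.1971.00001.txt.gz and throws a MatchError on any path that does not fit the pattern (for example a _SUCCESS marker). A minimal variant, assuming the same folder layout, might look like the following. PartFileNames and newName are hypothetical names used only for illustration.

```scala
object PartFileNames {
  // Matches .../DataPartition=<name>/PartitionYear=<year>/part-<digits>...<.ext>
  private val PartFile =
    raw".*/DataPartition=([A-Za-z]+)/PartitionYear=([0-9]+)/part-([0-9]+).*\.([a-z]+)".r

  /** Builds the new file name, or returns None when the key is not a part file. */
  def newName(oldKey: String): Option[String] = oldKey match {
    case PartFile(country, year, part, extn) =>
      // part.toInt drops the zero padding of the Spark part number: "00001" becomes 1
      Some(s"$country.$year.${part.toInt}.txt.$extn")
    case _ => None
  }
}
```

Returning an Option lets the caller skip non-data files with a simple flatMap instead of catching an exception. Note that Spark numbers part files from 00000, so this variant can produce a 0 suffix.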

Code to rename the file/object in the S3 bucket:

import com.amazonaws.AmazonServiceException
import com.amazonaws.services.s3.AmazonS3ClientBuilder

val s3 = AmazonS3ClientBuilder.defaultClient()
// sourceBkt and targetBkt hold the source and target bucket names (not defined in this snippet)

try {
  s3.copyObject(sourceBkt, oldFilename, targetBkt, newFilename)
  s3.deleteObject(sourceBkt, oldFilename)
} catch {
  case e: AmazonServiceException =>
    System.err.println(e.getErrorMessage)
    System.exit(1)
}
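
The snippet above renames a single object, whereas the question involves every part file under every partition folder. One way to extend the copy + delete idea is sketched below, assuming the list of object keys has already been obtained and that files should be numbered 1, 2, ... within each partition. S3Rename, plan and renameAll are hypothetical names, and the copy and delete steps are passed in as functions so the sketch does not depend on any particular S3 client.

```scala
object S3Rename {
  // Captures DataPartition, PartitionYear and the final extension of a part file key
  private val PartFile =
    raw"(?:.*/)?DataPartition=([^/]+)/PartitionYear=([^/]+)/part-.*\.([a-z]+)".r

  private final case class Parsed(key: String, country: String, year: String, extn: String)

  /** Pairs every part-file key with its target key; other keys are ignored. */
  def plan(keys: Seq[String], targetPrefix: String): Seq[(String, String)] =
    keys
      .collect { case k @ PartFile(country, year, extn) => Parsed(k, country, year, extn) }
      .groupBy(p => (p.country, p.year))
      .values
      .flatMap { files =>
        // Number the files of one partition 1, 2, ... in key order
        files.sortBy(_.key).zipWithIndex.map { case (p, i) =>
          p.key -> s"$targetPrefix${p.country}.${p.year}.${i + 1}.txt.${p.extn}"
        }
      }
      .toSeq
      .sortBy(_._1)

  /** Applies rename = copy to target + delete source to each planned pair. */
  def renameAll(
      renames: Seq[(String, String)],
      copy: (String, String) => Unit,
      delete: String => Unit): Unit =
    renames.foreach { case (oldKey, newKey) =>
      copy(oldKey, newKey) // if the copy throws, the source object is not deleted
      delete(oldKey)
    }
}
```

With the client from the snippet above, the two functions could be supplied as (o, n) => s3.copyObject(sourceBkt, o, targetBkt, n) and o => s3.deleteObject(sourceBkt, o). Keeping the planning step separate from the side effects makes it possible to log or inspect the planned names before anything is moved.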
