Using partitionBy and coalesce together

Question

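I need to write data to S3 based on a particular partition key, which is easy to do with write.partitionBy. In this case, however, I need to write only one file in each path. I am using the code below to do that.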

    orderFlow.coalesce(1).write.partitionBy("SellerYearMonthWeekKey")
      .mode(SaveMode.Overwrite)
      .format("com.databricks.spark.csv")
      .option("delimiter", ",")
      .option("header", "true")
      .save(outputS3Path + "/")

Can you help me with the best way to do this? With the code above I am getting an OutOfMemory error.

Answer 1

If you want one output file per partition, you can repartition the dataset by the same column used in partitionBy:

    orderFlow.repartition("SellerYearMonthWeekKey")
      .write.partitionBy("SellerYearMonthWeekKey")
      .mode(SaveMode.Overwrite)
      .format("com.databricks.spark.csv")
      .option("delimiter", ",")
      .option("header", "true")
      .save(outputS3Path + "/")

This costs you a full shuffle of the data, but it guarantees exactly one file per partition directory.
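To see why repartitioning by the partitionBy column yields one file per directory, here is a toy model in plain Scala. It is not Spark code: a "partition" is just a sequence of keys, the week keys are made up, and `fileCount` and `repartitionByKey` are hypothetical helpers. It only assumes that each write task emits one file per distinct key it holds.

```scala
// Toy model, not Spark itself: a "partition" is just a Seq of keys,
// and each write task emits one file per distinct key it holds.
object FilesPerKeySketch {
  // Number of output files = sum over tasks of the distinct keys each task sees.
  def fileCount(partitions: Seq[Seq[String]]): Int =
    partitions.map(_.distinct.size).sum

  // Stand-in for repartition(col): route every key to one partition by hash.
  def repartitionByKey(partitions: Seq[Seq[String]], n: Int): Seq[Seq[String]] = {
    val grouped = partitions.flatten.groupBy(k => Math.floorMod(k.hashCode, n))
    (0 until n).map(i => grouped.getOrElse(i, Seq.empty))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical week keys spread over three input partitions.
    val input = Seq(
      Seq("2019-W01", "2019-W02"),
      Seq("2019-W01", "2019-W03"),
      Seq("2019-W02", "2019-W03")
    )
    println(fileCount(input))                      // 6 files: two per key
    println(fileCount(repartitionByKey(input, 4))) // 3 files: one per key
  }
}
```

After the hash step every key lives in exactly one partition, so the total is the number of distinct keys no matter how many keys share a partition. The flip side is that all rows for a key end up in a single task, so a heavily skewed key can still make that one task large.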

Answer 2

I think this may help you:

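    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.sql.SparkSession

    object Stackoverflow1 {

      def main(args: Array[String]): Unit = {

        val spark = SparkSession.builder().appName("Test").master("local").getOrCreate()

        // orderFlow is the DataFrame from the question; it must be loaded with `spark` before this point.
        // Key each row by its partition value and flatten the row into one comma-separated line.
        val rdd = orderFlow.rdd.map(a => (a.getAs[String]("SellerYearMonthWeekKey"), a.toSeq.mkString(",")))

        val outputPath = "<S3_Location>"

        rdd.saveAsHadoopFile(outputPath, classOf[String], classOf[String],
          classOf[CustomMultipleTextOutputFormat])
      }

      class CustomMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
        // Returning NullWritable drops the key, so only the value is written to each line.
        override def generateActualKey(key: Any, value: Any): Any =
          NullWritable.get()

        // Name the output file after the record's key.
        override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
          key.asInstanceOf[String]
      }

    }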
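The effect of the two overrides above can be sketched in plain Scala without Hadoop. This is only an illustration: `route` is a hypothetical helper and the records are made up. It models `generateFileNameForKeyValue` as "the key picks the file" and `generateActualKey` as "only the value is written".

```scala
object KeyedOutputSketch {
  // The key selects the file name; only the value ends up as a line in that file.
  def route(records: Seq[(String, String)]): Map[String, Seq[String]] =
    records.groupBy(_._1).map { case (file, kvs) => file -> kvs.map(_._2) }

  def main(args: Array[String]): Unit = {
    // Hypothetical (key, csvLine) pairs, shaped like the RDD built in the answer.
    val records = Seq(
      ("2019-W01", "2019-W01,order-1"),
      ("2019-W02", "2019-W02,order-2"),
      ("2019-W01", "2019-W01,order-3")
    )
    route(records).toSeq.sortBy(_._1).foreach { case (file, lines) =>
      println(s"$file -> ${lines.mkString(" | ")}")
    }
  }
}
```

Compared with write.partitionBy, this approach names each file after the key value directly instead of creating a `column=value` directory, writes no header row, and keeps the key column inside every line because the whole row is joined into the value.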

Answer 1
Score 2, accepted, 2019-07-19 10:41:21

Answer 2
Score 0, 2019-07-19 10:55:54
