apache-spark partitionBy: remove column names from the directory layout

Question

I have code like this:

val data1 = data.withColumn("local_date_time", toLocalDateUdf('timestamp))
data1
  .withColumn("year", year(col("local_date_time")))
  .withColumn("month", month(col("local_date_time")))
  .withColumn("day", dayofmonth(col("local_date_time")))
  .withColumn("hour", hour(col("local_date_time")))
  .drop("local_date_time")
  .write
  .mode("append")
  .partitionBy("year", "month", "day", "hour")
  .format("json")
  .save("s3a://path/")

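It creates nested folders on S3 such as year=2020/month=5/day=10 (year is the column name and 2020 is its value). I want nested folders like 2020/5/10 instead. When I use the partitionBy method, Spark adds the column name to each directory name.

This is from the Spark source code: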

  /**
   * Partitions the output by the given columns on the file system. If specified, the output is
   * laid out on the file system similar to Hive's partitioning scheme. As an example, when we
   * partition a dataset by year and then month, the directory layout would look like:
   * <ul>
   * <li>year=2016/month=01/</li>
   * <li>year=2016/month=02/</li>
   * </ul>
   */
    @scala.annotation.varargs
    def partitionBy(colNames: String*): DataFrameWriter[T] = {
      this.partitioningColumns = Option(colNames)
      this
    }

How can I remove the column names from the directory layout?

Answer 1

.partitionBy("year", "month", "day", "hour")

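The call above saves the output into partition directories named in the column=value format.

This is not a bug. It is the standard Hive-style partition layout described in the Scaladoc quoted in the question, and Spark uses it for every file-based format (Parquet, JSON, and so on).

partitionBy has no option to drop the column names. To get directories like 2020/5/10, you can iterate over each partition yourself and save it manually to a path that you build from the values.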
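
The manual approach could look roughly like the sketch below. It is only one possible reading of "iterate over each partition", not an official Spark feature: the object `PlainPartitionWriter` and its two methods are hypothetical names introduced for illustration. It assumes the number of distinct partition combinations is small enough to collect to the driver, and that the partition columns contain no nulls.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark.
object PlainPartitionWriter {

  // Builds a directory path from the values only, e.g. "s3a://path/2020/5/10/13".
  def partitionPath(basePath: String, values: Seq[Any]): String =
    (basePath.stripSuffix("/") +: values.map(_.toString)).mkString("/")

  // Writes one JSON directory per distinct combination of the partition columns.
  def writeWithoutColumnNames(
      df: DataFrame,
      basePath: String,
      partitionCols: Seq[String]): Unit = {
    // Cache because the DataFrame is scanned once per combination below.
    df.cache()

    // Assumption: the distinct combinations fit comfortably on the driver.
    val combos = df.select(partitionCols.map(col): _*).distinct().collect()

    combos.foreach { row =>
      val values = partitionCols.indices.map(row.get)

      // Keep only the rows that belong to this combination.
      // Assumption: no null partition values (=== never matches null).
      val predicate = partitionCols
        .zip(values)
        .map { case (name, value) => col(name) === value }
        .reduce(_ && _)

      df.filter(predicate)
        .drop(partitionCols: _*) // same as partitionBy: values live in the path only
        .write
        .mode(SaveMode.Append)
        .json(partitionPath(basePath, values))
    }

    df.unpersist()
  }
}
```

With the DataFrame from the question (after the four `withColumn` calls and the `drop`), the call would be `PlainPartitionWriter.writeWithoutColumnNames(df, "s3a://path/", Seq("year", "month", "day", "hour"))`. Two trade-offs are worth keeping in mind: this runs one Spark job per combination, so it is slower than a single `partitionBy` write, and Spark's automatic partition discovery relies on the column=value directory names, so the year, month, day and hour columns will not be reconstructed when the data is read back.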

Solution 1
0 2020-05-19 10:04:24
