
Using partitionBy on a DataFrameWriter writes directory layout with column names not just values

I am using Spark 2.0.

I have a DataFrame. My code looks like this:

df.write.partitionBy("year", "month", "day").format("csv").option("header", "true").save(s"s3://bucket/")

When the program runs, it writes the files in the following layout:

s3://bucket/year=2016/month=11/day=15/file.csv

How can I configure the layout to be:

s3://bucket/2016/11/15/file.csv

I would also like to know whether it is possible to configure the file name.

Here is the relevant documentation, which seems rather sparse...
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

partitionBy(colNames: String*): DataFrameWriter[T]
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:

year=2016/month=01/
year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.

This was initially applicable for Parquet but in 1.5+ covers JSON, text, ORC and avro as well.

This is expected and desired behavior. Spark uses the directory structure for partition discovery and pruning, and the correct structure, including the column names, is required for it to work.
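
For example, here is a hedged sketch of how that pruning plays out on the read side (the bucket path comes from the question, while the filter values and the SparkSession setup are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pruning-sketch").getOrCreate()

// s3://bucket/ is the layout from the question; the filter values are illustrative.
// Partition discovery parses "year=/month=/day=" out of the directory names,
// so a filter on those columns can skip whole directories instead of
// scanning every file under the bucket.
val logs = spark.read.option("header", "true").csv("s3://bucket/")
logs.filter(logs("year") === 2016 && logs("month") === 11).count()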

You also have to remember that partitioning drops the columns that are used for partitioning.
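
A minimal sketch of that effect (the /tmp/out path and the sample row are hypothetical; the column names are taken from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partition-columns").getOrCreate()
import spark.implicits._

// /tmp/out and the sample row are hypothetical.
// The partition columns end up encoded in the path, not in the CSV files:
// /tmp/out/year=2016/month=11/day=15/part-....csv contains only "payload".
Seq((2016, 11, 15, "event-a"))
  .toDF("year", "month", "day", "payload")
  .write.partitionBy("year", "month", "day")
  .format("csv").option("header", "true")
  .save("/tmp/out")

// Reading one leaf directory back yields only the non-partition columns;
// reading /tmp/out itself would recover year/month/day via partition discovery.
spark.read.option("header", "true").csv("/tmp/out/year=2016/month=11/day=15").columns
// Array(payload)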

If you need a specific directory structure, you should rename the directories with a downstream process.

You can use the following script to rename the directories:

#!/usr/bin/env bash

# Rename partition directories: strip the "COLUMN=" prefix,
# e.g. DATE=20170708 becomes 20170708.

path=$1
col=$2
for f in $(hdfs dfs -ls "$path" | awk '{print $NF}' | grep "$col="); do
    a="$(echo "$f" | sed "s/$col=//")"
    hdfs dfs -mv "$f" "$a"
done
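
If you would rather do the rename from inside a Spark application instead of a shell script, here is a minimal sketch of the same idea using the Hadoop FileSystem API (an existing spark session is assumed, and the bucket path and the year= prefix are assumptions taken from the question; like the script above, it handles one partition level at a time, so a nested year/month/day layout needs it applied level by level):

import org.apache.hadoop.fs.{FileSystem, Path}

// s3://bucket/ and the "year=" prefix are assumptions from the question.
// Strip the prefix from the first-level partition directories,
// e.g. s3://bucket/year=2016 becomes s3://bucket/2016.
// Note that on S3 a rename is implemented as copy + delete.
val root = new Path("s3://bucket/")
val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(root)
  .filter(_.getPath.getName.startsWith("year="))
  .foreach { status =>
    val src = status.getPath
    val dst = new Path(src.getParent, src.getName.stripPrefix("year="))
    fs.rename(src, dst)
  }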
