
spark creates partition inside partition on S3

I have the below tab-delimited sample dataset:

col1  period  col3  col4  col5  col6  col7  col8  col9  col10 col11 col12 col13 col14 col15 col16 col17 col18 col19 col20 col21 col22
ASSDF 202001  A B BFGF  SSDAA WDSF  SDSDSD  SDSDSSS SDSDSD  E F FS  E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99
ASSDF 202002  A B BFGF  SSDAA WDSF  SDSDSD  SDSDSSS SDSDSD  E F FS  E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99
ASSDF 202003  A B BFGF  SSDAA WDSF  SDSDSD  SDSDSSS SDSDSD  E F FS  E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99
ASSDF 202004  A B BFGF  SSDAA WDSF  SDSDSD  SDSDSSS SDSDSD  E F FS  E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99
...
...
ASSDF 202312  A B BFGF  SSDAA WDSF  SDSDSD  SDSDSSS SDSDSD  E F FS  E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99

I am running some transformations on this data and the final data is in the Spark dataset "DS1". After that I write the dataset to S3 partitioned by period. Since I want the period value to remain inside the S3 files as well, I create another column, "datasetPeriod", from the period column and use it as the partition column.

My Scala function to save the TSV dataset:

import org.apache.spark.sql.{DataFrame, SaveMode}

// Repartition the DataFrame and write it as tab-separated files,
// partitioned on S3 by the given columns.
def saveTsvDataset(dataframe: DataFrame, outputFullPath: String, numPartitions: Integer, partitionCols: String*): Unit = {
  dataframe
    .repartition(numPartitions)
    .write
    .partitionBy(partitionCols: _*)
    .mode(SaveMode.Overwrite)
    .option("sep", "\t")
    .csv(outputFullPath)
}

Scala code to save the dataset on S3, adding the new datasetPeriod column for the S3 partition:

saveTsvDataset(
  DS1.withColumn("datasetPeriod", $"period"),
  "s3://s3_path",
  100,
  "datasetPeriod"
)

Now, my problem is that I have periods from 202001 to 202312, and when I write to S3 partitioned on "datasetPeriod", it sometimes creates a partition inside a partition for some random period. This happens randomly and only for a single period; I have never seen it happen for multiple periods at once. It creates a path like "s3://s3_path/datasetPeriod=202008/datasetPeriod=202008".

You already have a period column in your DataFrame, so there is no need to create another, duplicate datasetPeriod column.
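
For example, the write from the question could partition directly on the existing column. A minimal sketch, reusing the saveTsvDataset helper and the placeholder path from the question:

// Partition on the existing "period" column instead of a duplicated one.
saveTsvDataset(
  DS1,
  "s3://s3_path",
  100,
  "period"
)

Because partitionBy moves the partition values into the folder names, the period values are still recovered as a column when the data is read back, as shown further below.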

When you write the DataFrame to s3://../parentFolder using .partitionBy("period"), it creates folders like the below:

df.write.partitionBy("period").csv("s3://../parentFolder/")
s3://.../parentFolder/period=202001/
s3://.../parentFolder/period=202002/
s3://.../parentFolder/period=202003/
...
s3://.../parentFolder/period=202312/

When reading the data back, just point to the path up to parentFolder; Spark will automatically pick up period as one of the columns.

// Read the tab-separated files; `period` is discovered from the partition folder names.
val df = spark.read.option("sep", "\t").csv("s3://../parentFolder/")
df.printSchema
root
 |-- col1: string (nullable = true)
 |-- .... //other columns go here
 |-- period: string (nullable = true)
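
Since the TSV files are written without a header row, the original column names (col1, col3 ... col22) are not inferred automatically. The sketch below supplies them as an explicit schema; the column list is an assumption based on the sample data, and period is omitted because it is appended from the folder names:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Columns stored inside the files: everything except the partition column `period`.
val dataCols = Seq("col1") ++ (3 to 22).map(i => s"col$i")
val schema = StructType(dataCols.map(name => StructField(name, StringType, nullable = true)))

val dfNamed = spark.read
  .option("sep", "\t")
  .schema(schema)
  .csv("s3://../parentFolder/")
// dfNamed has col1, col3..col22 from the files plus `period` appended by partition discovery.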

That being said, the partition-inside-partition folders you are getting can only be due to a wrong path being used while writing the data with partitionBy.
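
For illustration, a hedged sketch of how such a nested layout can appear when the output path accidentally already points inside a partition folder; the path here is hypothetical:

// Writing into a path that already ends in a partition folder (e.g. left over
// from a previous run) nests a second datasetPeriod=... level under it.
DS1.withColumn("datasetPeriod", $"period")
  .write
  .partitionBy("datasetPeriod")
  .mode(SaveMode.Overwrite)
  .option("sep", "\t")
  .csv("s3://s3_path/datasetPeriod=202008/")   // wrong: base path is already inside a partition
// Rows with datasetPeriod = 202008 then end up under
// s3://s3_path/datasetPeriod=202008/datasetPeriod=202008/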
