I have the following tab-delimited sample dataset:
col1 period col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 col15 col16 col17 col18 col19 col20 col21 col22
ASSDF 202001 A B BFGF SSDAA WDSF SDSDSD SDSDSSS SDSDSD E F FS E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99
ASSDF 202002 A B BFGF SSDAA WDSF SDSDSD SDSDSSS SDSDSD E F FS E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99
ASSDF 202003 A B BFGF SSDAA WDSF SDSDSD SDSDSSS SDSDSD E F FS E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99
ASSDF 202004 A B BFGF SSDAA WDSF SDSDSD SDSDSSS SDSDSD E F FS E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99
...
...
ASSDF 202312 A B BFGF SSDAA WDSF SDSDSD SDSDSSS SDSDSD E F FS E CURR1 CURR2 -99 CURR3 -99 -99 -99 -99
I am running some transformations on this data, and the final data is in the Spark dataset "DS1". After that, I write the dataset to S3 partitioned by period. Since I want the period values inside the S3 files as well, I am creating another column "datasetPeriod" from the period column and partitioning on that instead.
My Scala function to save a TSV dataset:
def saveTsvDataset(dataframe: DataFrame, outputFullPath: String, numPartitions: Int, partitionCols: String*): Unit = {
  dataframe
    .repartition(numPartitions)
    .write
    .partitionBy(partitionCols: _*)
    .mode(SaveMode.Overwrite)
    .option("sep", "\t")
    .csv(outputFullPath)
}
Scala code to save the dataset on S3, adding the new column datasetPeriod to partition on:
saveTsvDataset(
  DS1.withColumn("datasetPeriod", $"period"),
  "s3://s3_path",
  100,
  "datasetPeriod"
)
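For reference, the layout I expect on S3 is one datasetPeriod=... folder per period, with the period column still present inside the TSV files (partitionBy drops only the partition column itself from the file contents, which is why I duplicated it):

```
s3://s3_path/datasetPeriod=202001/part-....csv
s3://s3_path/datasetPeriod=202002/part-....csv
...
s3://s3_path/datasetPeriod=202312/part-....csv
```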
Now, my problem is that I have periods from 202001 to 202312, and when I write to S3 partitioned on "datasetPeriod", it sometimes creates a partition inside a partition for some random period. This happens randomly, for any period, and I have never seen it happen for more than one period at a time. It creates a path like "s3://s3_path/datasetPeriod=202008/datasetPeriod=202008".
You already have a period column in your DataFrame, so there is no need to create another duplicate datasetPeriod column.
When you write a DataFrame to s3://../parentFolder using .partitionBy("period"), it creates folders like below:
df.write.partitionBy("period").csv("s3://../parentFolder/")
s3://.../parentFolder/period=202001/
s3://.../parentFolder/period=202002/
s3://.../parentFolder/period=202003/
...
s3://.../parentFolder/period=202312/
While reading the data back, just mention the path up to parentFolder only; that will automatically read period as one of the columns.
val df = spark.read.csv("s3://../parentFolder/")
//df.schema will give you `period` as one of the columns
df.printSchema
root
|-- col1: string (nullable = true)
|-- .... //other columns go here
|-- period: string (nullable = true)
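If you ever need to read just one period's folder while still getting the partition value back as a column, Spark's partition-discovery basePath option covers that. A sketch, assuming an existing SparkSession named spark and placeholder paths:

```scala
// Reading a single partition directory directly would normally drop the
// `period` column, because its value lives only in the folder name.
// Passing the parent folder as `basePath` tells Spark where partition
// discovery starts, so `period` is recovered as a column.
val one = spark.read
  .option("sep", "\t")
  .option("basePath", "s3://../parentFolder/")
  .csv("s3://../parentFolder/period=202001/")
// `one` contains only the rows for 202001, with `period` as a column.
```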
That being said, the partition-inside-partition folders you are getting can only come from a wrong output path being used while writing the data with partitionBy (e.g. a path that already points inside an existing partition folder).
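To make that failure mode concrete, here is a minimal plain-Scala sketch; the partitionPath helper is hypothetical (not a Spark API) and only mimics how Hive-style partition folders are named:

```scala
// Hypothetical helper mimicking how Spark names Hive-style partition
// directories: <base>/<column>=<value>
def partitionPath(base: String, col: String, value: String): String =
  s"${base.stripSuffix("/")}/$col=$value"

// Correct: write to the parent folder; one partition level is appended.
val good = partitionPath("s3://s3_path/", "datasetPeriod", "202008")
// good: s3://s3_path/datasetPeriod=202008

// Wrong: the output path already points inside a partition folder,
// so a second datasetPeriod=202008 level appears under it.
val bad = partitionPath("s3://s3_path/datasetPeriod=202008", "datasetPeriod", "202008")
// bad: s3://s3_path/datasetPeriod=202008/datasetPeriod=202008
```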