
Spark orderBy combines several groups into one file

I have data like this:

Date Ident
01.02.2002 AAA_111111
01.02.2002 BBB_222222

I count duplicates and write the result to CSV files. My code looks like this:

import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DateType

df.groupBy("Date", "Ident")
  .agg(functions.count("*"))
  .orderBy(functions.to_date(functions.column("Date"), "dd.MM.yyyy").cast(DateType).asc)
  .write.format("csv").save(pathResult)

If df is around 100 lines, each output file contains data for a single date, like this:

02.05.2020,AAA_111111,1
02.05.2020,AAA_111112,1
02.05.2020,AAA_111113,2
02.05.2020,AAA_111114,1
02.05.2020,AAA_111115,1

If df is around 10,000 lines, each output file contains data for several dates, like this:

02.05.2020,AAA_111111,1
02.05.2020,AAA_111112,1
.......................
03.05.2020,AAB_111113,2
03.05.2020,AAB_111114,1
.......................
04.05.2020,AAC_111115,1

I could use partitionBy("Date"), but that creates a separate folder for each day and removes the Date column from the CSV.

Is it possible to write the data for a single Date to one file without using partitionBy()?

I would like to get the data for only one date per file, for any size of df.

You can try repartitioning on the date column with k partitions, where k is the number of distinct dates. Something like:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

val numPartitions = df.agg(countDistinct("Date")).first.getAs[Long](0).toInt

df.groupBy("Date", "Ident")
  .agg(count("*"))
  .repartition(numPartitions, col("Date"))
  .sortWithinPartitions(to_date(col("Date"), "dd.MM.yyyy").cast(DateType).asc)
  .write.format("csv").save(pathResult)

Note that the repartition has to happen before the sort: a shuffle does not preserve row order, so calling orderBy first and repartitioning afterwards would throw the ordering away. sortWithinPartitions sorts each date's partition without triggering another shuffle.

This way, you'll end up with k partitions written to k files. However, there is no guarantee that each file will hold only one date, because repartition uses a HashPartitioner: all rows with the same date will land in the same file, but if two dates hash to the same partition you'll get one file with two dates and another file that is empty. You can decide whether or not this is acceptable for your use case.
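The collision risk can be illustrated without Spark. The sketch below (plain Scala, with hypothetical dates; Spark's actual partitioner uses its own hash function, so this only approximates the bucketing as hash(key) mod numPartitions) shows that mapping k distinct keys into k hash buckets does not guarantee one key per bucket:

```scala
import scala.util.hashing.MurmurHash3

object HashCollisionDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical dates, one per desired output file.
    val dates = Seq("02.05.2020", "03.05.2020", "04.05.2020", "05.05.2020")
    val numPartitions = dates.size

    // Approximate hash partitioning: bucket = hash(key) mod numPartitions.
    val buckets = dates.map(d => math.floorMod(MurmurHash3.stringHash(d), numPartitions))

    // If buckets.distinct.size < numPartitions, at least two dates collided:
    // one file would contain two dates and another would be empty.
    println(s"buckets = $buckets, distinct buckets = ${buckets.distinct.size}")
  }
}
```

Whether a collision actually occurs depends on the key values, which is exactly why the per-file guarantee cannot be made in general.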

If that isn't acceptable, you'll probably have to do some post-processing on the files. Unfortunately, Spark doesn't give you a lot of options for how the file output is structured.
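If you need a strict one-date-per-file guarantee inside Spark itself, one common workaround is to collect the distinct dates on the driver and write each one separately. This is only a sketch, assuming counts is the aggregated DataFrame from above and pathResult the output directory; it launches one job per date, so it is only reasonable for a modest number of dates:

```scala
import org.apache.spark.sql.functions.col

// Collect the distinct dates to the driver (assumes their count is small).
val dates = counts.select("Date").distinct.collect.map(_.getString(0))

for (date <- dates) {
  counts.filter(col("Date") === date)
    .coalesce(1)                  // force a single output file per date
    .write.format("csv")
    .save(s"$pathResult/$date")   // still one directory per date, but the
                                  // Date column stays in the rows
}
```

Unlike partitionBy("Date"), this keeps the Date value in each CSV row, at the cost of a separate Spark job per date and per-date subdirectories.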
