
Spark orderBy combines several groups into one file

I have data like this:

Date Ident
01.02.2002 AAA_111111
01.02.2002 BBB_222222

I count duplicates and write the result to CSV files. My code looks like this:

import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DateType

df.groupBy("Date", "Ident")
  .agg(functions.count("*"))
  .orderBy(functions.to_date(functions.column("Date"), "dd.MM.yyyy").cast(DateType).asc)
  .write.format("csv").save(pathResult)

If df is around 100 lines, each output file contains data for a single date, like this:

02.05.2020,AAA_111111,1
02.05.2020,AAA_111112,1
02.05.2020,AAA_111113,2
02.05.2020,AAA_111114,1
02.05.2020,AAA_111115,1

If df is around 10,000 lines, each output file contains data for several dates, like this:

02.05.2020,AAA_111111,1
02.05.2020,AAA_111112,1
.......................
03.05.2020,AAB_111113,2
03.05.2020,AAB_111114,1
.......................
04.05.2020,AAC_111115,1

I could use partitionBy("Date"), but that creates a separate folder for each day and removes the Date column from the CSV.

Is it possible to write the data for a single Date to one file without using partitionBy()?

I would like to get the data for only one date per file, for any size of df.

You can try repartitioning on the date column with k partitions, where k is the number of distinct dates. Something like:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

val numPartitions = df.agg(countDistinct("Date")).first.getAs[Long](0).toInt

df.groupBy("Date", "Ident")
  .agg(count("*"))
  .repartition(numPartitions, col("Date"))
  .sortWithinPartitions(to_date(col("Date"), "dd.MM.yyyy").cast(DateType).asc)
  .write.format("csv").save(pathResult)

Note that the repartition has to happen before the sort: a shuffle does not preserve row order, so calling orderBy first and repartitioning afterwards would throw the ordering away. sortWithinPartitions sorts each date's partition without triggering another shuffle.

This way, you'll end up with k partitions written to k files. However, there is no guarantee that each file will hold only one date, because repartition uses a HashPartitioner: all rows with the same date will land in the same file, but if two dates hash to the same partition you'll get one file with two dates and another file that is empty. You can decide whether or not this is acceptable for your use case.
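The collision risk can be illustrated without Spark. The sketch below (plain Scala, with hypothetical dates; Spark's actual partitioner uses its own hash function, so this only approximates the bucketing as hash(key) mod numPartitions) shows that mapping k distinct keys into k hash buckets does not guarantee one key per bucket:

```scala
import scala.util.hashing.MurmurHash3

object HashCollisionDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical dates, one per desired output file.
    val dates = Seq("02.05.2020", "03.05.2020", "04.05.2020", "05.05.2020")
    val numPartitions = dates.size

    // Approximate hash partitioning: bucket = hash(key) mod numPartitions.
    val buckets = dates.map(d => math.floorMod(MurmurHash3.stringHash(d), numPartitions))

    // If buckets.distinct.size < numPartitions, at least two dates collided:
    // one file would contain two dates and another would be empty.
    println(s"buckets = $buckets, distinct buckets = ${buckets.distinct.size}")
  }
}
```

Whether a collision actually occurs depends on the key values, which is exactly why the per-file guarantee cannot be made in general.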

If that isn't acceptable, you'll probably have to do some post-processing on the files. Unfortunately, Spark doesn't give you a lot of options for how the file output is structured.
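If you need a strict one-date-per-file guarantee inside Spark itself, one common workaround is to collect the distinct dates on the driver and write each one separately. This is only a sketch, assuming counts is the aggregated DataFrame from above and pathResult the output directory; it launches one job per date, so it is only reasonable for a modest number of dates:

```scala
import org.apache.spark.sql.functions.col

// Collect the distinct dates to the driver (assumes their count is small).
val dates = counts.select("Date").distinct.collect.map(_.getString(0))

for (date <- dates) {
  counts.filter(col("Date") === date)
    .coalesce(1)                  // force a single output file per date
    .write.format("csv")
    .save(s"$pathResult/$date")   // still one directory per date, but the
                                  // Date column stays in the rows
}
```

Unlike partitionBy("Date"), this keeps the Date value in each CSV row, at the cost of a separate Spark job per date and per-date subdirectories.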
