I have data like this:
Date | Ident |
---|---|
01.02.2002 | AAA_111111 |
01.02.2002 | BBB_222222 |
I count duplicates and write the result to CSV files. My code looks like this:
(df.groupBy("Date", "Ident")
.agg(functions.count("*")))
.orderBy(functions.to_date(functions.column("Date"), "dd.MM.yyyy").cast(DateType).asc)
.write.format("csv").save(pathResult)
If df has roughly 100 rows, each file contains data for a single date, like this:
02.05.2020,AAA_111111,1
02.05.2020,AAA_111112,1
02.05.2020,AAA_111113,2
02.05.2020,AAA_111114,1
02.05.2020,AAA_111115,1
If df has roughly 10,000 rows, each file contains data for several dates, like this:
02.05.2020,AAA_111111,1
02.05.2020,AAA_111112,1
.......................
03.05.2020,AAB_111113,2
03.05.2020,AAB_111114,1
.......................
04.05.2020,AAC_111115,1
I can use partitionBy("Date"), but this creates a separate folder for each day and removes the "Date" column from the CSV.
Is it possible to write the data for only one "Date" to each file without using partitionBy()?
I would like to get data about only one date in one file for any df size.
You can try repartitioning on the date column with k partitions, where k is the number of distinct dates. Something like:
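import org.apache.spark.sql.functions.{col, countDistinct}
// k = number of distinct dates, used as the target partition count
val numPartitions = df.agg(countDistinct("Date")).first.getAs[Long](0).toInt
(df.groupBy("Date", "Ident")
.agg(functions.count("*")))
.orderBy(functions.to_date(functions.column("Date"), "dd.MM.yyyy").cast(DateType).asc)
// Rows with the same Date are hashed to the same partition; this shuffle does not preserve the global order from orderBy
.repartition(numPartitions, col("Date"))
.write.format("csv").save(pathResult)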
This way, you'll end up with k partitions written to k files. However, there is no guarantee that each file will contain only one date, because repartition hash-partitions the rows on the Date column. All rows with the same date will be in the same file, but in the event of a hash collision, you'll get a file with more than one date and an empty file. You can decide whether or not this is acceptable for your use case.
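To see whether a collision actually happens for a given dataset, one option is to tag each row with its partition id after the repartition and count the distinct dates per partition. The following is only a minimal sketch: `partitionsWithMixedDates` and the column names `partitionId` and `datesInPartition` are hypothetical names chosen for illustration, and it assumes `df` is non-empty and has a `Date` column as in the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, countDistinct, spark_partition_id}

// Hypothetical helper (not part of Spark): lists the partitions that would
// receive more than one distinct Date after repartitioning on that column.
def partitionsWithMixedDates(df: DataFrame): DataFrame = {
  // k = number of distinct dates, as in the answer above
  val numPartitions = df.agg(countDistinct("Date")).first.getAs[Long](0).toInt

  df.repartition(numPartitions, col("Date"))
    // Record which partition each row landed in
    .withColumn("partitionId", spark_partition_id())
    .groupBy("partitionId")
    .agg(countDistinct("Date").as("datesInPartition"))
    // Keep only partitions where dates collided
    .filter(col("datesInPartition") > 1)
}
```

If the returned DataFrame is empty, every output file would hold a single date for that particular input; a non-empty result shows which partitions are affected.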
If that isn't acceptable, you'll probably have to do some post-processing on the files. Unfortunately, Spark doesn't give you a lot of options for how the file output is structured.
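As a possible alternative to post-processing, you could drop down to the RDD API and assign each date its own partition index explicitly, which avoids hash collisions. This is a rough sketch rather than a tested production approach: `ExactPartitioner` and `writeOneDatePerFile` are hypothetical names, `counted` stands for the result of the groupBy/agg step, and the sketch assumes the `Date` column is a non-null string and that the number of distinct dates is small enough to collect on the driver.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical partitioner: the key is already the target partition index.
class ExactPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

// Hypothetical helper: writes `counted` so that each date gets its own partition.
def writeOneDatePerFile(spark: SparkSession, counted: DataFrame, pathResult: String): Unit = {
  // Collect the distinct dates and give each one a dense index 0 until k
  val dates = counted.select("Date").distinct().collect().map(_.getString(0))
  val indexOf = dates.zipWithIndex.toMap

  // Key every row by the index of its date, then place it in exactly that partition
  val keyed = counted.rdd.map(row => (indexOf(row.getAs[String]("Date")), row))
  val partitioned = keyed.partitionBy(new ExactPartitioner(dates.length)).values

  // Rebuild a DataFrame with the original schema; each partition is written as one file
  spark.createDataFrame(partitioned, counted.schema)
    .write.format("csv").save(pathResult)
}
```

Usage might look like `writeOneDatePerFile(spark, counted, pathResult)`. The trade-off is an extra pass to collect the distinct dates and a conversion to and from RDDs, and it relies on Spark writing one file per partition, which holds as long as options such as `maxRecordsPerFile` are not set.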