spark orderBy combine few groups

I have data like this:

Date        Ident
01.02.2002  AAA_111111
01.02.2002  BBB_222222

I count duplicates and write the result to CSV files. My code looks like this:

(df.groupBy("Date", "Ident")
.agg(functions.count("*")))
.orderBy(functions.to_date(functions.column("Date"), "dd.MM.yyyy").cast(DateType).asc)
.write.format("csv").save(pathResult)

If df is roughly 100 lines, each file contains data about one date. Like this:

02.05.2020,AAA_111111,1
02.05.2020,AAA_111112,1
02.05.2020,AAA_111113,2
02.05.2020,AAA_111114,1
02.05.2020,AAA_111115,1

If df is roughly 10000 lines, each file contains data about a few dates. Like this:

02.05.2020,AAA_111111,1
02.05.2020,AAA_111112,1
.......................
03.05.2020,AAB_111113,2
03.05.2020,AAB_111114,1
.......................
04.05.2020,AAC_111115,1

I can use partitionBy("Date"), but this creates a separate folder for each day and removes the "Date" column from the CSV.

Is it possible to write the data for only one "Date" to each file, without using partitionBy()?

I would like to get data about only one date per file, for any df size.

You can try repartitioning on the date column with k partitions, where k is the number of distinct dates. Something like:

import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{col, countDistinct}
import org.apache.spark.sql.types.DateType

val numPartitions = df.agg(countDistinct("Date")).first.getAs[Long](0).toInt

(df.groupBy("Date", "Ident")
.agg(functions.count("*")))
.orderBy(functions.to_date(functions.column("Date"), "dd.MM.yyyy").cast(DateType).asc)
.repartition(numPartitions, col("Date"))
.write.format("csv").save(pathResult)

This way, you'll end up with k partitions written to k files. However, there is no guarantee that each file will contain only one date, because repartition uses a HashPartitioner. All rows with the same date will end up in the same file, but in the event of a hash collision you'll get one file with more than one date and another file that is empty. You can decide whether that is acceptable for your use case.
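If you want to check up front whether your particular dates collide, you can reproduce the partition assignment yourself. This is only a sketch: it assumes Spark's usual placement rule for hash partitioning, pmod(hash(cols), numPartitions), and reuses df and numPartitions from above.

import org.apache.spark.sql.functions.{col, countDistinct, hash, lit, pmod}

// Assign each distinct date the partition id that repartition(numPartitions, col("Date"))
// would give it, then flag any partition that receives more than one date.
val collisions = df
  .select(col("Date")).distinct()
  .withColumn("partitionId", pmod(hash(col("Date")), lit(numPartitions)))
  .groupBy("partitionId")
  .agg(countDistinct("Date").as("datesInPartition"))
  .filter(col("datesInPartition") > 1)

collisions.show() // an empty result means no two dates would share a file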

If that isn't acceptable, you'll probably have to do some post-processing on the files. Unfortunately, Spark doesn't give you many options for how the file output is structured.
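One workaround, if separate folders per day turn out to be tolerable after all, is to duplicate the date into a throwaway column before writing: partitionBy() consumes the copy, so the original "Date" column stays in the CSV rows. A sketch (the "DateCopy" column name is just for illustration):

import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.col

// partitionBy removes the partition column from the written rows,
// so partition on a copy and keep the original "Date" intact
df.groupBy("Date", "Ident")
  .agg(functions.count("*"))
  .withColumn("DateCopy", col("Date"))
  .write.partitionBy("DateCopy")
  .format("csv").save(pathResult)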
