Fast way to split a Spark dataframe by keys in a column and save the parts as separate dataframes
I have a very big Spark 2.3 dataframe like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AB | 2 | 1 |
| AA | 2 | 3 |
| AC | 1 | 2 |
| AA | 3 | 2 |
| AC | 5 | 3 |
-------------------------
I need to "split" this dataframe by the values in the col_key column and save each part in a separate CSV file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA | 1 | 2 |
| AA | 2 | 3 |
| AA | 3 | 2 |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC | 1 | 2 |
| AC | 5 | 3 |
-------------------------
and so on. I need to save every resulting dataframe as a different CSV file.
The number of keys is not big (20-30), but the total amount of data is (~200 million records).
I have a solution where each part of the data is selected in a loop and then saved to a file:
import org.apache.spark.sql.functions.lit
import spark.implicits._

val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach { k =>
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save) // my own helper that writes a dataframe to CSV
}
It works correctly, but the big problem with this solution is that selecting the data for every key causes a full pass through the whole dataframe, and it takes too much time. I think there must be a faster solution, where we pass through the dataframe only once and, during that pass, route every record to the "right" result dataframe (or directly to a separate file). But I don't know how to do it :) Maybe someone has ideas about it?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).
You need to partition by the column when writing and save as CSV. Each key's rows are then written under their own subdirectory.
yourDF
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")
Why don't you try this?
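To get roughly one CSV file per key in a single pass, partitionBy can be combined with a repartition on the same column, so each key's rows land in a single task before the write. A runnable sketch (the output path, sample data, and option choices are my assumptions, not from the question):

```scala
import org.apache.spark.sql.SparkSession

object SplitByKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("split-by-key")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small stand-in for the big dataframe from the question
    val df = Seq(
      ("AA", 1, 2), ("AB", 2, 1), ("AA", 2, 3),
      ("AC", 1, 2), ("AA", 3, 2), ("AC", 5, 3)
    ).toDF("col_key", "col1", "col2")

    df
      .repartition($"col_key")   // hash-partition so each key sits in one shuffle partition
      .write
      .partitionBy("col_key")    // creates col_key=AA/, col_key=AB/, ... subdirectories
      .option("header", "true")
      .mode("overwrite")
      .csv("/tmp/split_by_key")  // hypothetical output path

    spark.stop()
  }
}
```

This shuffles the data once instead of scanning it once per key; note that the col_key column itself is dropped from the CSV rows and encoded in the directory name instead.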