How to split a Spark Scala DataFrame if the row count exceeds a threshold

Question

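I am trying to load a huge DataFrame in parts when its row count exceeds a threshold. If the DataFrame has 4 million rows and the threshold is 3 million rows, we want to load 3 million rows first and then the remaining 1 million rows in the next loop. Here is the pseudocode I tried, but these constructs do not work in Scala. I am looking for a Scala equivalent, or possibly a better approach: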

    if (deltaExtractCount > 3000000)
    {
      length = len(df)
      count = 0
      while (count < length)
      {
        new_df = df[count : count + 3000000]
        insert(new_df)
        count = count + 3000000
      }
    }

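This is what I tried, without success. I have not found an equivalent Scala function; this pseudocode is closer to Python. I am using Spark 3.1.2 and Scala 2.12. If there is another way, please let me know how to achieve this kind of split.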

Answer 1

I wrote this code and hope it is understandable. I also added some comments to make it easier to follow. Assume csv is your dataset:

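    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.expr;

    // Assign an increasing ID to each row, so we know which rows to get
    // ("name" is a column of the example dataset; order by any column that gives a stable order)
    csv = csv.withColumn("ID", expr("row_number() over (order by name)"));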

    // Total dataset count
    long count = csv.count();
    // Threshold, modify this to the number you want
    long loadThreshold = 300000;
    // This will tell you how many 'loops' you will get
    double loopingTimes = Math.ceil(count * 1.0 / loadThreshold);

    // emptyDataset is an empty dataset, I did this just to fetch the schema
    Dataset<Row> emptyDataset = csv.limit(0);
    for (int i = 0; i < loopingTimes; i++) {
        // This is the new fetched dataset
        Dataset<Row> withThreshold = csv
                .where(col("ID").gt(i * loadThreshold).and(col("ID").leq((i + 1) * loadThreshold)));

        // Once 300000 rows are fetched, union to the main file
        emptyDataset = emptyDataset.union(withThreshold);
    }

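I ran a test case with loadThreshold equal to 25 and a dataset count of 60 rows, which gives loopingTimes = 3. The chunks are then fetched like this:

First loop: 25 rows

Second loop: 25 rows

Third loop: 10 rows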
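The chunk sizes above follow from simple range arithmetic, which can be checked without Spark. The following is a minimal sketch in plain Scala; the object name `ChunkBounds` and the function `chunkBounds` are hypothetical helpers introduced here for illustration, and IDs are assumed to run from 1 to `count`, as `row_number()` produces.

```scala
object ChunkBounds {
  // Returns (exclusive lower bound, inclusive upper bound) on the ID column
  // for each chunk, assuming IDs run from 1 to `count`.
  def chunkBounds(count: Long, threshold: Long): Seq[(Long, Long)] = {
    require(threshold > 0, "threshold must be positive")
    // Step through 0, threshold, 2 * threshold, ... while below count;
    // the last chunk is capped at count.
    (0L until count by threshold).map(low => (low, math.min(low + threshold, count)))
  }
}
```

For 60 rows and a threshold of 25, `ChunkBounds.chunkBounds(60L, 25L)` yields the bounds (0, 25), (25, 50) and (50, 60), that is, chunks of 25, 25 and 10 rows.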

This is written in Java, but it is almost the same in Scala. Good luck!
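Since the question asks for Scala, here is one way the same idea might look there. This is a sketch under assumptions, not a tested port of the answer: the object `SplitByThreshold`, the function `splitByThreshold`, and the `orderColumn` parameter are hypothetical names, and instead of unioning the chunks back together it returns them as a sequence so each can be inserted separately. It also caches the numbered DataFrame so the window is not recomputed for every chunk, which is a design choice the original answer does not make.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object SplitByThreshold {
  // Splits `df` into consecutive chunks of at most `threshold` rows.
  // `orderColumn` (hypothetical parameter) names a column that gives rows a stable order.
  def splitByThreshold(df: DataFrame, threshold: Long, orderColumn: String): Seq[DataFrame] = {
    require(threshold > 0, "threshold must be positive")
    // row_number() assigns 1, 2, 3, ... in the given order.
    // Note: a window without partitionBy pulls all rows into a single partition.
    val withId = df
      .withColumn("ID", row_number().over(Window.orderBy(col(orderColumn))))
      .cache()
    val total = withId.count()
    val loops = math.ceil(total.toDouble / threshold).toInt
    (0 until loops).map { i =>
      // Keep rows with i * threshold < ID <= (i + 1) * threshold, then drop the helper column
      withId
        .where(col("ID") > i * threshold && col("ID") <= (i + 1) * threshold)
        .drop("ID")
    }
  }
}
```

A call such as `SplitByThreshold.splitByThreshold(df, 3000000L, "name").foreach(insert)` would then load each chunk in turn, where `insert` stands for whatever load function you use.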

Answer 2

Here is how I did it, similarly taking hints from "Spark Scala Split dataframe into equal number of rows":

    if (deltaExtractCount > 25) {
      val k = 25                          // chunk size (the threshold)
      val totalCount = deltaExtractCount
      var lowLimit = 0
      var highLimit = lowLimit + k

      while (lowLimit < totalCount) {
        // masterLoadDf is assumed to already carry a 1-based row_num column
        val splitDf = masterLoadDf.where(s"row_num <= $highLimit and row_num > $lowLimit")
        lowLimit = lowLimit + k
        highLimit = highLimit + k

        InsertIntoDB(splitDf)
      }
    } else {
      // At or below the threshold: load everything in one go
      InsertIntoDB(loadDeltaBatchDF)
    }
