Spark Scala: Split dataframe into equal number of rows
I have a DataFrame and want to split it into an equal number of rows.
In other words, I want a list of dataframes where each one is a disjoint subset of the original dataframe.
Suppose the input dataframe is as follows:
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 100|[15.5097750043269...|
|15.509775004326936| 0| 101|[15.5097750043269...|
|15.509775004326936| 0| 102|[15.5097750043269...|
|15.509775004326936| 0| 103|[15.5097750043269...|
|15.509775004326936| 0| 104|[15.5097750043269...|
|15.509775004326936| 0| 105|[15.5097750043269...|
|15.509775004326936| 0| 106|[15.5097750043269...|
|15.509775004326936| 0| 107|[15.5097750043269...|
|15.509775004326936| 0| 108|[15.5097750043269...|
|15.509775004326936| 0| 109|[15.5097750043269...|
|15.509775004326936| 0| 110|[15.5097750043269...|
|15.509775004326936| 0| 111|[15.5097750043269...|
|15.509775004326936| 0| 112|[15.5097750043269...|
|15.509775004326936| 0| 113|[15.5097750043269...|
|15.509775004326936| 0| 114|[15.5097750043269...|
|15.509775004326936| 0| 115|[15.5097750043269...|
| 43.01955000865387| 0| 116|[43.0195500086538...|
+------------------+-----------+-----+--------------------+
I want to split it into K equally-sized dataframes. If k = 4, a possible result would be:
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 106|[15.5097750043269...|
|15.509775004326936| 0| 107|[15.5097750043269...|
|15.509775004326936| 0| 110|[15.5097750043269...|
|15.509775004326936| 0| 111|[15.5097750043269...|
+------------------+-----------+-----+--------------------+
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 104|[15.5097750043269...|
|15.509775004326936| 0| 108|[15.5097750043269...|
|15.509775004326936| 0| 112|[15.5097750043269...|
|15.509775004326936| 0| 114|[15.5097750043269...|
+------------------+-----------+-----+--------------------+
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 100|[15.5097750043269...|
|15.509775004326936| 0| 105|[15.5097750043269...|
|15.509775004326936| 0| 109|[15.5097750043269...|
|15.509775004326936| 0| 115|[15.5097750043269...|
+------------------+-----------+-----+--------------------+
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 101|[15.5097750043269...|
|15.509775004326936| 0| 102|[15.5097750043269...|
|15.509775004326936| 0| 103|[15.5097750043269...|
|15.509775004326936| 0| 113|[15.5097750043269...|
| 43.01955000865387| 0| 116|[43.0195500086538...|
+------------------+-----------+-----+--------------------+
From my understanding of your input and desired output, you can create row numbers by grouping the dataframe under a single groupId. You can then filter the dataframe by comparing the row number against the chunk boundaries, and store each chunk elsewhere as needed.
Here is an ad hoc solution that meets your needs; you can adapt it as required:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val k = 4
// Put every row in the same group so row_number() runs over the whole dataframe
val windowSpec = Window.partitionBy("grouped").orderBy("original_dt")
val newDF = dataFrame.withColumn("grouped", lit("grouping"))
val latestDF = newDF.withColumn("row", row_number() over windowSpec)
val totalCount = latestDF.count()
var lowLimit = 0
var highLimit = lowLimit + k
while (lowLimit < totalCount) {
  // Each pass selects the next chunk of k rows
  latestDF.where(s"row <= ${highLimit} and row > ${lowLimit}").show(false)
  lowLimit = lowLimit + k
  highLimit = highLimit + k
}
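If you need the chunks as an actual list of dataframes rather than just printing them, here is a minimal sketch along the same lines (it reuses latestDF, totalCount, and k from above; the chunks buffer is my own naming, not from the original answer):

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame

// Collect each chunk of k rows instead of calling show
val chunks = ListBuffer[DataFrame]()
var low = 0
while (low < totalCount) {
  chunks += latestDF.where(s"row <= ${low + k} and row > ${low}").drop("grouped", "row")
  low += k
}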
Hope this gives you a good starting point.
Another solution is to use limit and except. The following program returns an array of dataframes with equal numbers of rows, except for the first, which may contain fewer rows.
import spark.implicits._

var numberOfNew = 4
var input = List(1, 2, 3, 4, 5, 6, 7, 8, 9).toDF
var newFrames = 0 to numberOfNew map (_ => Seq.empty[Int].toDF) toArray
var size = input.count()
val limit = (size / numberOfNew).toInt
while (size > 0) {
  // Take the next `limit` rows, then remove them from the remaining input
  newFrames(numberOfNew) = input.limit(limit)
  input = input.except(newFrames(numberOfNew))
  size = size - limit
  numberOfNew = numberOfNew - 1
}
newFrames.foreach(_.show)
+-----+
|value|
+-----+
| 7|
+-----+
+-----+
|value|
+-----+
| 4|
| 8|
+-----+
+-----+
|value|
+-----+
| 5|
| 9|
+-----+
...
This is an improvement on Steffen Schmitz's answer above, which is in fact incorrect. I have improved it and am leaving it here for posterity. I am, however, curious about its performance at scale.
import spark.implicits._

var numberOfNew = 4
var input = Seq((1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)).toDF
var newFrames = 0 to numberOfNew - 1 map (_ => Seq.empty[(Int, Int)].toDF) toArray
val size = input.count()
val limit = (size / numberOfNew).toInt
// The last frame to be filled absorbs whatever the equal-sized chunks leave over
val residual = (size - limit.toLong * (numberOfNew - 1)).toInt
var limit_to_use = limit
while (numberOfNew > 0) {
  if (numberOfNew == 1) limit_to_use = residual
  newFrames(numberOfNew - 1) = input.limit(limit_to_use)
  input = input.except(newFrames(numberOfNew - 1))
  numberOfNew = numberOfNew - 1
}
newFrames.foreach(_.show)

val singleDF = newFrames.reduce(_ union _)
singleDF.show(false)
This prints the individual frames (the output of singleDF.show is not reproduced here):
+---+---+
| _1| _2|
+---+---+
| 7| 8|
| 3| 4|
| 11| 12|
+---+---+
+---+---+
| _1| _2|
+---+---+
| 5| 6|
+---+---+
+---+---+
| _1| _2|
+---+---+
| 9| 10|
+---+---+
+---+---+
| _1| _2|
+---+---+
| 1| 2|
+---+---+
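For comparison, Spark's built-in ntile window function gives the same equal-sizes-plus-remainder split in a single pass. This is only a sketch, not part of the original answer (note that the global ordering funnels all rows through one partition):

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, ntile}

val n = 4
val df = Seq((1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)).toDF
// ntile(n) assigns each row a bucket 1..n; bucket sizes differ by at most one row
val w = Window.orderBy(monotonically_increasing_id())
val bucketed = df.withColumn("bucket", ntile(n).over(w))
val frames = (1 to n).map(i => bucketed.where($"bucket" === i).drop("bucket"))
frames.foreach(_.show)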
If you want to divide the dataset into n equal datasets:
double[] arraySplit = {1, 1, 1, ..., 1}; // n equal weights; change the numbers to split by ratio instead
List<Dataset<String>> datasetList = dataset.randomSplitAsList(arraySplit, 1);
I don't know how this performs compared to the other options, but I think it at least looks cleaner:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 0).toDF
val split_count = 4
// Tag each row with a bucket id, then carve out one dataframe per bucket
val to_be_split = df.withColumn("split", monotonically_increasing_id % split_count)
val dfs = (0 until split_count).map(n => to_be_split.where('split === n).drop('split))
dfs.foreach(_.show)
+-----+
|value|
+-----+
| 1|
| 5|
| 9|
+-----+
+-----+
|value|
+-----+
| 2|
| 6|
| 0|
+-----+
+-----+
|value|
+-----+
| 3|
| 7|
+-----+
+-----+
|value|
+-----+
| 4|
| 8|
+-----+
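One caveat: monotonically_increasing_id is only guaranteed to be consecutive within a partition, so on multi-partition data the modulo buckets above can come out unbalanced. A hedged alternative, sketched against the same df and split_count as above, uses row_number to guarantee round-robin buckets (at the cost of ordering every row through a single partition):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// row_number() yields consecutive 1..N, so the buckets differ in size by at most one row
val w = Window.orderBy(monotonically_increasing_id())
val numbered = df.withColumn("split", (row_number().over(w) - 1) % split_count)
val balanced = (0 until split_count).map(n => numbered.where('split === n).drop('split))
balanced.foreach(_.show)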
You can use
val result = df.randomSplit(Array(0.25, 0.25, 0.25, 0.25), 1)
to split the dataframe into smaller chunks. The array can be extended according to the desired number of splits, and the second argument (1) is the seed, which can be changed if needed. Note that randomSplit assigns rows at random, so the resulting pieces are only approximately equal in size. Read the results with
result(0).count
or
result(1).count
and so on, depending on how many splits were made.