
Create a new column with a range of integers from an existing Integer column in a Spark Scala DataFrame

Suppose I have a Spark Scala DataFrame object like:

+--------+
|col1    |
+--------+
|1       |
|3       |
+--------+

I want a DataFrame like:

+-----------------+
|col1  |col2      |
+-----------------+
|1     |[0,1]     |
|3     |[0,1,2,3] |
+-----------------+

Spark provides a large set of APIs/functions to work with, and most of the time the built-in functions are enough; for more specific tasks, User-Defined Functions (UDFs) can be written.

Reference: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.col
import spark.implicits._

// Define the UDF first so it is in scope when the transformation below uses it.
// It maps an index n to the inclusive sequence 0..n.
def indexToRange: UserDefinedFunction = udf((index: Integer) => for (i <- 0 to index) yield i)

val df = spark.sparkContext.parallelize(Seq(1, 3)).toDF("index")
val rangeDF = df.withColumn("range", indexToRange(col("index")))
rangeDF.show(10)
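The function wrapped by the UDF is ordinary Scala, so its logic can be sanity-checked without a Spark session. A minimal sketch (the standalone name `indexToRangeFn` is only for illustration, not part of the answer's code):

```scala
// Core logic of the UDF in isolation: map n to the inclusive range 0..n.
// indexToRangeFn is a hypothetical standalone name used for checking outside Spark.
val indexToRangeFn: Int => List[Int] = index => (0 to index).toList

println(indexToRangeFn(1)) // List(0, 1)
println(indexToRangeFn(3)) // List(0, 1, 2, 3)
```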
You can achieve it with the approach below:

    val input_df = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5)).toDF("col1")
    input_df.show(false)
Input:
+----+
|col1|
+----+
|1   |
|2   |
|3   |
|4   |
|5   |
+----+

    val output_df = input_df.rdd.map(x => x(0).toString()).map(x => (x, Range(0, x.toInt + 1).mkString(","))).toDF("col1", "col2")
    output_df.withColumn("col2", split($"col2", ",")).show(false)

Output:
+----+------------------+
|col1|col2              |
+----+------------------+
|1   |[0, 1]            |
|2   |[0, 1, 2]         |
|3   |[0, 1, 2, 3]      |
|4   |[0, 1, 2, 3, 4]   |
|5   |[0, 1, 2, 3, 4, 5]|
+----+------------------+
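The string-building step at the heart of this approach is plain Scala and can be checked without a Spark session; splitting the joined string back apart recovers the range, mirroring what `split($"col2", ",")` does per row:

```scala
// Range(0, n + 1) produces 0, 1, ..., n; mkString joins the values with commas,
// which is what the map over the RDD does for each row value.
val n = 3
val joined = Range(0, n + 1).mkString(",")  // "0,1,2,3"
val recovered = joined.split(",").toList    // List("0", "1", "2", "3")
println(joined)
println(recovered)
```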

Hope this helps!
