Spark Scala DataFrame: get value for each row and assign to variables

I have a DataFrame like the one below:

val df = spark.sql("select * from table")

row1|row2|row3

A1,B1,C1

A2,B2,C2

A3,B3,C3

I want to iterate with a for loop to get values like this:

val value1 = "A1"

val value2 = "B1"

val value3 = "C1"


Please help me.


You have 2 options:

  • Solution 1: If your data is big, you have to stick with DataFrames. To apply a function to every row, define a UDF.

  • Solution 2: If your data is small, you can collect it to the driver machine and then iterate over the rows with map.


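import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq((1, 2, 3), (4, 5, 6)).toDF("a", "b", "c")
def sum(a: Int, b: Int, c: Int): Int = a + b + c

// Solution 1: wrap the columns in a struct so the UDF receives each row as a Row
val myUDF = udf((r: Row) => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))

df.select(myUDF(struct($"a", $"b", $"c")).as("sum")).show

// Solution 2: collect the rows to the driver and map over the resulting Array[Row]
df.collect.map(r => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))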

Output for both cases (the sums 6 and 15, shown here as printed by show):

|  6|
| 15|


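val myUDF = udf((r: Row) => {
  // Read each column of the row into its own variable
  val value1 = r.getAs[Int](0)
  val value2 = r.getAs[Int](1)
  val value3 = r.getAs[Int](2)

  // myFunction is a placeholder for whatever function you want to apply
  myFunction(value1, value2, value3)
})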
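If the goal is literally to have value1, value2 and value3 as local variables for each row, a minimal sketch along the lines of Solution 2 could look like the following. It assumes the data is small enough to collect to the driver and that all three columns are strings named row1, row2 and row3 as in the question. The object name RowValues, the helper rowValues, the local SparkSession and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; the names here are illustrative only.
object RowValues {
  // Collects the rows to the driver and reads each column into a local value.
  // Assumes three string columns named row1, row2 and row3, and a small DataFrame.
  def rowValues(df: DataFrame): Seq[(String, String, String)] =
    df.collect().toSeq.map { r =>
      val value1 = r.getAs[String]("row1")
      val value2 = r.getAs[String]("row2")
      val value3 = r.getAs[String]("row3")
      (value1, value2, value3)
    }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch; in spark-shell a session named spark already exists.
    val spark = SparkSession.builder().master("local[*]").appName("row-values").getOrCreate()
    import spark.implicits._

    // Sample data standing in for spark.sql("select * from table")
    val df = Seq(("A1", "B1", "C1"), ("A2", "B2", "C2"), ("A3", "B3", "C3"))
      .toDF("row1", "row2", "row3")

    // The for loop from the question: one set of variables per row
    for ((value1, value2, value3) <- rowValues(df)) {
      println(s"$value1, $value2, $value3")
    }

    spark.stop()
  }
}
```

Because collect() pulls every row into driver memory, this only makes sense for small results; for large data the UDF approach keeps the work on the executors.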


