Spark Scala DataFrame: get the values of each row and assign them to variables
I have a DataFrame like the one below:
val df = spark.sql("select * from table")
row1|row2|row3
A1,B1,C1
A2,B2,C2
A3,B3,C3
I want to iterate over the rows in a for loop and get values like this:
val value1 = "A1"
val value2 = "B1"
val value3 = "C1"
function(value1, value2, value3)
Please help me.
You have 2 options:
Solution 1 - If your data is big, you must stick with DataFrames. To apply a function to every row, you then have to define a UDF.
Solution 2 - If your data is small, you can collect it to the driver machine and then iterate over the rows with map.
Example:
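import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
import spark.implicits._ // already in scope in spark-shell

val df = Seq((1, 2, 3), (4, 5, 6)).toDF("a", "b", "c")
def sum(a: Int, b: Int, c: Int): Int = a + b + c

// Solution 1: wrap the columns in a struct and apply a UDF to each row
val myUDF = udf((r: Row) => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
df.select(myUDF(struct($"a", $"b", $"c")).as("sum")).show()

// Solution 2: collect the rows to the driver and map over them locally
df.collect().map(r => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))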
Output of Solution 1 (Solution 2 returns the same values as a local Array(6, 15)):
+---+
|sum|
+---+
|  6|
| 15|
+---+
EDIT: To assign each value to its own variable before calling your function:
val myUDF = udf((r: Row) => {
  val value1 = r.getAs[Int](0)
  val value2 = r.getAs[Int](1)
  val value3 = r.getAs[Int](2)
  myFunction(value1, value2, value3)
})
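
For the string columns in the question, the collect-based route (Solution 2) could look roughly like the sketch below. It is only an illustration under some assumptions: the object name RowIterationSketch, the helper collectResults and the body of myFunction are made up for this example, and the sample rows stand in for the result of spark.sql("select * from table"). It also assumes the data is small enough to fit on the driver and that no column is null.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object RowIterationSketch {
  // Hypothetical stand-in for the question's function(value1, value2, value3).
  def myFunction(value1: String, value2: String, value3: String): String =
    s"$value1-$value2-$value3"

  // Collects the rows to the driver and binds each column to its own variable.
  def collectResults(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Assumed sample data, replacing spark.sql("select * from table").
    val df = Seq(("A1", "B1", "C1"), ("A2", "B2", "C2"), ("A3", "B3", "C3"))
      .toDF("row1", "row2", "row3")

    // Pattern matching on Row extracts the three values in one step.
    // A null column would not match the String patterns and would throw a MatchError.
    df.collect().toSeq.map {
      case Row(value1: String, value2: String, value3: String) =>
        myFunction(value1, value2, value3)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-iteration-sketch")
      .getOrCreate()
    try collectResults(spark).foreach(println)
    finally spark.stop()
  }
}
```

In spark-shell the SparkSession and the implicits import already exist, so only the part from toDF to the map call is needed. If the table is too large to collect, the UDF approach from Solution 1 is the safer choice.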