How can I loop through all the rows of a Spark dataframe and apply a function to each row?
I need to loop through all the rows of a Spark dataframe and use the values in each row as inputs for a function. Basically, I want to call the function once per row, with that row's values as its arguments.
The thing is, I can't use collect() because the dataframe is too big.
I am pretty sure I have to use map() to perform what I want, and I have tried doing this:
MyDF.rdd.map(MyFunction)
But how can I specify the information I want to retrieve from the DataFrame? Something like Row(0), Row(1) and Row(2)? And how do I "feed" those values to my function?
"Looping" is not what you really want here, but a "projection". If, for example, your dataframe has 2 fields of type int and string, your code would look like this:
val myFunction = (i:Int,s:String) => ??? // do something with the variables
df.rdd.map(row => myFunction(row.getAs[Int]("field1"), row.getAs[String]("field2")))
or with pattern matching:
df.rdd.map{case Row(field1:Int, field2:String) => myFunction(field1,field2)}
Note that in Spark 2, you can use map directly on your dataframe and get a new dataframe back (in Spark 1.6, map would result in an RDD instead).
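To illustrate the Spark 2 behavior, here is a minimal sketch that maps directly over a typed Dataset instead of dropping down to the RDD. The column names field1/field2, the case class Input, and the function body are assumptions for illustration; a SparkSession named spark and a dataframe df with matching columns are presumed to exist.

```scala
import spark.implicits._  // brings Encoders into scope for .as[...] and .map

// Hypothetical schema matching the example above: (field1: Int, field2: String)
case class Input(field1: Int, field2: String)

// Placeholder for whatever your per-row function actually does
val myFunction = (i: Int, s: String) => s"$s has value $i"

// In Spark 2+, map on a Dataset yields a new Dataset, not an RDD
val result = df.as[Input]
  .map(r => myFunction(r.field1, r.field2))
```

Because map on a Dataset needs an Encoder for its result type, this works out of the box for primitives, Strings, and case classes once spark.implicits._ is imported.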
Note that instead of using map on the RDD, you could also use a user-defined function (UDF) in the DataFrame API.
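The UDF alternative can be sketched as follows. The function body and the output column name "result" are placeholders; the input column names field1/field2 follow the example above, and a dataframe df with those columns is assumed.

```scala
import org.apache.spark.sql.functions.{col, udf}

// Wrap the per-row logic as a UDF (placeholder body for illustration)
val myUdf = udf((i: Int, s: String) => s"$s has value $i")

// Apply it column-wise; this stays in the DataFrame API, so Catalyst
// can still optimize the surrounding query plan
val out = df.withColumn("result", myUdf(col("field1"), col("field2")))
```

Compared with df.rdd.map, the UDF keeps the result as a DataFrame and avoids the serialization round-trip to an RDD, at the cost of the function being a black box to the optimizer.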