How can I loop through all the rows of a Spark dataframe and apply a function to each row?
I need to loop through all the rows of a Spark dataframe and use the values in each row as inputs for a function. Basically, I want to call the function once per row, with that row's values as its arguments.
The thing is, I can't use collect() because the dataframe is too big.
I am pretty sure I have to use map() to perform what I want, and I have tried doing this:
MyDF.rdd.map(MyFunction)
But how can I specify the information I want to retrieve from the DataFrame? Something like Row(0), Row(1) and Row(2)? And how do I "feed" those values to my function?
"Looping" is not what you really want here, but a "projection". If, for example, your dataframe has 2 fields of type int and string, your code would look like this:
val myFunction = (i:Int,s:String) => ??? // do something with the variables
df.rdd.map(row => myFunction(row.getAs[Int]("field1"), row.getAs[String]("field2")))
or with pattern matching:
df.rdd.map{case Row(field1:Int, field2:String) => myFunction(field1,field2)}
Note that in Spark 2, you can use map directly on your dataframe and get a new dataframe back (in Spark 1.6, map would result in an RDD instead).
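To illustrate the Spark 2 behavior, here is a minimal sketch that maps directly over a typed Dataset instead of dropping down to the RDD. The column names field1/field2, the case class Input, and the function body are assumptions for illustration; a SparkSession named spark and a dataframe df with matching columns are presumed to exist.

```scala
import spark.implicits._  // brings Encoders into scope for .as[...] and .map

// Hypothetical schema matching the example above: (field1: Int, field2: String)
case class Input(field1: Int, field2: String)

// Placeholder for whatever your per-row function actually does
val myFunction = (i: Int, s: String) => s"$s has value $i"

// In Spark 2+, map on a Dataset yields a new Dataset, not an RDD
val result = df.as[Input]
  .map(r => myFunction(r.field1, r.field2))
```

Because map on a Dataset needs an Encoder for its result type, this works out of the box for primitives, Strings, and case classes once spark.implicits._ is imported.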
Note that instead of using map on the RDD, you could also use a user-defined function (UDF) in the DataFrame API.
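The UDF alternative can be sketched as follows. The function body and the output column name "result" are placeholders; the input column names field1/field2 follow the example above, and a dataframe df with those columns is assumed.

```scala
import org.apache.spark.sql.functions.{col, udf}

// Wrap the per-row logic as a UDF (placeholder body for illustration)
val myUdf = udf((i: Int, s: String) => s"$s has value $i")

// Apply it column-wise; this stays in the DataFrame API, so Catalyst
// can still optimize the surrounding query plan
val out = df.withColumn("result", myUdf(col("field1"), col("field2")))
```

Compared with df.rdd.map, the UDF keeps the result as a DataFrame and avoids the serialization round-trip to an RDD, at the cost of the function being a black box to the optimizer.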