How can I loop through all the rows of a Spark dataframe and apply a function to each row?

I need to loop through all the rows of a Spark dataframe and use the values in each row as inputs for a function.

Basically, I want this to happen:

  1. Get a row of the dataframe
  2. Separate the values in that row into different variables
  3. Use those variables as inputs for a function I defined

The thing is, I can't use collect() because the dataframe is too big.

I am pretty sure I have to use map() to do what I want, and I have tried this:

MyDF.rdd.map(MyFunction)

But how can I specify the information I want to retrieve from the DataFrame? Something like Row(0), Row(1) and Row(2)?

And how do I "feed" those values to my function?

"Looping" is not what you really want, but a "projection". “循环”不是您真正想要的,而是“投影”。 If for example your dataframe has 2 fields of type int and string, your code would look like this: 例如,如果您的数据框具有2个类型为int和string的字段,则您的代码将如下所示:

val myFunction = (i: Int, s: String) => ??? // do something with the variables

df.rdd.map(row => myFunction(row.getAs[Int]("field1"), row.getAs[String]("field2")))

or with pattern matching:

import org.apache.spark.sql.Row // needed for the Row(...) pattern
df.rdd.map { case Row(field1: Int, field2: String) => myFunction(field1, field2) }
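
For reference, here is a minimal self-contained sketch of the two variants above. The object name, the local SparkSession, the sample data and the body of myFunction are all assumptions made for illustration; only the column names field1 and field2 come from the snippets above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RowProjectionSketch {
  // Hypothetical stand-in for the function you defined.
  val myFunction: (Int, String) => String = (i, s) => s"$s-$i"

  // Variant 1: read each value from the Row by column name.
  def byName(df: DataFrame): RDD[String] =
    df.rdd.map(row => myFunction(row.getAs[Int]("field1"), row.getAs[String]("field2")))

  // Variant 2: destructure the Row with pattern matching.
  // A row that does not match (e.g. a null Int) would throw a MatchError.
  def byPattern(df: DataFrame): RDD[String] =
    df.rdd.map { case Row(field1: Int, field2: String) => myFunction(field1, field2) }

  def main(args: Array[String]): Unit = {
    // Local session and sample rows are assumptions for this sketch only.
    val spark = SparkSession.builder().master("local[*]").appName("row-projection-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("field1", "field2")

    // take(n) pulls only a few results to the driver, unlike collect().
    byName(df).take(2).foreach(println)
    byPattern(df).take(2).foreach(println)

    spark.stop()
  }
}
```

Nothing runs until an action such as take, count or saveAsTextFile is called on the result, so the map itself never brings the whole dataframe to the driver.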

Note that in Spark 2 you can call map directly on your dataframe and get a new Dataset back, as long as an Encoder for the result type is available (for example via import spark.implicits._). In Spark 1.6, map would return an RDD instead.
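
As a rough sketch of the Spark 2 variant, assuming the same hypothetical myFunction, sample data and local session as before, the map can be applied to the dataframe itself. The Encoder is passed explicitly here to make the requirement visible; importing spark.implicits._ would supply it implicitly.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}

object DatasetMapSketch {
  // Hypothetical stand-in for the function you defined.
  val myFunction: (Int, String) => String = (i, s) => s"$s-$i"

  // map on a DataFrame returns a Dataset of the function's result type.
  def applyToRows(df: DataFrame): Dataset[String] =
    df.map(row => myFunction(row.getAs[Int]("field1"), row.getAs[String]("field2")))(Encoders.STRING)

  def main(args: Array[String]): Unit = {
    // Local session and sample rows are assumptions for this sketch only.
    val spark = SparkSession.builder().master("local[*]").appName("dataset-map-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("field1", "field2")

    // toDF turns the Dataset[String] back into a dataframe with a named column.
    applyToRows(df).toDF("result").show()

    spark.stop()
  }
}
```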

Note that instead of using map on the RDD, you could also use a "User Defined Function" (UDF) in the dataframe API.
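
A minimal sketch of the UDF route could look like the following. The result column name, the sample data and the body of myFunction are assumptions for illustration; udf, col and withColumn are the standard dataframe API calls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object UdfSketch {
  // Hypothetical stand-in for the function you defined.
  val myFunction: (Int, String) => String = (i, s) => s"$s-$i"

  // Wrap the plain Scala function so it can be used on columns.
  val myUdf = udf(myFunction)

  // Adds a "result" column (name chosen for this sketch) computed row by row.
  def withResult(df: DataFrame): DataFrame =
    df.withColumn("result", myUdf(col("field1"), col("field2")))

  def main(args: Array[String]): Unit = {
    // Local session and sample rows are assumptions for this sketch only.
    val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("field1", "field2")
    withResult(df).show()

    spark.stop()
  }
}
```

Compared with rdd.map, this keeps the original columns next to the result and stays inside the dataframe API, so the output can be filtered, joined or written out without converting back from an RDD.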

