
Dataframe to RDD piece of code is not working

I am trying to read each row of a dataframe and convert the row data into a custom bean class. The problem is that the code is not getting executed: to check, I added multiple print statements, but none of the ones inside `df.rdd.map { row => ... }` ever ran, as if the whole block of code were skipped.

Code snippet:

    print("data frame:", df.show())

    df.rdd.map(row => {
      // Debugging
      println("Debugging")

      if (row.isNullAt(0)) {
        println("null data")
      } else {
        println(row.get(0).toString)
      }

      val employeeJobData = new EmployeeJobData

      if (row.get(0).toString == null || row.get(0).toString.isEmpty) {
        employeeJobData.setEmployeeId("NULL_KEY_VALUE")
      } else {
        employeeJobData.setEmployeeId(row.get(0).toString)
      }
      employeeJobDataList.add(employeeJobData)
    })

Output of `df.show()`:

    +-----------+-------------+--------------+--------+-----+-------+
    |employee_id|employee_name|employee_email|paygroup|level|dept_id|
    +-----------+-------------+--------------+--------+-----+-------+
    |13         |         null|          null|    null| null|   null|
    |14         |         null|          null|    null| null|   null|
    |15         |         null|          null|    null| null|   null|
    |16         |         null|          null|    null| null|   null|
    |17         |         null|          null|    null| null|   null|
    +-----------+-------------+--------------+--------+-----+-------+
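For context on why nothing prints: `map` on an RDD is a lazy transformation, so the closure only runs once an action forces evaluation. The same effect can be seen with a plain Scala `Iterator`, no Spark needed (a minimal sketch; the object and value names here are illustrative only):

```scala
object LazyMapDemo extends App {
  // Like an RDD transformation, Iterator.map is lazy: the closure
  // does not run when map is called, only when the result is consumed.
  val mapped = Iterator(1, 2, 3).map { x =>
    println(s"mapping $x") // not printed at the point map is called
    x * 2
  }

  println("before forcing")       // printed first, before any "mapping ..." line
  val result = mapped.toList      // consuming the iterator runs the closure
  println(result)                 // List(2, 4, 6)
}
```

In Spark the analogous fix is to end the chain with an action such as `collect()` or `foreach(...)`, as the answer below does with `collectAsList`.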

Your `map` never runs because RDD transformations are lazy: nothing executes until an action (such as `collect` or `foreach`) is called, and mutating a driver-side list from inside `map` would not work on the executors anyway. You can remove the unnecessary code as below and get a `java.util.List[EmployeeJobData]`:

    import java.util

    object MapToCaseClass {

      def main(args: Array[String]): Unit = {
        val spark = Constant.getSparkSess

        import spark.implicits._

        // Note: the first column must be a String here, because the mapper calls row.getString(0)
        val df = List(("12", "name", "email@email.com", "paygroup", "level", "dept_id")).toDF()
        val employeeList: util.List[EmployeeJobData] = df
          .map(row => {
            val id = if (null == row.getString(0) || "null".equals(row.getString(0)) || row.getString(0).trim.isEmpty) {
              "NULL_KEY_VALUE"
            } else {
              row.getString(0)
            }
            EmployeeJobData(id, row.getString(1), row.getString(2),
              row.getString(3), row.getString(4), row.getString(5))
          })
          .collectAsList  // the action that actually triggers execution
      }

    }

    case class EmployeeJobData(employee_id: String, employee_name: String, employee_email: String,
                               paygroup: String, level: String, dept_id: String)

The above can be improved further by setting the data type of `employee_id` and `dept_id` to `Long` (if they are numeric). The `"null".equals` and `.isEmpty` checks for `employee_id` can then be avoided, and the code can be reduced further.
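A minimal sketch of that improvement (the `Long`-typed case class variant, the `-1L` sentinel, and the helper name are assumptions for illustration, not part of the original answer):

```scala
// Hypothetical variant of the case class with the numeric columns typed as Long.
case class EmployeeJobDataTyped(employee_id: Long, employee_name: String,
                                employee_email: String, paygroup: String,
                                level: String, dept_id: Long)

// With Long-typed columns the "null".equals / .isEmpty string checks vanish;
// a single null guard per numeric field is all that remains.
def idOrDefault(id: java.lang.Long): Long =
  if (id == null) -1L else id.longValue() // -1L is an assumed sentinel value
```

In the Spark mapper this would read, for example, `idOrDefault(row.getAs[java.lang.Long]("employee_id"))` in place of the string comparisons.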
