

How to save the iterated row record to new data frame or list while looping, using Spark SQL?

I have a data frame. That data frame gives me a list of records, and I iterate over each row and do some manipulation.

for (row <- dataframe.rdd.collect()) {
  // val anyVal = row.mkString(",").split(",")(columnIndex) // take the desired column by index
}

Then I make some checks, and if the current row matches the requirement, I try to create a new list or collection to save the full row.

Could you please help with an example of how to save this row in a new data frame using Spark SQL?
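
For example, something along these lines is what I am trying to do (a rough sketch; the check on some_value is just a placeholder for my real condition and column):

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row

val matched = ListBuffer[Row]()          // collection to hold the full matching rows
for (row <- dataframe.rdd.collect()) {   // iterate the rows on the driver
  if (row.getAs[Int]("some_value") > 0)  // placeholder check on the current row
    matched += row                       // save the full row
}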

There are different ways to achieve this; the main point is to understand the basic behavior of Spark's main components. None of them (DataFrame, Dataset, RDD) lets you update values in place, because they are immutable objects, but you can iterate over their items and, based on your logic, create a new one from an existing one. Examples:

import spark.implicits._ // needed for .toDF and the Dataset encoders used below

val yourDF = Seq( // Sample
  ("A1", 12, null),       // Record 1
  ("B1", -1, "Mexico"),   // Record 2
  ("C1", 2, "Argentina")  // Record 3
).toDF("id", "some_value", "country") // Column definition

yourDF.show() // Visualize your DF

The above code will output:

+---+----------+---------+
| id|some_value|  country|
+---+----------+---------+
| A1|        12|     null|
| B1|        -1|   Mexico|
| C1|         2|Argentina|
+---+----------+---------+

Given that it is a DataFrame, this is how you can iterate over all rows and access their items:

val newDF = yourDF
  .map(item =>{  // Iterate your DF 
    val id = item.getAs[String]("id") // Access their element (from row object - each item in your DF) - You need to specify datatype and 'column_name' on this approach
    val some_value = item.getAs[Integer]("some_value")
    val country = item.getAs[String]("country")
    val outputCountry = if(country != null) country.substring(0,3) else null
    // Output: id, first 3 chars of the country (if it is not null) and `some_value` multiplied by 10
    (id, outputCountry, some_value*10)
  })

newDF.show()

The above code will output:

+---+----+---+
| _1|  _2| _3|
+---+----+---+
| A1|null|120|
| B1| Mex|-10|
| C1| Arg| 20|
+---+----+---+

As you can see, the column names are not the same as in the first DF. This is because we are creating a new one and did not specify the column names; we can either use .toDF("column_a", "column_b", "column_c") or use a case class, as in the next example.
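
For example, applying .toDF to the tuple-based result above to restore the column names could look like this (a small sketch; the second tuple element holds the truncated country, so it is still named country here):

val namedDF = newDF.toDF("id", "country", "some_value") // name the _1, _2, _3 columns
namedDF.show()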

Let's do the same exercise, but using case classes (with Scala).

case class Country(id: String, some_value: Integer, country: String) // Case class

val newDF = yourDF
  .as[Country] // Cast your DF with a case class to have a Dataset
  .map(country=>{ // iterate dataset
    val id = country.id // Access their element (as object notation, easier!)
    val some_value = country.some_value
    val countryName = country.country
    val outputCountry = if(countryName != null) countryName.substring(0,3) else null
    // Output: id, first 3 chars of the country (if it is not null) and `some_value` multiplied by 10
    Country(id, some_value*10, outputCountry) // Output will use a case class to define the schema of the new object (Dataset[Country])
  })

newDF.show()

The above code will output:

+---+----------+-------+
| id|some_value|country|
+---+----------+-------+
| A1|       120|   null|
| B1|       -10|    Mex|
| C1|        20|    Arg|
+---+----------+-------+
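
Finally, if the goal is only to keep the rows that match a requirement in a new DataFrame (as described in the question), you do not need to collect or map at all; a filter is enough. A minimal sketch, assuming the requirement is simply some_value > 0:

import org.apache.spark.sql.functions.col

val matchingDF = yourDF.filter(col("some_value") > 0) // keep only the rows meeting the condition
matchingDF.show()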

