

How to save the iterated row record to new data frame or list while looping, using Spark SQL?

I have a data frame. That data frame gives me a list of records, and I iterate over each row and do some manipulation.

for (row <- dataframe.rdd.collect()) {
  // val anyVal = row.mkString(",").split(",")(columnIndex) // take the desired column by index
}

Then I make some checks, and if the current row matches the requirement, I try to create a new list or collection to save the full row.

Could you please help with an example of how to save this row in a new data frame using Spark SQL?
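
For example, something along these lines is what I am trying to do (a rough sketch; the check on some_value is just a placeholder for my real condition and column):

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row

val matched = ListBuffer[Row]()          // collection to hold the full matching rows
for (row <- dataframe.rdd.collect()) {   // iterate the rows on the driver
  if (row.getAs[Int]("some_value") > 0)  // placeholder check on the current row
    matched += row                       // save the full row
}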

There are different ways to achieve this; the main point is to understand the basic behavior of Spark's main components. None of them (DataFrame, Dataset, RDD) lets you update values in place, because they are immutable objects, but you can iterate over their items and, based on your logic, create a new one from an existing one. Examples:

import spark.implicits._ // needed for .toDF and the Dataset encoders used below

val yourDF = Seq( // Sample
  ("A1", 12, null),       // Record 1
  ("B1", -1, "Mexico"),   // Record 2
  ("C1", 2, "Argentina")  // Record 3
).toDF("id", "some_value", "country") // Column definition

yourDF.show() // Visualize your DF

The above code will output:

+---+----------+---------+
| id|some_value|  country|
+---+----------+---------+
| A1|        12|     null|
| B1|        -1|   Mexico|
| C1|         2|Argentina|
+---+----------+---------+

Given that it is a DataFrame, this is how you can iterate over all rows and access their items:

val newDF = yourDF
  .map(item =>{  // Iterate your DF 
    val id = item.getAs[String]("id") // Access their element (from row object - each item in your DF) - You need to specify datatype and 'column_name' on this approach
    val some_value = item.getAs[Integer]("some_value")
    val country = item.getAs[String]("country")
    val outputCountry = if(country != null) country.substring(0,3) else null
    // Output: id, first 3 chars of the country (if it is not null) and `some_value` multiplied by 10
    (id, outputCountry, some_value*10)
  })

newDF.show()

The above code will output:

+---+----+---+
| _1|  _2| _3|
+---+----+---+
| A1|null|120|
| B1| Mex|-10|
| C1| Arg| 20|
+---+----+---+

As you can see, the column names are not the same as in the first DF. This is because we are creating a new one and did not specify the column names; we can either use .toDF("column_a", "column_b", "column_c") or use a case class, as in the next example.
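
For example, applying .toDF to the tuple-based result above to restore the column names could look like this (a small sketch; the second tuple element holds the truncated country, so it is still named country here):

val namedDF = newDF.toDF("id", "country", "some_value") // name the _1, _2, _3 columns
namedDF.show()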

Let's do the same exercise, but using case classes (with Scala).

case class Country(id: String, some_value: Integer, country: String) // Case class

val newDF = yourDF
  .as[Country] // Cast your DF with a case class to have a Dataset
  .map(country=>{ // iterate dataset
    val id = country.id // Access their element (as object notation, easier!)
    val some_value = country.some_value
    val countryName = country.country
    val outputCountry = if(countryName != null) countryName.substring(0,3) else null
    // Output: id, first 3 chars of the country (if it is not null) and `some_value` multiplied by 10
    Country(id, some_value*10, outputCountry) // Output will use a case class to define the schema of the new object (Dataset[Country])
  })

newDF.show()

The above code will output:

+---+----------+-------+
| id|some_value|country|
+---+----------+-------+
| A1|       120|   null|
| B1|       -10|    Mex|
| C1|        20|    Arg|
+---+----------+-------+
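
Finally, if the goal is only to keep the rows that match a requirement in a new DataFrame (as described in the question), you do not need to collect or map at all; a filter is enough. A minimal sketch, assuming the requirement is simply some_value > 0:

import org.apache.spark.sql.functions.col

val matchingDF = yourDF.filter(col("some_value") > 0) // keep only the rows meeting the condition
matchingDF.show()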

