
Scala: How can I replace values in a DataFrame using Scala?

For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks.

Edit:

|year| make|model| comment            |blank|
|2012|Tesla| S   | No comment         |     | 
|1997| Ford| E350|Go get one now th...|     | 
|2015|Chevy| Volt| null               | null| 

This is my DataFrame. I'm trying to change Tesla in the make column to S.
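For the numeric case in the original question, the same when/otherwise pattern used in the answers below applies; a minimal sketch, assuming a DataFrame df with a numeric column named value (both names hypothetical):

import org.apache.spark.sql.functions._

// Replace every 0.2 in the (hypothetical) value column with 0,
// leaving all other numbers unchanged.
val cleaned = df.withColumn("value",
  when(col("value") === 0.2, 0.0).otherwise(col("value")))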

Spark 1.6.2, Java code (sorry), this will change every instance of Tesla to S for the entire dataframe without passing through an RDD:

dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
                             .otherwise(col("make")));

Edited to add @marshall245's "otherwise" to ensure non-Tesla values aren't converted to NULL.
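To make the pitfall concrete, a short Scala sketch of the two variants (illustrative, using the question's columns):

import org.apache.spark.sql.functions._

// Without otherwise: every non-Tesla row gets make = null.
val nulled = dataframe.withColumn("make", when(col("make") === "Tesla", "S"))

// With otherwise: non-matching rows keep their original value.
val kept = dataframe.withColumn("make",
  when(col("make") === "Tesla", "S").otherwise(col("make")))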

Building off of the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) method, the remainder of the column becomes null.

import org.apache.spark.sql.functions._

val newsdf = sdf.withColumn("make", when(col("make") === "Tesla", "S")
                                      .otherwise(col("make")))

Old DataFrame

+-----+-----+ 
| make|model| 
+-----+-----+ 
|Tesla|    S| 
| Ford| E350| 
|Chevy| Volt| 
+-----+-----+ 

New DataFrame

+-----+-----+
| make|model|
+-----+-----+
|    S|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
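If several distinct values need replacing, when calls can be chained before the final otherwise (a sketch; the Ford mapping is a hypothetical second replacement):

import org.apache.spark.sql.functions._

val remapped = sdf.withColumn("make",
  when(col("make") === "Tesla", "S")
    .when(col("make") === "Ford", "F") // hypothetical second rule
    .otherwise(col("make")))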

This can be achieved in dataframes with user-defined functions (UDFs).

import org.apache.spark.sql.functions._

val sqlcont = new org.apache.spark.sql.SQLContext(sc)

// Build a sample DataFrame from JSON strings (jsonRDD is the Spark 1.x API,
// deprecated in later versions in favour of spark.read.json).
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))

// UDF: rewrite "Tesla" to "S", pass every other value through unchanged.
val makeSIfTesla = udf { (make: String) =>
  if (make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show

Note: As mentioned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (Azeroth2b's answer).

I cannot delete this answer as it has been accepted.


Here is my take on this one:

import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
)
val sqlContext = new SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

val dataframe = rdd.toDF()

dataframe.foreach(println)

// Rewrite the make column (index 1) row by row.
dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)

// [2012,S,S]
// [1997,Ford,E350]
// [2015,Chevy,Volt]

You can actually use map directly on the DataFrame.

So you basically check column 1 for the String tesla. If it's tesla, use the value S for make; otherwise keep the current value of column 1.

Then build a Row with all the data from the original row, using the zero-based indexes (Row(row(0), make, row(2)) in my example).

There is probably a better way to do it; I am not that familiar yet with the Spark umbrella.
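In Spark 2.x the same row-by-row rewrite can be expressed against a typed Dataset, which avoids index-based access (a sketch; the Car case class and the data are hypothetical):

case class Car(year: Int, make: String, model: String)

// Assumes an active SparkSession named spark (Spark 2.x+).
import spark.implicits._

val cars = Seq(
  Car(2012, "Tesla", "S"), Car(1997, "Ford", "E350"), Car(2015, "Chevy", "Volt")
).toDS()

// copy() rewrites just the make field; the other fields pass through by name.
val fixed = cars.map(c => if (c.make.equalsIgnoreCase("tesla")) c.copy(make = "S") else c)
fixed.show()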

df2.na.replace("Name", Map("John" -> "Akshay", "Cindy" -> "Jayita")).show()

This uses replace in class DataFrameNaFunctions, with the signature replace[T](col: String, replacement: Map[T, T]): org.apache.spark.sql.DataFrame.
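Applied to the question's data, a sketch of the same call:

// Replace "Tesla" with "S" in the make column; values without a mapping
// are left untouched, so no otherwise(...) is needed here.
val fixed = dataframe.na.replace("make", Map("Tesla" -> "S"))
fixed.show()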

For running this you must have an active Spark session and a dataframe with headers on.

import org.apache.spark.sql.functions._

val base_optin_email = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .schema(schema_base_optin)
  .csv(file_optin_email)
  .where("CPF IS NOT NULL")
  // strip "." and "-" from the CPF to build the key
  .withColumn("CARD_KEY", translate(translate(col("cpf"), ".", ""), "-", ""))
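The nested translate calls remove "." and "-" one character set at a time; the same cleanup could also be written with a single regexp_replace (a sketch, reusing the variables above):

import org.apache.spark.sql.functions._

// "[.-]" matches either "." or "-"; both are replaced with the empty string.
val withKey = base_optin_email.withColumn("CARD_KEY",
  regexp_replace(col("cpf"), "[.-]", ""))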
