
Scala: How can I replace values in a DataFrame using Scala?

For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks.

Edit:

|year| make|model| comment            |blank|
|2012|Tesla| S   | No comment         |     | 
|1997| Ford| E350|Go get one now th...|     | 
|2015|Chevy| Volt| null               | null| 

This is my DataFrame. I'm trying to change Tesla in the make column to S.
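For the numeric case in the original question, the same when/otherwise pattern used in the answers below applies; a minimal sketch, assuming a DataFrame df with a numeric column named value (both names hypothetical):

import org.apache.spark.sql.functions._

// Replace every 0.2 in the (hypothetical) value column with 0,
// leaving all other numbers unchanged.
val cleaned = df.withColumn("value",
  when(col("value") === 0.2, 0.0).otherwise(col("value")))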

Spark 1.6.2, Java code (sorry), this will change every instance of Tesla to S for the entire dataframe without passing through an RDD:

dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
                             .otherwise(col("make")));

Edited to add @marshall245's "otherwise" to ensure non-Tesla values aren't converted to NULL.
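To make the pitfall concrete, a short Scala sketch of the two variants (illustrative, using the question's columns):

import org.apache.spark.sql.functions._

// Without otherwise: every non-Tesla row gets make = null.
val nulled = dataframe.withColumn("make", when(col("make") === "Tesla", "S"))

// With otherwise: non-matching rows keep their original value.
val kept = dataframe.withColumn("make",
  when(col("make") === "Tesla", "S").otherwise(col("make")))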

Building off of the solution from @Azeroth2b: if you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) method, the remainder of the column becomes null.

import org.apache.spark.sql.functions._

val newsdf = sdf.withColumn("make", when(col("make") === "Tesla", "S")
                                      .otherwise(col("make")))

Old DataFrame

+-----+-----+ 
| make|model| 
+-----+-----+ 
|Tesla|    S| 
| Ford| E350| 
|Chevy| Volt| 
+-----+-----+ 

New DataFrame

+-----+-----+
| make|model|
+-----+-----+
|    S|    S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
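If several distinct values need replacing, when calls can be chained before the final otherwise (a sketch; the Ford mapping is a hypothetical second replacement):

import org.apache.spark.sql.functions._

val remapped = sdf.withColumn("make",
  when(col("make") === "Tesla", "S")
    .when(col("make") === "Ford", "F") // hypothetical second rule
    .otherwise(col("make")))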

This can be achieved in dataframes with user-defined functions (UDFs).

import org.apache.spark.sql.functions._

val sqlcont = new org.apache.spark.sql.SQLContext(sc)

// Build a sample DataFrame from JSON strings (jsonRDD is the Spark 1.x API,
// deprecated in later versions in favour of spark.read.json).
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))

// UDF: rewrite "Tesla" to "S", pass every other value through unchanged.
val makeSIfTesla = udf { (make: String) =>
  if (make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show

Note: As mentioned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (Azeroth2b's answer).

I cannot delete this answer as it has been accepted.


Here is my take on this one:

import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
)
val sqlContext = new SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

val dataframe = rdd.toDF()

dataframe.foreach(println)

// Rewrite the make column (index 1) row by row.
dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)

// [2012,S,S]
// [1997,Ford,E350]
// [2015,Chevy,Volt]

You can actually use map directly on the DataFrame.

So you basically check column 1 for the String tesla. If it's tesla, use the value S for make; otherwise keep the current value of column 1.

Then build a Row with all the data from the original row, using the zero-based indexes (Row(row(0), make, row(2)) in my example).

There is probably a better way to do it; I am not that familiar yet with the Spark umbrella.
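In Spark 2.x the same row-by-row rewrite can be expressed against a typed Dataset, which avoids index-based access (a sketch; the Car case class and the data are hypothetical):

case class Car(year: Int, make: String, model: String)

// Assumes an active SparkSession named spark (Spark 2.x+).
import spark.implicits._

val cars = Seq(
  Car(2012, "Tesla", "S"), Car(1997, "Ford", "E350"), Car(2015, "Chevy", "Volt")
).toDS()

// copy() rewrites just the make field; the other fields pass through by name.
val fixed = cars.map(c => if (c.make.equalsIgnoreCase("tesla")) c.copy(make = "S") else c)
fixed.show()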

df2.na.replace("Name", Map("John" -> "Akshay", "Cindy" -> "Jayita")).show()

This uses replace in class DataFrameNaFunctions, with the signature replace[T](col: String, replacement: Map[T, T]): org.apache.spark.sql.DataFrame.
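Applied to the question's data, a sketch of the same call:

// Replace "Tesla" with "S" in the make column; values without a mapping
// are left untouched, so no otherwise(...) is needed here.
val fixed = dataframe.na.replace("make", Map("Tesla" -> "S"))
fixed.show()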

For running this you must have an active Spark session and a dataframe with headers on.

import org.apache.spark.sql.functions._

val base_optin_email = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .schema(schema_base_optin)
  .csv(file_optin_email)
  .where("CPF IS NOT NULL")
  // strip "." and "-" from the CPF to build the key
  .withColumn("CARD_KEY", translate(translate(col("cpf"), ".", ""), "-", ""))
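The nested translate calls remove "." and "-" one character set at a time; the same cleanup could also be written with a single regexp_replace (a sketch, reusing the variables above):

import org.apache.spark.sql.functions._

// "[.-]" matches either "." or "-"; both are replaced with the empty string.
val withKey = base_optin_email.withColumn("CARD_KEY",
  regexp_replace(col("cpf"), "[.-]", ""))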
