Scala: How can I replace values in DataFrames using Scala
For example, I want to replace all numbers equal to 0.2 in a column with 0. How can I do that in Scala? Thanks.
Edit:
|year| make|model| comment |blank|
|2012|Tesla| S | No comment | |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null | null|
This is my DataFrame. I'm trying to change "Tesla" in the make column to "S".
Spark 1.6.2, Java code (sorry); this will change every instance of Tesla to S across the entire DataFrame without passing through an RDD:
dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S")
    .otherwise(col("make")));
Edited to add @marshall245's otherwise to ensure non-Tesla values aren't converted to NULL.
Building off of the solution from @Azeroth2b. If you want to replace only a couple of items and leave the rest unchanged, do the following. Without the otherwise(...) call, the remainder of the column becomes null.
import org.apache.spark.sql.functions._

val newsdf = sdf.withColumn("make",
  when(col("make") === "Tesla", "S")
    .otherwise(col("make")))
Old DataFrame:
+-----+-----+
| make|model|
+-----+-----+
|Tesla| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
New DataFrame:
+-----+-----+
| make|model|
+-----+-----+
| S| S|
| Ford| E350|
|Chevy| Volt|
+-----+-----+
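If more than one value needs replacing, when calls can be chained before the final otherwise. A minimal sketch of the idea, with the Spark form in comments and the per-value rule modeled in plain Scala (the Ford -> F mapping is invented here purely for illustration):

```scala
// Spark form (sketch, not compiled here): chain `when` calls and close
// with `otherwise` so unmatched rows keep their current value.
//   val newsdf = sdf.withColumn("make",
//     when(col("make") === "Tesla", "S")
//       .when(col("make") === "Ford", "F")
//       .otherwise(col("make")))

// The same per-value rule in plain Scala; the last case is the `otherwise`.
def replaceMake(make: String): String = make match {
  case "Tesla" => "S"
  case "Ford"  => "F"   // hypothetical extra replacement, for illustration
  case other   => other
}

println(Seq("Tesla", "Ford", "Chevy").map(replaceMake))  // List(S, F, Chevy)
```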
This can be achieved in DataFrames with user-defined functions (UDFs).
import org.apache.spark.sql.functions._
val sqlcont = new org.apache.spark.sql.SQLContext(sc)
val df1 = sqlcont.jsonRDD(sc.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": ""}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": ""}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": ""}"""
)))

val makeSIfTesla = udf { (make: String) =>
  if (make == "Tesla") "S" else make
}
df1.withColumn("make", makeSIfTesla(df1("make"))).show
Note: As mentioned by Olivier Girardot, this answer is not optimized and the withColumn solution is the one to use (see @Azeroth2b's answer).
I can't delete this answer as it has been accepted.
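One nicety of the udf approach, though: the body is an ordinary Scala function, so it can be unit-tested without a SparkContext and, if you like, registered for use from SQL. A sketch, assuming the Spark 1.x SQLContext (sqlcont) from the answer above:

```scala
// The UDF body as a plain function, testable on its own.
def makeSIfTesla(make: String): String =
  if (make == "Tesla") "S" else make

// Spark side (sketch): wrap it with udf, or register it for SQL queries.
//   val makeSIfTeslaUdf = udf(makeSIfTesla _)
//   sqlcont.udf.register("makeSIfTesla", makeSIfTesla _)
//   df1.registerTempTable("cars")
//   sqlcont.sql("SELECT year, makeSIfTesla(make) AS make, model FROM cars").show()

println(makeSIfTesla("Tesla"))  // S
```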
Here is my take on this one:
import org.apache.spark.sql.{Row, SQLContext}

val rdd = sc.parallelize(
  List((2012, "Tesla", "S"), (1997, "Ford", "E350"), (2015, "Chevy", "Volt"))
)

val sqlContext = new SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

val dataframe = rdd.toDF()
dataframe.foreach(println)
dataframe.map(row => {
  val row1 = row.getAs[String](1)
  val make = if (row1.toLowerCase == "tesla") "S" else row1
  Row(row(0), make, row(2))
}).collect().foreach(println)
//[2012,S,S]
//[1997,Ford,E350]
//[2015,Chevy,Volt]
You can actually use map directly on the DataFrame.

So you basically check column 1 for the String tesla. If it's tesla, use the value S for make; otherwise keep the current value of column 1.

Then build a tuple with all the data from the row using the (zero-based) indexes (Row(row(0), make, row(2)) in my example).
There is probably a better way to do it. I am not that familiar yet with the Spark umbrella.
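Worth noting: in Spark 1.x, map on a DataFrame yields an RDD[Row], so the result above is no longer a DataFrame. A sketch of re-attaching the original schema, plus the per-row rule extracted as a plain, testable function over the (year, make, model) tuples used above:

```scala
// Spark side (sketch, not compiled here): rebuild a DataFrame afterwards.
//   import org.apache.spark.sql.Row
//   val fixedRdd = dataframe.map { row =>
//     val make = if (row.getAs[String](1).toLowerCase == "tesla") "S"
//                else row.getAs[String](1)
//     Row(row(0), make, row(2))
//   }
//   val fixedDf = sqlContext.createDataFrame(fixedRdd, dataframe.schema)

// The per-row rule in plain Scala.
def fixRow(row: (Int, String, String)): (Int, String, String) =
  if (row._2.toLowerCase == "tesla") (row._1, "S", row._3) else row

println(fixRow((2012, "Tesla", "S")))  // (2012,S,S)
```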
df2.na.replace("Name", Map("John" -> "Akshay", "Cindy" -> "Jayita")).show()
The signature of replace in class DataFrameNaFunctions: def replace[T](col: String, replacement: Map[T, T]): org.apache.spark.sql.DataFrame
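Applied to the example from the question: na.replace does an exact-match lookup per value in the named column and leaves everything else untouched. A sketch, with the lookup semantics modeled in plain Scala:

```scala
// Spark side (sketch): replace Tesla with S in the make column only.
//   val fixed = dataframe.na.replace("make", Map("Tesla" -> "S"))

// The lookup semantics: exact matches are swapped, everything else is kept.
def naReplace[T](value: T, replacement: Map[T, T]): T =
  replacement.getOrElse(value, value)

println(Seq("Tesla", "Ford", "Chevy").map(v => naReplace(v, Map("Tesla" -> "S"))))
// List(S, Ford, Chevy)
```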
To run this function you must have an active Spark object and a DataFrame with headers enabled.
import org.apache.spark.sql.functions._
val base_optin_email = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .schema(schema_base_optin)
  .csv(file_optin_email)
  .where("CPF IS NOT NULL")
  .withColumn("CARD_KEY", lit(translate(translate(col("cpf"), ".", ""), "-", "")))
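For reference, translate replaces each character found in its second argument with the character at the same position in the third; characters without a counterpart are deleted, so an empty replacement string strips them. The nested calls above therefore remove both "." and "-" from a CPF. The equivalent string logic in plain Scala:

```scala
// Plain-Scala equivalent of translate(translate(col("cpf"), ".", ""), "-", ""):
// drop every '.' and '-' from the string.
def stripCpf(cpf: String): String =
  cpf.filterNot(c => c == '.' || c == '-')

println(stripCpf("123.456.789-00"))  // 12345678900
```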