Update Column in Spark Scala
Input:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
|  1| G K|   0 |
|  2| L_L|   1 |
|  3|null|   1 |
+---+----+-----+
Desired output:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
|  1|  GK|   0 |
|  2|  LL|   1 |
|  3|null|   1 |
+---+----+-----+
I just want to update dept with new values that have the underscores and spaces removed. Is that possible?
scala> val inputDf = Seq((1,"G K","0 "), (2,"L_L","1"), (3,null," 1")).toDF("sno","dept","color")
inputDf: org.apache.spark.sql.DataFrame = [sno: int, dept: string ... 1 more field]
scala> inputDf.show
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| G K| 0 |
| 2| L_L| 1|
| 3|null| 1|
+---+----+-----+
Q: I just want to update dept with new values that have the underscores and spaces removed. Is this possible?
Yes...
import org.apache.spark.sql.functions._

inputDf.withColumn("dept", regexp_replace('dept, "_", ""))  // replace underscores with an empty string
  .withColumn("dept", regexp_replace('dept, " ", ""))       // replace spaces with an empty string
  .withColumn("color", trim('color))                        // trim color, which has extra spaces
  .show
Result:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| GK| 0|
| 2| LL| 1|
| 3|null| 1|
+---+----+-----+
Or
A smarter way:
1) Use \s|_ if only spaces and underscores should be removed.
2) To also drop any non-alphanumeric character (the underscore included), use the regex \W|_
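Spark's regexp_replace follows Java regex semantics, so the difference between the two patterns can be checked with plain String.replaceAll on a sample value (the string "G K_L-L" below is just an illustrative input):

```scala
val dept = "G K_L-L"

// \s|_ removes only whitespace and underscores; the hyphen survives
println(dept.replaceAll("""\s|_""", ""))  // GKL-L

// \W|_ removes every non-alphanumeric character, underscore included
println(dept.replaceAll("""\W|_""", ""))  // GKLL
```

Note that \W alone does not match the underscore (it counts as a word character in Java regex), which is why the pattern needs the explicit |_ alternative.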
val inputDf = Seq((1, "G K", "0 "), (2, "L_L", "1"), (3, null, "1")).toDF("sno", "dept", "color")
inputDf.show
inputDf.withColumn("dept", regexp_replace('dept, """\s|_""", "")).show
Result:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| GK| 0 |
| 2| L_L| 1|
| 3|null| 1|
+---+----+-----+
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| GK| 0 |
| 2| LL| 1|
| 3|null| 1|
+---+----+-----+
I hope this is exactly what you are looking for.
You can use the built-in regexp_replace and trim functions (not UDFs) for this, as shown below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SampleDF {
  def main(args: Array[String]): Unit = {
    // standard local SparkSession (the original used a project-specific
    // Constant.getSparkSess helper)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SampleDF")
      .getOrCreate()
    import spark.implicits._

    val inputDf = Seq((1, "G K", "0 "),
      (2, "L-L", "1"),
      (3, null, " 1")).toDF("sno", "dept", "color")

    inputDf
      .withColumn("dept", regexp_replace($"dept", " |-", ""))  // drop spaces and hyphens
      .withColumn("color", trim($"color"))                     // trim surrounding whitespace
      .show()
  }
}
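As a side note, when the characters to strip are fixed single characters (here underscore and space), Spark's built-in translate function can also be used: any character in the matching string that has no counterpart in the replacement string is simply deleted. A minimal sketch, reusing the inputDf from above:

```scala
import org.apache.spark.sql.functions.translate

// "_ " lists the characters to map; the empty replacement deletes them all
val cleaned = inputDf.withColumn("dept", translate($"dept", "_ ", ""))
cleaned.show()
```

Unlike regexp_replace this avoids regex interpretation entirely, which can be simpler when no pattern matching is needed.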