How to replace string values in one column with actual column values from other columns in the same dataframe? Part 2
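I have some string values in one column. I want to replace substrings in that column with values from other columns, and replace every plus sign with a space (as shown below).
I have these List[String] mappings, mapFrom and mapTo, which are passed in dynamically and correspond to each other by index.
Description values: mapFrom: ["Child", "ChildAge", "ChildState"]
Column names: mapTo: ["name", "age", "state"]
Input example: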
name, age, state, description
tiffany, 10, virginia, Child + ChildAge + ChildState
andrew, 11, california, ChildState + Child + ChildAge
tyler, 12, ohio, ChildAge + ChildState + Child
Expected result:
name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler
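How can I do this with Spark Scala?
When I try the solution from the earlier question, "How to replace string values in one column with actual column values from other columns in the same dataframe?", the output becomes: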
name, age, state, description
tiffany, 10, virginia, tiffany tiffanyAge tiffanyState
andrew, 11, california, andrewState andrew andrewAge
tyler, 12, ohio, tylerAge tylerState tyler
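I would use map rather than the built-in Spark functions.
It is not the cleanest solution, but it works. Note that this example uses ChildName as the name placeholder rather than the question's Child.
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark

val data = Seq(
  ("tiffany", 10, "virginia", "ChildName + ChildAge + ChildState"),
  ("andrew", 11, "california", "ChildState + ChildName + ChildAge"),
  ("tyler", 12, "ohio", "ChildAge + ChildState + ChildName")
).toDF("name", "age", "state", "description")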
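Define the schema used for the encoder conversion:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("state", StringType),
  StructField("description", StringType)
))
// Encoder for the Row values returned from map
val encoder = RowEncoder(schema)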
The logic itself:
import org.apache.spark.sql.Row

val res = data.map { row =>
  // Strip whitespace, then split "A+B+C" into its placeholder tokens.
  val tokens = row.getAs[String]("description").replaceAll("\\s+", "").split("\\+")
  val sb = new StringBuilder()
  // Append the matching column value plus a space for each token, in order.
  // An unknown token throws a MatchError, as there is no default case.
  tokens.foreach {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge"   => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName"  => sb.append(row.getAs[String]("name")).append(" ")
  }
  Row(row.getAs[String]("name"), row.getAs[Int]("age"), row.getAs[String]("state"), sb.toString())
}(encoder)
Result:
res.show(false)
+-------+---+----------+---------------------+
|name   |age|state     |description          |
+-------+---+----------+---------------------+
|tiffany|10 |virginia  |tiffany 10 virginia  |
|andrew |11 |california|california andrew 11 |
|tyler  |12 |ohio      |12 ohio tyler        |
+-------+---+----------+---------------------+
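The same token-by-token idea can be driven by the question's dynamic mapFrom and mapTo lists instead of hard-coded cases. The following is only a minimal sketch in plain Scala (no Spark), and the object name DescriptionTokens, the method render, and the rowValues map are hypothetical names introduced for illustration. Because each token is looked up as a whole, Child can never match inside ChildAge.

```scala
object DescriptionTokens {
  // Hypothetical helper: resolves each "+"-separated token through a
  // placeholder -> column-name lookup built from mapFrom and mapTo.
  def render(description: String,
             mapFrom: Seq[String],
             mapTo: Seq[String],
             rowValues: Map[String, String]): String = {
    // mapFrom and mapTo correspond by index, so zipping pairs them up.
    val placeholderToColumn = mapFrom.zip(mapTo).toMap
    description
      .split("\\+")
      .map(_.trim)
      .map { token =>
        // Unknown tokens are kept as-is instead of throwing.
        placeholderToColumn.get(token).flatMap(rowValues.get).getOrElse(token)
      }
      .mkString(" ")
  }
}
```

Inside Dataset.map, the rowValues map could presumably be built from the Row itself, for example with row.getValuesMap, with the values converted to strings.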
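The problem here is that the description contains Child, which is a substring (in fact a prefix) of both ChildAge and ChildState. Because the replacement is done with regular expressions, the Child part of those longer placeholders is also replaced by the name, which produces odd output such as tiffanyAge and tiffanyState.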
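The effect can be reproduced without Spark. The sketch below applies the replacements one after another with String.replaceAll, which is an assumption standing in for the regexp_replace fold; the names PrefixProblem and replaceInOrder are hypothetical.

```scala
object PrefixProblem {
  // Applies each (pattern, value) replacement in list order.
  def replaceInOrder(description: String, replacements: Seq[(String, String)]): String =
    replacements.foldLeft(description) { case (text, (pattern, value)) =>
      text.replaceAll(pattern, value)
    }

  // Values of the first input row, in the question's mapFrom order,
  // followed by the " + " separator rule.
  val firstRow: Seq[(String, String)] = Seq(
    "Child" -> "tiffany",
    "ChildAge" -> "10",
    "ChildState" -> "virginia",
    " \\+ " -> " "
  )

  // "Child" is replaced first, so "ChildAge" and "ChildState" no longer exist
  // by the time their own rules run.
  val result: String = replaceInOrder("Child + ChildAge + ChildState", firstRow)
  // result == "tiffany tiffanyAge tiffanyState"
}
```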
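In this case there are two simple solutions that do not require changing the input:
1. Change the regular expression for Child to use a lookahead:
val mapFrom = List("Child(?= |$)", "ChildAge", "ChildState") :+ " \\+ "
Child is then matched only when it is followed by a space or by the end of the string. A lookahead for a space alone, Child(?= ), would miss a Child that ends the description, as in the third input row.
2. Put Child last in the list, so that ChildAge and ChildState are matched first. The column names in mapTo must be reordered to match (age, state, name):
val mapFrom = List("ChildAge", "ChildState", "Child") :+ " \\+ "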
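Since the lists arrive dynamically, the reordering in the second option could be done in code rather than by hand. This is a sketch under the assumption that the placeholders are literal strings (not regular expressions), so sorting by length puts every placeholder before any of its prefixes. The names LongestFirst, orderPairs and replaceLiteral are hypothetical.

```scala
object LongestFirst {
  // Sort the zipped pairs so longer placeholders come first; zipping before
  // sorting keeps each placeholder aligned with its column name.
  def orderPairs(mapFrom: Seq[String], mapTo: Seq[String]): Seq[(String, String)] =
    mapFrom.zip(mapTo).sortBy { case (from, _) => -from.length }

  // Literal (non-regex) replacement of each placeholder by its row value,
  // followed by turning " + " separators into single spaces.
  def replaceLiteral(description: String,
                     pairs: Seq[(String, String)],
                     rowValues: Map[String, String]): String =
    pairs
      .foldLeft(description) { case (text, (from, column)) =>
        text.replace(from, rowValues(column))
      }
      .replace(" + ", " ")
}
```

For the question's lists, orderPairs yields ChildState, ChildAge, Child, each still paired with state, age and name respectively.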
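Complete solution for the first option:
import org.apache.spark.sql.functions.{col, lit, regexp_replace}
import spark.implicits._  // for the $"..." syntax; assumes a SparkSession named spark

// The lookahead accepts a space or the end of the string, so a trailing Child also matches.
val mapFrom = List("Child(?= |$)", "ChildAge", "ChildState") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)

// df is the input dataframe; each step rewrites the description column once.
val df2 = mapToFrom.foldLeft(df) { case (acc, (from, to)) =>
  acc.withColumn("description", regexp_replace($"description", lit(from), to))
}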
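The lookahead's behaviour can be checked outside Spark with plain Java regular expressions, which, as far as I know, is the regex flavour regexp_replace uses. The sketch below compares a space-only lookahead with one that also accepts the end of the string, using the third input row; the object name LookaheadCheck is hypothetical.

```scala
object LookaheadCheck {
  val description = "ChildAge + ChildState + Child"

  // Lookahead for a space only: the trailing "Child" has nothing after it,
  // and the other two are followed by "A" and "S", so nothing is replaced.
  val spaceOnly: String = description.replaceAll("Child(?= )", "tyler")

  // Lookahead for a space or the end of the string: only the standalone,
  // trailing "Child" is replaced.
  val spaceOrEnd: String = description.replaceAll("Child(?= |$)", "tyler")
}
```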