
How to replace string values in one column with actual column values from other columns in the same dataframe? Part 2

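I have some string values in one column, and I would like to replace substrings in that column with values from other columns, and replace all the plus signs with spaces (as shown below).

I have these List[String] mappings, which are passed in dynamically; mapFrom and mapTo should correspond by index.

Description values: mapFrom: ["Child", "ChildAge", "ChildState"]

Column names: mapTo: ["name", "age", "state"]

Input example: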

name, age, state, description
tiffany, 10, virginia, Child + ChildAge + ChildState
andrew, 11, california, ChildState + Child + ChildAge
tyler, 12, ohio, ChildAge + ChildState + Child

Expected result:

name, age, state, description
tiffany, 10, virginia, tiffany 10 virginia
andrew, 11, california, california andrew 11
tyler, 12, ohio, 12 ohio tyler

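How can I do this using Spark Scala?

When I try the solution from here: How to replace string values in one column with actual column values from other columns in the same dataframe?

the output becomes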

name, age, state, description
tiffany, 10, virginia, tiffany tiffanyAge tiffanyState
andrew, 11, california, andrewState andrew andrewAge
tyler, 12, ohio, tylerAge tylerState tyler

I would use map rather than the built-in Spark functions.
It is not the cleanest solution, but it works.

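import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark is in scope

// Note: this sample uses ChildName (not Child) as the placeholder for the name column.
val data = Seq(
  ("tiffany", 10, "virginia", "ChildName + ChildAge + ChildState"),
  ("andrew", 11, "california", "ChildState + ChildName + ChildAge"),
  ("tyler", 12, "ohio", "ChildAge + ChildState + ChildName")
).toDF("name", "age", "state", "description")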

Define the schema for the encoder:

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("state", StringType),
  StructField("description", StringType)
))
val encoder = RowEncoder(schema)

The logic itself:

val res = data.map(row => {
  val desc = row.getAs[String]("description").replaceAll("\\s+", "").split("\\+")
  val sb = new StringBuilder()
  val map = desc.zipWithIndex.toMap.map(_.swap)

  map(0) match {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  map(1) match {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  map(2) match {
    case "ChildState" => sb.append(row.getAs[String]("state")).append(" ")
    case "ChildAge" => sb.append(row.getAs[Int]("age")).append(" ")
    case "ChildName" => sb.append(row.getAs[String]("name")).append(" ")
  }

  Row(row.getAs[String]("name"), row.getAs[Int]("age"), row.getAs[String]("state"), sb.toString())
}) (encoder)

Result:

res.show(false)
+-------+---+----------+---------------------+
|name   |age|state     |description          | 
+-------+---+----------+---------------------+
|tiffany|10 |virginia  |tiffany 10 virginia  |
|andrew |11 |california|california andrew 11 |
|tyler  |12 |ohio      |12 ohio tyler        |
+-------+---+----------+---------------------+
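The code above hard-codes three positions and the placeholder names. As a rough sketch of how the same token-by-token idea could be driven by the dynamic mapFrom and mapTo lists instead, the following plain-Scala function splits the description on the plus signs and looks each token up as a whole, so Child never matches inside ChildAge. The names TokenLookup and render, and the row parameter (a column-name to value map standing in for a Spark Row), are hypothetical and only for illustration.

```scala
object TokenLookup {
  // mapFrom and mapTo are index-aligned; row maps column name -> value as a string
  // (a hypothetical stand-in for a Spark Row).
  def render(
      description: String,
      mapFrom: List[String],
      mapTo: List[String],
      row: Map[String, String]
  ): String = {
    // placeholder -> column name, e.g. "ChildAge" -> "age"
    val columnFor: Map[String, String] = mapFrom.zip(mapTo).toMap

    description
      .split("\\+")        // split on the literal plus signs
      .map(_.trim)          // drop the spaces around each token
      .map { token =>
        // replace known placeholders; leave unknown tokens unchanged
        columnFor.get(token).map(column => row(column)).getOrElse(token)
      }
      .mkString(" ")        // join with single spaces, no trailing space
  }

  def main(args: Array[String]): Unit = {
    val row = Map("name" -> "tyler", "age" -> "12", "state" -> "ohio")
    println(render(
      "ChildAge + ChildState + Child",
      List("Child", "ChildAge", "ChildState"),
      List("name", "age", "state"),
      row
    )) // prints "12 ohio tyler"
  }
}
```

Inside data.map, the row map would have to be built from the Spark Row first; that wiring is left out here to keep the sketch runnable without Spark.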

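The problem here is that the descriptions contain Child, which is a prefix of both ChildAge and ChildState. Because the replacements are applied as regular expressions in list order, the Child part is replaced by the name first, which produces odd output such as tiffanyAge and tiffanyState (note that only the Child part has been replaced by the name).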
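The effect can be reproduced without Spark. The sketch below applies the same replacements in list order to a plain string; the object name PrefixProblemDemo and the hard-coded values for the first sample row are assumptions made for illustration only.

```scala
object PrefixProblemDemo {
  // Placeholders in the order given in the question.
  val mapFrom: List[String] = List("Child", "ChildAge", "ChildState")
  // Values of name, age, state for the first sample row (hard-coded for illustration).
  val rowValues: List[String] = List("tiffany", "10", "virginia")

  // Apply each placeholder -> value replacement in list order,
  // then turn " + " into a single space.
  def substitute(description: String): String = {
    val replaced = mapFrom.zip(rowValues).foldLeft(description) {
      case (acc, (from, to)) => acc.replaceAll(from, to)
    }
    replaced.replaceAll(" \\+ ", " ")
  }

  def main(args: Array[String]): Unit = {
    // Child is replaced first, so ChildAge and ChildState no longer exist
    // by the time their own replacements run.
    println(substitute("Child + ChildAge + ChildState"))
    // prints "tiffany tiffanyAge tiffanyState"
  }
}
```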

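In this case there are two simple solutions that do not require changing the input:

  1. Change the regular expression for Child to use a lookahead:

     val mapFrom = List("Child(?= |$)", "ChildAge", "ChildState") :+ " \\+ "

    Child is then matched only when it is followed by a space or by the end of the string. The end-of-string alternative is needed for descriptions that end in Child, such as the third sample row.

  2. Put Child last in the list, so that ChildAge and ChildState are matched first:

     val mapFrom = List("ChildAge", "ChildState", "Child") :+ " \\+ "

    mapTo has to be reordered in the same way so that the two lists still line up by index.

Full solution for the first option:

import org.apache.spark.sql.functions.{col, lit, regexp_replace}
import spark.implicits._ // for $"..."; assumes a SparkSession named spark is in scope

// df is the input DataFrame from the question
val mapFrom = List("Child(?= |$)", "ChildAge", "ChildState") :+ " \\+ "
val mapTo = List("name", "age", "state").map(col) :+ lit(" ")
val mapToFrom = mapFrom.zip(mapTo)

val df2 = mapToFrom.foldLeft(df){case (df, (from, to)) => 
  df.withColumn("description", regexp_replace($"description", lit(from), to))
}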
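Since mapFrom and mapTo arrive dynamically, the second option (shorter placeholders last) could also be applied automatically instead of reordering by hand. The following is a minimal plain-Scala sketch of that idea, not part of the original answer: it sorts the zipped pairs by placeholder length, longest first, which works because a placeholder that is a substring of another is always the shorter one. The names LongestFirstDemo, orderLongestFirst, and substitute, and the row map standing in for a Spark Row, are hypothetical.

```scala
object LongestFirstDemo {
  // Pair each placeholder with its column name, then put longer placeholders first
  // so that "ChildAge" and "ChildState" are handled before "Child".
  def orderLongestFirst(mapFrom: List[String], mapTo: List[String]): List[(String, String)] =
    mapFrom.zip(mapTo).sortBy { case (from, _) => -from.length }

  // Apply the ordered replacements literally (no regex) to one description.
  // row maps column name -> value as a string (a stand-in for a Spark Row).
  def substitute(description: String, pairs: List[(String, String)], row: Map[String, String]): String = {
    val replaced = pairs.foldLeft(description) {
      case (acc, (from, column)) => acc.replace(from, row(column))
    }
    replaced.replace(" + ", " ")
  }

  def main(args: Array[String]): Unit = {
    val pairs = orderLongestFirst(
      List("Child", "ChildAge", "ChildState"),
      List("name", "age", "state")
    )
    val row = Map("name" -> "andrew", "age" -> "11", "state" -> "california")
    println(substitute("ChildState + Child + ChildAge", pairs, row))
    // prints "california andrew 11"
  }
}
```

In the Spark version, the same ordered pairs could be fed to the foldLeft in place of mapToFrom. One caveat: sequential replacement can still misfire if a substituted value itself happens to contain a placeholder string.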
