
Add derived column (as array of struct) based on values and ordering of other columns in Spark Scala dataframe

I have a Scala Spark dataframe with four string columns - P, Q, R, S - and an integer primary key column called PK.

Each of these 4 columns may have null values. The left-to-right ordering of the columns reflects their importance/relevance and needs to be preserved. The structure of the base dataframe stays the same as shown.

I want the final output to be as follows:

root
 |-- PK: integer (nullable = true)
 |-- P: string (nullable = true)
 |-- Q: string (nullable = true)
 |-- R: string (nullable = true)
 |-- S: string (nullable = true)
 |-- categoryList: array (nullable = true)
 |    |-- myStruct: struct (nullable = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- relevance: integer (nullable = true)

I need to create a new column derived from the 4 columns P, Q, R, S based on the following algorithm:

  1. For each row, take the values of the four columns P, Q, R, S and check whether each value exists as a key in the map "mapM".
  2. If the value exists, the "category" in the struct is the corresponding value from mapM. If the value does not exist in mapM (or is null), the category shall be null.
  3. The "relevance" in the struct shall be the left-to-right position of the column: P -> 1, Q -> 2, R -> 3, S -> 4.
  4. The array formed by these four structs is then added as a new column on the dataframe, as sketched below.
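
For illustration, here is a plain-Scala sketch of these steps for a single row; the map and the row values are the ones used in the example further down:

  // map M and one row's values for P, Q, R, S, in column order
  val m = Map("a" -> "a", "c" -> "c")
  val row = Seq("a", "b", "c", null)

  // steps 1-3 per column: look up the category (null when absent) and attach the position
  val categoryList = row.zipWithIndex.map { case (value, idx) =>
    (m.get(value).orNull, idx + 1)
  }
  // categoryList: List((a,1), (null,2), (c,3), (null,4))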

I'm new to Scala, and here is what I have so far:

case class relevanceCaseClass(category: String, relevance: Integer)
def myUdf = udf((code: String, relevance: Integer) => relevanceCaseClass(mapM.value.getOrElse(code, null), relevance))
df.withColumn("newColumn", myUdf(col("P/Q/R/S"), 1))

The problem with this is that I cannot pass the ordering value inside the withColumn call; I need a way to let myUdf know the relevance for each column. Am I doing something fundamentally wrong?
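
For what it's worth, one immediate issue with the snippet above is that every argument to a udf call must be a Column, so a bare 1 would at least need to be wrapped in lit (the column name newColumnP below is just for illustration); even then, this only covers one column at a time:

  import org.apache.spark.sql.functions.{col, lit, udf}

  // lit(1) turns the constant into a Column the udf can accept,
  // but this still builds the struct for a single source column only
  df.withColumn("newColumnP", myUdf(col("P"), lit(1)))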

Following the algorithm above, I should get this output:

PK   P    Q    R    S    newCol
1    a    b    c    null array(struct("a", 1), struct(null, 2), struct("c", 3), struct(null, 4))

Here, the value "b" was not found in the map, so its category is null. Since the value for column S was already null, its category also stays null. The relevance follows the left-to-right column ordering.

Given an input dataframe (for testing, as given in the OP) as

+---+---+---+---+----+
|PK |P  |Q  |R  |S   |
+---+---+---+---+----+
|1  |a  |b  |c  |null|
+---+---+---+---+----+

root
 |-- PK: integer (nullable = false)
 |-- P: string (nullable = true)
 |-- Q: string (nullable = true)
 |-- R: string (nullable = true)
 |-- S: null (nullable = true)

and a broadcasted Map as

val mapM = spark.sparkContext.broadcast(Map("a" -> "a", "c" -> "c"))
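
(For completeness, a test dataframe matching the one shown above can be created roughly like this; the uncast lit(null) is what makes S show up as a null-typed column in the schema:)

  import spark.implicits._
  import org.apache.spark.sql.functions.lit

  // one test row; add .cast("string") to lit(null) if S should be a proper string column
  val df = Seq((1, "a", "b", "c")).toDF("PK", "P", "Q", "R")
    .withColumn("S", lit(null))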

You can then define the udf function and call it as below:

def myUdf = udf((pqrs: Seq[String]) => pqrs.zipWithIndex.map { case (code, idx) =>
  relevanceCaseClass(mapM.value.getOrElse(code, null), idx + 1)  // null category when the code is not in the map
})
val finaldf = df.withColumn("newColumn", myUdf(array(col("P"), col("Q"), col("R"), col("S"))))

with the case class as in the OP:

case class relevanceCaseClass(category: String, relevance: Integer)

which should give you your desired output, i.e. finaldf would be

+---+---+---+---+----+--------------------------------------+
|PK |P  |Q  |R  |S   |newColumn                             |
+---+---+---+---+----+--------------------------------------+
|1  |a  |b  |c  |null|[[a, 1], [null, 2], [c, 3], [null, 4]]|
+---+---+---+---+----+--------------------------------------+

root
 |-- PK: integer (nullable = false)
 |-- P: string (nullable = true)
 |-- Q: string (nullable = true)
 |-- R: string (nullable = true)
 |-- S: null (nullable = true)
 |-- newColumn: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- relevance: integer (nullable = true)

I hope the answer is helpful.

You can also pass multiple columns to the udf, as in the following example code:

  case class Relevance(category: String, relevance: Integer)

  // relevance follows the left-to-right column order: P -> 1, Q -> 2, R -> 3, S -> 4
  def myUdf = udf((p: String, q: String, r: String, s: String) => Seq(
    Relevance(mapM.value.getOrElse(p, null), 1),
    Relevance(mapM.value.getOrElse(q, null), 2),
    Relevance(mapM.value.getOrElse(r, null), 3),
    Relevance(mapM.value.getOrElse(s, null), 4)
  ))

  df.withColumn("newColumn", myUdf(df("P"), df("Q"), df("R"), df("S")))
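
If you would rather avoid a udf altogether, the same array of structs can also be built with Spark's built-in functions. A rough sketch, assuming Spark 2.4+ (for element_at on a map literal) and the default non-ANSI SQL settings, where a missing key simply yields null:

  import org.apache.spark.sql.functions.{array, col, element_at, lit, struct, typedLit}

  // the lookup map as a literal map column instead of a broadcast variable
  val categoryMap = typedLit(Map("a" -> "a", "c" -> "c"))

  // one struct per source column, with relevance taken from the column position
  val structs = Seq("P", "Q", "R", "S").zipWithIndex.map { case (name, idx) =>
    struct(
      element_at(categoryMap, col(name)).as("category"),
      lit(idx + 1).as("relevance")
    )
  }

  df.withColumn("newColumn", array(structs: _*))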
