I have a Scala Spark dataframe with four columns (all string type) - P, Q, R, S - and a primary key (called PK) (integer type).
Each of these 4 columns may have null values. The left to right ordering of the columns is the importance/relevance of the column and needs to be preserved. The structure of the base dataframe stays the same as shown.
I want the final output to be as follows:
root
|-- PK: integer (nullable = true)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: string (nullable = true)
|-- categoryList: array (nullable = true)
| |-- myStruct: struct (nullable = true)
| | |-- category: string (nullable = true)
| | |-- relevance: integer (nullable = true)
I need to create a new column derived from the 4 columns P, Q, R, S using the following algorithm: each value is looked up in a lookup map (the category is null when the value is absent from the map or the column itself is null), and the relevance is the column's 1-based left-to-right position.
I'm new to Scala and here is what I have until now:
case class relevanceCaseClass(category: String, relevance: Integer)
def myUdf = udf((code: String, relevance: Integer) => relevanceCaseClass(mapM.value.getOrElse(code, null), relevance))
df.withColumn("newColumn", myUdf(col("P/Q/R/S"), 1))
The problem with this is that I cannot pass the ordering value into the call inside withColumn: myUdf needs to know the relevance (1 to 4) for each column. Am I doing something fundamentally wrong?
Thus I should get the output:
PK P Q R S newCol
1 a b c null array(struct("a", 1), struct(null, 2), struct("c", 3), struct(null, 4))
Here, the value "b" was not found in the map and hence the value (for category) is null. Since the value for column S was already null, it stayed null. The relevance is according to the left-right column ordering.
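To make the intended per-row mapping concrete, the logic can be sketched in plain Scala (no Spark; `categoryMap` here is a hypothetical stand-in for the lookup map):

```scala
case class Relevance(category: String, relevance: Int)

// Hypothetical stand-in for the lookup map.
val categoryMap = Map("a" -> "a", "c" -> "c")

// Each value is looked up in the map; a miss (or a null input) yields a
// null category. Relevance is the 1-based left-to-right position.
def toRelevances(values: Seq[String]): Seq[Relevance] =
  values.zipWithIndex.map { case (code, idx) =>
    Relevance(if (code == null) null else categoryMap.getOrElse(code, null), idx + 1)
  }

println(toRelevances(Seq("a", "b", "c", null)))
// List(Relevance(a,1), Relevance(null,2), Relevance(c,3), Relevance(null,4))
```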
Given an input dataframe (the test data from the OP) as
+---+---+---+---+----+
|PK |P |Q |R |S |
+---+---+---+---+----+
|1 |a |b |c |null|
+---+---+---+---+----+
root
|-- PK: integer (nullable = false)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: null (nullable = true)
and a broadcasted Map as
val mapM = spark.sparkContext.broadcast(Map("a" -> "a", "c" -> "c"))
You can define the udf function and call it as below:
def myUdf = udf((pqrs: Seq[String]) => pqrs.zipWithIndex.map { case (code, idx) =>
  relevanceCaseClass(mapM.value.getOrElse(code, null), idx + 1)
})
val finaldf = df.withColumn("newColumn", myUdf(array(col("P"), col("Q"), col("R"), col("S"))))
with the case class as in the OP
case class relevanceCaseClass(category: String, relevance: Integer)
which should give you the desired output, i.e. finaldf would be
+---+---+---+---+----+--------------------------------------+
|PK |P |Q |R |S |newColumn |
+---+---+---+---+----+--------------------------------------+
|1 |a |b |c |null|[[a, 1], [null, 2], [c, 3], [null, 4]]|
+---+---+---+---+----+--------------------------------------+
root
|-- PK: integer (nullable = false)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: null (nullable = true)
|-- newColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- category: string (nullable = true)
| | |-- relevance: integer (nullable = true)
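As a side note on the original attempt: `myUdf(col("P"), 1)` fails because every argument in a udf call must be a Column (a constant would have to be wrapped as `lit(1)`). Another option is to bake the constant into the function by currying; the shape of that, sketched in plain Scala without Spark (`categoryMap` is a hypothetical stand-in for the broadcast map), is:

```scala
case class Relevance(category: String, relevance: Int)

// Hypothetical stand-in for mapM.value.
val categoryMap = Map("a" -> "a", "c" -> "c")

// Currying: fix the relevance first, get back a one-argument function.
// In Spark this shape becomes: def myUdf(relevance: Int) = udf((code: String) => ...)
// called as myUdf(1)(col("P")).
def mkLookup(relevance: Int): String => Relevance =
  code => Relevance(categoryMap.getOrElse(code, null), relevance)

println(mkLookup(1)("a"))  // Relevance(a,1)
println(mkLookup(2)("b"))  // Relevance(null,2)
```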
I hope the answer is helpful.
You can also pass multiple columns to a udf, as in the following example; note that the argument order must match the left-to-right relevance order:
case class Relevance(category: String, relevance: Integer)

def myUdf = udf((p: String, q: String, r: String, s: String) => Seq(
  Relevance(mapM.value.getOrElse(p, null), 1),
  Relevance(mapM.value.getOrElse(q, null), 2),
  Relevance(mapM.value.getOrElse(r, null), 3),
  Relevance(mapM.value.getOrElse(s, null), 4)
))

df.withColumn("newColumn", myUdf(df("P"), df("Q"), df("R"), df("S")))