
Compare a list of values with a case class using Scala and Spark

I have a DataFrame like the one below.

+-------+------+-------+-------+
| num1  | num2 |   x   |   y   |
+-------+------+-------+-------+
|    25 |   10 | a&c   | i&j&k |
|    35 |   15 | a&b&d | i&k   |
+-------+------+-------+-------+

I have another DataFrame whose headers are:

num1, num2, a, b, c, d, i, j, k

I want to split the values in columns x and y on the "&" symbol, then check whether each split value matches one of the headers above, keeping the columns num1 and num2 as they are. If a value matches a header, that column should be filled with 1; otherwise it should be filled with 0.

The required output is:

+-------+------+---+---+---+---+---+---+---+
| num1  | num2 | a | b | c | d | i | j | k |
+-------+------+---+---+---+---+---+---+---+
|    25 |   10 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
|    35 |   15 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
+-------+------+---+---+---+---+---+---+---+

I achieved the above output as follows. I created another DataFrame, identical to the first except that x and y each contain an array of the split values:

+------+-------+---------+---------+
| num1 | num2  |    x    |    y    |
+------+-------+---------+---------+
|   25 |    10 | [a,c]   | [i,j,k] |
|   35 |    15 | [a,b,d] | [i,k]   |
+------+-------+---------+---------+

Then I followed the solution in this question.

Although it gives me the correct output, it becomes impractical when there are many columns like x and y.

So now I want to create a case class and match the header values against the data in the x and y columns by splitting them into lists. Is that possible, or is there another solution? Can someone help me?
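As a rough sketch of the case-class idea, here is the matching rule in plain Scala collections rather than on a Spark DataFrame. The names Record, HeaderFlags, headers and flagsFor are hypothetical and chosen only for illustration; the sample rows are the two rows from the table above.

```scala
// Hypothetical case class mirroring one input row (not part of any library).
final case class Record(num1: Int, num2: Int, x: String, y: String)

object HeaderFlags {
  // Target headers, in the order they should appear in the output.
  val headers: Seq[String] = Seq("a", "b", "c", "d", "i", "j", "k")

  // Split every "&"-separated field, then emit 1 for each header that is present, else 0.
  def flagsFor(r: Record): Seq[Int] = {
    val present: Set[String] = Seq(r.x, r.y).flatMap(_.split("&").toSeq).map(_.trim).toSet
    headers.map(h => if (present.contains(h)) 1 else 0)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Record(25, 10, "a&c", "i&j&k"), Record(35, 15, "a&b&d", "i&k"))
    // Print num1, num2 and the flags for each row.
    rows.foreach { r =>
      println((Seq(r.num1, r.num2) ++ flagsFor(r)).mkString(" | "))
    }
  }
}
```

This only illustrates the rule itself. On a large DataFrame, built-in column functions are usually preferable to mapping rows through a case class.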

After trying several methods, I finally came up with the following solution. I arrived at it by making a few changes to the answer to this question: Compare rows of an array column with the headers of another data frame using Scala and Spark. It also works for multiple array columns. Here is the code:

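    // Assumes an active SparkSession named `spark` (as in spark-shell).
    import org.apache.spark.sql.functions.{array_contains, col, split}
    import spark.implicits._

    val df = Seq((25, 10, "a&c", "i&j&k"), (35, 15, "a&b&d", "i&k"))
      .toDF("num1", "num2", "x", "y")

    // Split the "&"-separated strings into arrays.
    val dfProcessed = df.withColumn("x", split($"x", "&"))
      .withColumn("y", split($"y", "&"))
      .select("num1", "num2", "x", "y")

    val headers = Seq("a", "b", "c", "d", "i", "j", "k")

    // One column per header: 1 if the header appears in x or y, otherwise 0.
    // The cast turns the boolean result into the 1/0 values of the required output.
    val report = dfProcessed.select(Seq("num1", "num2").map(col) ++ headers.map(line =>
      (array_contains(col("x"), line) || array_contains(col("y"), line)).cast("int") as line): _*)

    report.show()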
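For the case raised in the question, where there are many columns like x and y, the same idea might be generalized by listing those columns once and combining the array_contains checks with reduce. This is only a sketch under some assumptions: the object MultiColumnFlags and the helper flags are hypothetical names, the SparkSession is a local one created just for the demonstration, and listCols must not be empty.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col, split}

object MultiColumnFlags {
  // Hypothetical helper: keyCols are copied through unchanged,
  // listCols hold "&"-separated strings (must be non-empty).
  def flags(df: DataFrame, keyCols: Seq[String], listCols: Seq[String],
            headers: Seq[String]): DataFrame = {
    // Turn each "&"-separated string column into an array column.
    val processed = listCols.foldLeft(df)((acc, c) => acc.withColumn(c, split(col(c), "&")))

    // A header gets 1 when any of the array columns contains it, otherwise 0.
    def flag(h: String): Column =
      listCols.map(c => array_contains(col(c), h)).reduce(_ || _).cast("int").as(h)

    processed.select(keyCols.map(col) ++ headers.map(flag): _*)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("flags").getOrCreate()
    import spark.implicits._

    val df = Seq((25, 10, "a&c", "i&j&k"), (35, 15, "a&b&d", "i&k"))
      .toDF("num1", "num2", "x", "y")

    flags(df, Seq("num1", "num2"), Seq("x", "y"),
      Seq("a", "b", "c", "d", "i", "j", "k")).show()

    spark.stop()
  }
}
```

Adding another "&"-separated column then only means appending its name to listCols; the select expression itself does not change.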

I think this may help you.
