I have a dataframe like below.
+-------+------+-------+-------+
| num1 | num2 | x | y |
+-------+------+-------+-------+
| 25 | 10 | a&c | i&j&k |
| 35 | 15 | a&b&d | i&k |
+-------+------+-------+-------+
I have another data frame structure with the headers like,
num1, num2, a, b, c, d, i, j, k
I want to split the column data of x and y from the symbol "&". Then check whether the split data are matching with the headers above, also considering the columns num1 and num2. If it so fill the values with 1 else with 0.
The required output is:
+-------+------+---+---+---+---+---+---+---+
| num1 | num2 | a | b | c | d | i | j | k |
+-------+------+---+---+---+---+---+---+---+
| 25 | 10 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 35 | 15 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
+-------+------+---+---+---+---+---+---+---+
I have achieved the above output in a method like following. I created another data frame same like the first data frame but the x and y contains with an array of split data like following.
+------+-------+---------+---------+
| num1 | num2 | x | y |
+------+-------+---------+---------+
| 25 | 10 | [a,c] | [i,j,k] |
| 35 | 15 | [a,b,d] | [i,k] |
+------+-------+---------+---------+
Then followed the solution in this question
Although it gives me the exact solution, it is ineffective when it comes to the case where there are lot of columns like x and y.
So now I want to create a case class and match the header values with the data in x,y columns by splitting them to a list. Is it possible or is there any other solution? Can someone help me?
After trying several methods at last I came up with the following solution. I found my solution by adding some few changes to the answer for this question: Compare rows of an array column with the headers of another data frame using Scala and Spark . It worked for multiple array columns also. This is the code for it.
val df = Seq((25, 10, "a&c", "i&j&k"), (35, 15, "a&b&d", "i&k")
.toDF("num1", "num2", "x", "y")
val dfProcessed = df.withColumn("x", split($"x", "&"))
.withColumn("y", split($"y", "&"))
.select("num1", "num2", "x", "y")
val headers = Seq("a", "b", "c", "d", "i", "j", "k")
val report = dfProcessed.select(Seq("num1", "num2").map(col) ++ headers.map(line => array_contains('x, line)
|| array_contains('y, line) as line) : _*)
report.show()
I think this may help you.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.