
Pivot a table with column values and use the column name as a filter to return new values in Spark

Say I have two tables. In the example below only two columns are changing, but I'm not sure whether pivot would work well for 10 columns. Table 1:

------------------------------
| id  | filtercol | inputid1 |
------------------------------
| 100 | 10        | 4        |
| 108 | 10        | 5        |
| 200 | 9         | 4        |
| 106 | 9         | 6        |
| 110 | 11        | 7        |
| 130 | 9         | 7        |
------------------------------

Table 2:

    ----------------------------------
    |a      |  b       | c       | d |
    ----------------------------------
    |"hello"| 1        | 4       | 6 |
    |"world"| 2        | 5       | 6 |
    |"test" | 3        | 4       | 7 |
    ----------------------------------

I want the final table to be

    ----------------------------------
    |a      |  b       | 10      | 11|
    ----------------------------------
    |"hello"| 1        | 100     |   |
    |"world"| 2        | 108     |   |
    |"test" | 3        | 100     |110|
    ----------------------------------

So column c will be renamed to 10 and column d will be renamed to 11.

Then use 10 as the filter on table 1's filtercol column, and use the values in columns c and d as lookup values against column inputid1. Whatever match is found, we replace the table 2 value with the value of id from table 1.

For example, the first row of the new table has 100 in column 10 because we took the original value in this column for this row, which was 4, used it as the lookup against column inputid1, applied the new column name 10 as the filter on filtercol, and got id 100, so 4 is replaced with 100 in this column.

The reason null is returned in column 11 is that when 6 (the original d value) is used as the lookup with 11 as the filtercol, no rows are returned.
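To make the lookup concrete, here is a minimal sketch of the single-column case for c (only an illustration; the DataFrame names table1_df and table2_df are assumed, matching the snippet further down):

from pyspark.sql.functions import col

# table 1 rows usable for the new "10" column: filtercol == 10,
# keyed by inputid1 so table 2's c value can be looked up
lookup_10 = (
    table1_df.filter(col("filtercol") == 10)
             .select(col("inputid1").alias("c"), col("id").alias("10"))
)

# left join keeps every table 2 row; c = 4 finds id 100, while a value with
# no match (like d = 6 against filtercol 11) stays null
with_10 = table2_df.join(lookup_10, on="c", how="left").drop("c")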

I was thinking of possibly joining and filtering, but that does not seem like a good solution since, say, I also have columns e, f, g, h, i, j to check (see the loop sketch after the snippet below).

from pyspark.sql.functions import col

# rename the lookup columns to the filter values they correspond to
df2 = df.withColumnRenamed("c", "10").withColumnRenamed("d", "11")

# left-join table 1 on the lookup value in the renamed column,
# then keep only rows where filtercol matches the new column name
table3df = df1.join(df2, df1.inputid1 == df2["10"], how="left")
table3df = table3df.filter(col("filtercol") == 10)
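
To avoid repeating that rename-join-filter block for every column, one option is to loop over a mapping from the original column name to its filter value. This is only a sketch under the same naming assumptions as above (df1 holds table 1, df holds table 2), and the mapping {"c": 10, "d": 11} is assumed to be known up front:

from pyspark.sql.functions import col

# assumed mapping: original table 2 column -> filtercol value it becomes
col_to_filter = {"c": 10, "d": 11}

result = df
for src_col, filter_val in col_to_filter.items():
    # table 1 rows for this filter value, keyed by the lookup value
    lookup = (
        df1.filter(col("filtercol") == filter_val)
           .select(col("inputid1").alias(src_col),
                   col("id").alias(str(filter_val)))
    )
    # replace the original column with the looked-up id (null when no match)
    result = result.join(lookup, on=src_col, how="left").drop(src_col)

result.show()

Note that if table 1 can hold several rows with the same (filtercol, inputid1) pair, each join above duplicates rows, so you would still need dropDuplicates on the lookup or an aggregation afterwards.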

I was playing with your example a bit and have not fully implemented it yet. You did not mention what to do when there are multiple matches for a value in column c. I resolved it with max, which gave me a different answer than what you were expecting.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

table1_df = spark.sql("""
SELECT 100 as id, 10 as filtercol, 4 as inputid1 
UNION ALL
SELECT 108, 10, 5 
UNION ALL
SELECT 200, 9, 4 
UNION ALL
SELECT 106, 9, 6 
UNION ALL
SELECT 110, 11, 7 
UNION ALL
SELECT 130, 9, 7
""").alias("table1")

table2_df = spark.sql("""
SELECT 'hello' as a, 1 as b, 4 as c, 6 as d 
UNION ALL
SELECT 'world', 2, 5, 6 
UNION ALL
SELECT 'test', 3, 4, 7
""").alias("table2")

j = (
    table2_df
    .join(table1_df.alias("join_c"), col("table2.c") == col("join_c.inputid1"))
    .join(table1_df.alias("join_d"), col("table2.d") == col("join_d.inputid1"))
)

j.show()

j.select(
    "table2.a",
    "table2.b",
    when(col("join_c.filtercol") == "10", col("join_c.id")).alias("10"),
    when(col("join_d.filtercol") == "11", col("join_c.id")).alias("11")
).groupby("a", "b").max().show()


+-----+---+------+-------+-------+
|    a|  b|max(b)|max(10)|max(11)|
+-----+---+------+-------+-------+
|hello|  1|     1|    100|   null|
|world|  2|     2|    108|   null|
| test|  3|     3|    100|    200|
+-----+---+------+-------+-------+
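
An alternative that scales to many columns and handles duplicate matches explicitly is to unpivot the lookup columns with stack, join table 1 once on (filtercol, inputid1), and pivot back. This is only a sketch, not part of the original answer; it reuses table1_df and table2_df from above and again assumes the column-to-filter mapping {"c": 10, "d": 11}:

from pyspark.sql.functions import expr, first

# assumed mapping: table 2 column -> filtercol value it should be matched against
col_to_filter = {"c": 10, "d": 11}

# unpivot (a, b, c, d) into long form: one row per (a, b, target_filter, lookup_value)
pairs = ", ".join(f"{v}, {k}" for k, v in col_to_filter.items())   # "10, c, 11, d"
stack_expr = f"stack({len(col_to_filter)}, {pairs}) as (target_filter, lookup_value)"
long_df = table2_df.select("a", "b", expr(stack_expr))

# join table 1 on both the filter value and the lookup value
joined = long_df.join(
    table1_df,
    (long_df.target_filter == table1_df.filtercol)
    & (long_df.lookup_value == table1_df.inputid1),
    how="left",
)

# pivot back to one column per filter value; first() picks a single id
# if several table 1 rows happen to match
result = (
    joined.groupBy("a", "b")
    .pivot("target_filter", list(col_to_filter.values()))
    .agg(first("id"))
)
result.show()

For the sample data above this should give 100, 108, 100 in column 10 and null, null, 110 in column 11, matching the table asked for in the question.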

