I have a data set that is a supervisory hierarchy. The first two columns are id and name, and the following columns are level 1, level 2, level 3, etc. In each level column the value is a number that corresponds to the id column.
id   name       level 1  level 2  level 3
11   sup org 1  222      333      444
222  sup org 2  11       222      333
333  sup org 3  11       222      333
456  sup org 4  222      444      333
What I'm looking for is:
id   name       level 1    level 2    level 3
11   sup org 1  sup org 2  sup org 3  sup org 4
222  sup org 2  sup org 1  sup org 2  sup org 3
333  sup org 3  sup org 1  sup org 2  sup org 3
444  sup org 4  sup org 2  sup org 4  sup org 3
I've tried to use the rdd function, but I'm getting an error about the function not being whitelisted.
I've also tried the following, where sup_lookup is the first two columns of the table above and sup_org is the whole table:
dict1 = [row.asDict() for row in sup_lookup.collect()]
mapping_expr = create_map([x for x in chain(*dict1.items())])
df = sup_org \
    .withColumn('Level1', mapping_expr[sup_org['Level 1']]) \
    .withColumn('Level 2', mapping_expr[sup_org['Level2']]) \
    .withColumn('Level3', mapping_expr[sup_org['Level 2']])
but I get an error saying the list dict1 has no attribute .items().
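The .items() error happens because sup_lookup.collect() returns a list of Rows, so the comprehension produces a *list* of dicts, and a list has no .items(). Collapsing the rows into a single {id: name} dict first gives create_map the flat key/value sequence it expects. A minimal plain-Python sketch of that fix (the sample rows below stand in for the result of sup_lookup.collect()):

```python
from itertools import chain

# stand-in for [row.asDict() for row in sup_lookup.collect()]
rows = [
    {'id': 11,  'name': 'sup org 1'},
    {'id': 222, 'name': 'sup org 2'},
    {'id': 333, 'name': 'sup org 3'},
    {'id': 456, 'name': 'sup org 4'},
]

# rows is a list, so rows.items() raises AttributeError;
# collapse it into one {id: name} dict first
lookup = {r['id']: r['name'] for r in rows}

# flat alternating id/name sequence, which is the shape create_map expects
kv = list(chain(*lookup.items()))
# in PySpark this would become:
# mapping_expr = create_map(*[lit(x) for x in kv])
```

With mapping_expr built this way, mapping_expr[sup_org['level 1']] resolves each id to its name, though note this collects the whole lookup to the driver; a join avoids that.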
You can do a self join on each level column:
from pyspark.sql import functions as F
df1 = df.alias("df") \
.join(df.alias("lvl1"), F.col("lvl1.id") == F.col("df.`level 1`"), "left") \
.join(df.alias("lvl2"), F.col("lvl2.id") == F.col("df.`level 2`"), "left") \
.join(df.alias("lvl3"), F.col("lvl3.id") == F.col("df.`level 3`"), "left") \
.selectExpr("df.id", "df.name", "lvl1.name as `level 1`", "lvl2.name as `level 2`", "lvl3.name as `level 3`")
df1.show()
#+---+---------+---------+---------+---------+
#| id| name| level 1| level 2| level 3|
#+---+---------+---------+---------+---------+
#|222|sup org 2|sup org 1|sup org 2|sup org 3|
#|333|sup org 3|sup org 1|sup org 2|sup org 3|
#|444|sup org 4|sup org 2|sup org 4|sup org 3|
#| 11|sup org 1|sup org 2|sup org 3|sup org 4|
#+---+---------+---------+---------+---------+
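As a quick sanity check of what the self join computes, the same id-to-name substitution can be sketched in plain Python (sample data taken from the question, using id 444 for the last row to match the output above):

```python
# each row: id -> (name, [level 1, level 2, level 3])
data = {
    11:  ('sup org 1', [222, 333, 444]),
    222: ('sup org 2', [11, 222, 333]),
    333: ('sup org 3', [11, 222, 333]),
    444: ('sup org 4', [222, 444, 333]),
}

# id -> name lookup, i.e. the right-hand side of each join
names = {i: name for i, (name, _) in data.items()}

# replace every level id with the matching name
# (None when no id matches, mirroring the left join)
resolved = {i: [names.get(lvl) for lvl in levels]
            for i, (_, levels) in data.items()}
```

Each list in resolved matches the corresponding row of the joined DataFrame shown above.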
You can use a correlated subquery to get the corresponding names from the id:
df.createOrReplaceTempView('df')
result = spark.sql("""
select
id,
name,
(select first(df2.name) from df as df2 where df1.`level 1` = df2.id) as `level 1`,
(select first(df2.name) from df as df2 where df1.`level 2` = df2.id) as `level 2`,
(select first(df2.name) from df as df2 where df1.`level 3` = df2.id) as `level 3`
from df as df1
""")
result.show()
+---+---------+---------+---------+---------+
| id| name| level 1| level 2| level 3|
+---+---------+---------+---------+---------+
| 11|sup org 1|sup org 2|sup org 3|sup org 4|
|222|sup org 2|sup org 1|sup org 2|sup org 3|
|333|sup org 3|sup org 1|sup org 2|sup org 3|
|444|sup org 4|sup org 2|sup org 4|sup org 3|
+---+---------+---------+---------+---------+