
Spark DataFrame: handling duplicated column names when joining, using a generic method

There are two dataframes

Data1 = [("1", None, "a", "Kelvin"),
         ("2", "1", "b", "Ho"),
         ("2", "2", "b", "Ho"),
         ("7", "1", "c", "Shuai"),
        ]
col1= ["ID","s_name","group","name"]
tableA = spark.createDataFrame(data = Data1, schema = col1)
    
Data2 = [("1", "1", "bird"),
         ("2", None, "tiger"),
        ]
col2= ["ID","s_name","classes"]
tableB = spark.createDataFrame(data = Data2, schema = col2)

When they are joined with tableA.join(tableB,["ID"],"left"), the columns of the new dataframe are: ['ID', 's_name', 'group', 'name', 's_name', 'classes']
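
The names that end up duplicated can be computed up front from the two column lists. A minimal sketch in plain Python, where find_ambiguous is a hypothetical helper (not part of PySpark) and the literal lists stand in for tableA.columns and tableB.columns:

```python
def find_ambiguous(left_cols, right_cols, join_keys):
    """Return the names present on both sides that are not join keys, in left-side order."""
    right = set(right_cols)
    keys = set(join_keys)
    return [c for c in left_cols if c in right and c not in keys]


# Stand-ins for tableA.columns and tableB.columns from the example above
ambiguous_name = find_ambiguous(
    ["ID", "s_name", "group", "name"],
    ["ID", "s_name", "classes"],
    ["ID"],
)
print(ambiguous_name)  # ['s_name']
```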

Since both tables contain an s_name column, the one coming from tableB becomes ambiguous if we want to coalesce the joined dataframe later, so I rename it first:

tableB = tableB.withColumnRenamed("s_name","s_name_2")

Then, after the join, I coalesce the two columns and drop the renamed one:

from pyspark.sql.functions import coalesce, col

val = "s_name"
tableA.join(tableB,["ID"],"left").withColumn(val,coalesce(col(val),col(val+"_2"))).drop(val+"_2")

The problem with this approach appears when I want to handle all ambiguous names generically. For instance, if the column names shared by both tables are stored in a list called ambiguous_name, I do:

for val in ambiguous_name:
  tableB = tableB.withColumnRenamed(val,val+"_2")

and then drop the renamed columns using:

joined_table = tableA.join(tableB,["ID"],"left").drop("key_2")
for val in ambiguous_name:
  joined_table = joined_table.drop(val+"_2")

But suppose tableB already contains a column named s_name_2, something like this:

Data2 = [("1", "1", "test", "bird"),
         ("2", None, "test2", "tiger"),
        ]
col2= ["ID","s_name","s_name_2","classes"]
tableB = spark.createDataFrame(data = Data2, schema = col2)

Of course I can use tableB = tableB.withColumnRenamed(val,val+"_3") instead, but what happens if col2= ["ID","s_name","s_name_3","classes"]?

Is there a generic suffix-naming scheme that resolves this?

If I understand you correctly, the issue can be resolved by using alias.

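new = tableA.alias('tableA').join(tableB.alias('tableB'),["ID"],"left")

# withColumnRenamed does not match alias-qualified names such as 'tableB.s_name',
# so rename the tableB copy through select(); call show() separately because it returns None
new = new.select("ID", col("tableA.s_name"), "group", "name", col("tableB.s_name").alias("s_name_3"), "classes")
new.show()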
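
Aliases avoid the ambiguity at query time. If renamed columns are still wanted, one option is to generate a suffix that is checked against the columns of both tables, so it cannot collide with an existing s_name_2 or s_name_3. The following is only a sketch in plain Python: safe_renames is a hypothetical helper (not a PySpark function), and the literal lists stand in for tableA.columns and tableB.columns.

```python
def safe_renames(left_cols, right_cols, join_keys):
    """Map each clashing right-side column to a new name unused on either side."""
    left = set(left_cols)
    keys = set(join_keys)
    taken = left | set(right_cols)  # every name that is already in use
    renames = {}
    for name in right_cols:
        # Only non-key columns that also exist on the left need a new name
        if name in keys or name not in left:
            continue
        n = 2
        while f"{name}_{n}" in taken:  # skip suffixes that already exist
            n += 1
        renames[name] = f"{name}_{n}"
        taken.add(renames[name])  # reserve it so later renames cannot reuse it
    return renames


# tableB here already has an s_name_2 column, as in the question
renames = safe_renames(
    ["ID", "s_name", "group", "name"],
    ["ID", "s_name", "s_name_2", "classes"],
    ["ID"],
)
print(renames)  # {'s_name': 's_name_3'}
```

The mapping could then be applied with tableB.withColumnRenamed(old, new) for each pair before the join, and each pair coalesced and the temporary column dropped afterwards, as in the question.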
