
Spark: compare two dataframes with different schemas row by row

I am trying to identify duplicate rows between arbitrary dataframes.

I will not know in advance which columns are duplicated, since the two dataframes may have different column names.

df1 with columns a, b, c
a = [1,2,3,4,5]
b = [1,1,1,1,1]
c = [5,6,7,8,9]

df2 with columns x, y, z
x = [1,2,3,5,6]
y= [1,1,1,1,1]
z = [8,9,10,1,11]

The expected match-rate table would be:

df1   df2   match rate
a     x     80%
b     y     100%
c     z     40%
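A small plain-Python check of the metric behind this table: the match rate for a column pair is read here as the fraction of values in the df1 column that also appear (anywhere, not just at the same position) in the df2 column. The `match_rate` helper is illustrative, not part of the original question.

```python
# Sample data from the question
a = [1, 2, 3, 4, 5]
b = [1, 1, 1, 1, 1]
c = [5, 6, 7, 8, 9]
x = [1, 2, 3, 5, 6]
y = [1, 1, 1, 1, 1]
z = [8, 9, 10, 1, 11]

def match_rate(left, right):
    """Fraction of left's values that appear anywhere in right."""
    right_set = set(right)
    return sum(v in right_set for v in left) / len(left)

print(match_rate(a, x))  # 0.8
print(match_rate(b, y))  # 1.0
print(match_rate(c, z))  # 0.4
```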

and the code should check a against x, y, z; b against x, y, z; and so on.

The expected result is a new dataframe containing the column names from the two joined dataframes and their match rates.

I tried several approaches using plain joins and intersect, but got nothing close; any help is appreciated.

Do your df1 and df2 share the same key, such as a row_number? If so, you could join them together using:

df_joined = df1.join(df2, "key")

After the join, you could build an array of (a, b, c) called left_array and an array of (x, y, z) called right_array using `F.array` (note that VectorAssembler produces a `Vector`, which the array functions will not accept).

Once you have the arrays, use `array_intersect` to get the intersection of the two fields; dividing its size by the array length gives you the overlap percentage.
