
Finding intersection between two dataframes iteratively

I have the following two dataframes and would like to find their intersection.

import pandas as pd

df1 = pd.DataFrame({"0": [1524, 8788, 9899, 27172],
                   "1": [1333, 4476, 78783, 90832],
                   "2": [2021, 2022, 34522, 38479]})

print(df1)

      0      1      2
0   1524   1333   2021
1   8788   4476   2022
2   9899  78783  34522
3  27172  90832  38479

df2 has a single column '0' whose values are lists. It looks like this:

          0
[1123, 2021, 1333, 6636], 
[1245, 2022, 4477, 0], 
[1524, 2023, 1, 27172], 
[2021, 2023, 90832, 38479]

Expected output should be intersection of df1 and df2, for example:

df3 = [2021, 1333],
      [2022],
      [0],
      [90832, 38479]

What I have read so far covers finding the intersection of a single pair of lists, not of two dataframes that hold different data types. My end goal is to compute precision, which is the size of the intersection of df1 and df2 divided by the total number of my recommendations from df1, which is 3.

Additional note from the comments below: the rows are aligned and should be compared pairwise. The [0] in df3 is not an actual match; it is meant as a placeholder for the case where the intersection is empty.

Given

df1 :

       0      1      2
0   1524   1333   2021
1   8788   4476   2022
2   9899  78783  34522
3  27172  90832  38479

and df2 :

                            0
0    [1123, 2021, 1333, 6636]
1       [1245, 2022, 4477, 0]
2      [1524, 2023, 1, 27172]
3  [2021, 2023, 90832, 38479]

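You can use set.intersection inside a list comprehension:

# Convert both frames to nested lists; each df2 row becomes [[...]], hence j[0] below
df1_lst = df1.to_numpy().tolist()
df2_lst = df2.to_numpy().tolist()
# Intersect each df1 row with the list stored in the matching df2 row
df3 = pd.DataFrame([[list(set(i).intersection(j[0]))] for i, j in zip(df1_lst, df2_lst)], columns=['col'])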

Output:

              col
0    [1333, 2021]
1          [2022]
2              []
3  [90832, 38479]
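
One caveat on the output above: Python sets are unordered, so the order of the values inside each list is not guaranteed. If the order of the recommendations in df1 matters, a minimal sketch along these lines keeps it. The helper name `ordered_intersection` is made up for illustration, and df2 is rebuilt here on the assumption that its column '0' holds plain Python lists, as in the question.

```python
import pandas as pd

df1 = pd.DataFrame({"0": [1524, 8788, 9899, 27172],
                    "1": [1333, 4476, 78783, 90832],
                    "2": [2021, 2022, 34522, 38479]})

# Assumption: df2 has one column "0" whose cells are Python lists
df2 = pd.DataFrame({"0": [[1123, 2021, 1333, 6636],
                          [1245, 2022, 4477, 0],
                          [1524, 2023, 1, 27172],
                          [2021, 2023, 90832, 38479]]})


def ordered_intersection(recs, truth):
    """Return the items of recs that also occur in truth, in recs order."""
    truth_set = set(truth)  # set gives O(1) membership tests
    return [x for x in recs if x in truth_set]


# Rows are aligned, so compare them pairwise
rows = [ordered_intersection(recs, truth)
        for recs, truth in zip(df1.to_numpy().tolist(), df2["0"])]
df3 = pd.DataFrame({"col": rows})
print(df3)
```

Unlike the set-based version, this keeps duplicates if a df1 row repeats a value, which may or may not be what you want.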

Alternatively, build one set per row on each side and intersect them pairwise:

lst = [[1123, 2021, 1333, 6636],
       [1245, 2022, 4477, 0],
       [1524, 2023, 1, 27172],
       [2021, 2023, 90832, 38479]]

s = [set(x) for x in lst]  # one set per row of df2

s1 = df1.agg(set, 1).to_list()  # one set per row of df1

[list(x.intersection(y)) for x, y in zip(s, s1)]

Output:

[[1333, 2021], [2022], [], [90832, 38479]]
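
Since the end goal in the question is precision, here is a rough sketch of that last step. It assumes precision is wanted per row (number of matches divided by the 3 recommendations in that row) and that averaging over rows is acceptable; the question does not say either explicitly. The names `relevant`, `hits`, `k` and `mean_precision` are just illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"0": [1524, 8788, 9899, 27172],
                    "1": [1333, 4476, 78783, 90832],
                    "2": [2021, 2022, 34522, 38479]})

# The lists from df2, one per row, aligned with df1
relevant = [[1123, 2021, 1333, 6636],
            [1245, 2022, 4477, 0],
            [1524, 2023, 1, 27172],
            [2021, 2023, 90832, 38479]]

k = df1.shape[1]  # number of recommendations per row (3 here)

# Count the matches in each pair of aligned rows
hits = [len(set(recs) & set(rel))
        for recs, rel in zip(df1.to_numpy().tolist(), relevant)]

# Per-row precision; an empty intersection simply gives 0.0
precision = pd.Series(hits, index=df1.index) / k
mean_precision = precision.mean()

print(precision)
print(mean_precision)
```

With this approach there is no need for the [0] placeholder: an empty intersection has length 0, so its precision comes out as 0 on its own.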


 