Can I cluster these records without having to run these loops for every record?

Question

So I want to cluster the records in this table to find which records are 'similar' (i.e. have enough in common). An example of the table is as follows:

        author beginpage endpage volume publication year  id_old  id_new
0          NaN       495     497    NaN             1975       1       1
1          NaN       306     317     14             1997       2       2
2        lowry       265     275    193             1951       3       3
3    smith p k        76      85    150             1985       4       4
4          NaN       248     254    NaN             1976       5       5
5     hamill p        85     100    391             1981       6       6
6          NaN      1513    1523      7             1979       7       7
7     b oregan       737     740    353             1991       8       8
8          NaN       503     517     98             1975       9       9
9      de wijs       503     517     98             1975       10      10

In this small table, the last row should get 'id_new' equal to 9, to show that the last two records are similar.

To make this happen I wrote the code below, which works fine for a small number of records. However, I want to use it on a table with 15000 records, and because it compares every pair of records, it is going to take far too long. Can anyone help me make this code more efficient? Thanks in advance!

My code, where 'dfhead' is the table with the records:

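# c is not defined in this post; it is presumably the list of columns to compare
for r in range(len(dfhead)):
    for o_r in range(r + 1, len(dfhead)):
        # count the columns on which both records agree (NaN never equals NaN)
        if (dfhead.loc[r, c] == dfhead.loc[o_r, c]).sum() >= 3:
            # give the later record the smaller id of the earlier one
            if dfhead.loc[o_r, 'id_new'] > dfhead.loc[r, 'id_new']:
                dfhead.loc[o_r, 'id_new'] = dfhead.loc[r, 'id_new']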
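
For a rough sense of scale: the nested loops visit every unordered pair of records once, so the number of row comparisons is n(n-1)/2. A minimal sketch of that arithmetic for the 15000 records mentioned above (the variable names are only illustrative):

```python
import math

n_records = 15000

# number of unordered pairs (r, o_r) with r < o_r, i.e. n * (n - 1) / 2
n_pairs = math.comb(n_records, 2)
print(n_pairs)  # 112492500
```

Each of those pairs costs several `.loc` lookups, which is why the loop version becomes impractical at this size.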

Answer 1

If you are only trying to detect exact matches on "beginpage", "endpage", "volume", "publication" and "year", you could work with duplicates. I'm not sure about this, as your code is still a mystery to me.

Something like this might work (your "id" column needs to be named "id_old" in the dataframe first, though):

import pandas as pd

# assumes df has an "id_old" column and no "id_new" column yet
cols = ["beginpage", "endpage", "volume", "publication", "year"]

# isolate the rows that share all of cols with at least one other row
duplicated = df[df.duplicated(cols, keep=False)]

# find the minimum key to keep in each group of duplicates
# (dropna=False keeps groups whose key columns contain NaN; needs pandas >= 1.1)
temp = duplicated.groupby(cols, as_index=False, dropna=False)['id_old'].min()
temp = temp.rename({'id_old': 'id_new'}, axis=1)

# bring the "minimum key" into duplicated by merging the dataframes
duplicated = duplicated.merge(temp, on=cols, how="left")

# gather the "un-duplicated" rows
unduplicated = df[~df.duplicated(cols, keep=False)]

# concatenate both datasets and reset the index
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
new_df = pd.concat([unduplicated, duplicated])
new_df.reset_index(drop=True, inplace=True)

# where "id_new" is empty, the row comes from "unduplicated",
# so fill it from id_old
ix = new_df[new_df.id_new.isnull()].index
new_df.loc[ix, 'id_new'] = new_df.loc[ix, 'id_old']
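
The steps above can probably be condensed with `groupby(...).transform("min")`, and the same idea can be stretched to the question's "at least 3 columns in common" rule by grouping on every 3-column combination. The following is only a sketch under assumptions: the toy rows (including the last one), the helper name `min_id_partial` and the `id_partial` column are made up for illustration, "publication" is left out because the sample table shows no values for it, and the index is assumed to be unique. Note that the partial version does a single pass over direct matches, so it may not reproduce the original loops exactly when matches chain through several records.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy data shaped like the question's table (values are illustrative).
df = pd.DataFrame({
    "author": [np.nan, "lowry", np.nan, "de wijs", np.nan],
    "beginpage": [495, 265, 503, 503, 503],
    "endpage": [497, 275, 517, 517, 517],
    "volume": [np.nan, 193, 98, 98, 99],
    "year": [1975, 1951, 1975, 1975, 1975],
    "id_old": [1, 3, 9, 10, 11],
})
cols = ["beginpage", "endpage", "volume", "year"]

# Exact matches: every row gets the smallest id_old of its group.
# dropna=False keeps rows whose key columns contain NaN.
df["id_new"] = df.groupby(cols, dropna=False)["id_old"].transform("min")


def min_id_partial(frame, columns, id_col="id_old", k=3):
    """Smallest id among rows that agree on at least k of the columns.

    Hypothetical helper: two rows agree on >= k columns exactly when they
    fall in the same group for some k-column combination.
    """
    best = frame[id_col].copy()
    for combo in combinations(columns, k):
        keys = list(combo)
        # NaN never counts as a match, as in the original == comparison
        sub = frame.dropna(subset=keys)
        if sub.empty:
            continue
        group_min = sub.groupby(keys)[id_col].transform("min")
        best.loc[sub.index] = np.minimum(best.loc[sub.index], group_min)
    return best


df["id_partial"] = min_id_partial(df, cols)
print(df[["id_old", "id_new", "id_partial"]])
```

With 5 columns and k=3 this runs 10 group-bys instead of comparing every pair of rows, which should scale far better to 15000 records.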
