So I want to cluster the records in this table to find which records are 'similar' (ie have enough in common). An example of the table is as follows:
author beginpage endpage volume publication year id_old id_new
0 NaN 495 497 NaN 1975 1 1
1 NaN 306 317 14 1997 2 2
2 lowry 265 275 193 1951 3 3
3 smith p k 76 85 150 1985 4 4
4 NaN 248 254 NaN 1976 5 5
5 hamill p 85 100 391 1981 6 6
6 NaN 1513 1523 7 1979 7 7
7 b oregan 737 740 353 1991 8 8
8 NaN 503 517 98 1975 9 9
9 de wijs 503 517 98 1975 10 10
In this small table, the last row should get 'new_id' equal to 9, to show that these two records are similar.
To make this happen I wrote the code below, which works fine for a small number of records. However, I want to use my code for a table with 15000 records. And of course, if you do the maths, with this code this is going to take way too long. Anyone who could help me make this code more efficient? Thanks in advance!
My code, where 'dfhead' is the table with the records:
for r in range(0,len(dfhead)):
for o_r in range(r+1,len(dfhead)):
if ((dfhead.loc[r,c] == dfhead.loc[o_r,c]).sum() >= 3) :
if (dfhead.loc[o_r,['id_new']] > dfhead.loc[r,['id_new']]).sum() ==1:
dfhead.loc[o_r,['id_new']] = dfhead.loc[r,['id_new']]
If you are only trying to detect whole equalities between "beginpage", "endpage","volume", "publication", "year", you should try to work on duplicates. I'm not sure about this as your code is still a mistery for me.
Something like this might work (your column "id" needs to be named "id_old" at first in the dataframe though):
cols = ["beginpage", "endpage","volume", "publication", "year"]
#isolate duplicated rows
duplicated = df[df.duplicated(cols, keep=False)]
#find the minimum key to keep
temp = duplicated.groupby(cols, as_index=False)['index'].min()
temp.rename({'id_old':'id_new'}, inplace=True, axis=1)
#import the "minimum key" to duplicated by merging the dataframes
duplicated = duplicated.merge(temp, on=cols, how="left")
#gather the "un-duplicated" rows
unduplicated = df[~df.duplicated(cols, keep=False)]
#concatenate both datasets and reset the index
new_df = unduplicated.append(duplicated)
new_df.reset_index(drop=True, inplace=True)
#where "id_new" is empty, then the data comes from "unduplicated"
#and you could fill the datas from id_old
ix = new_df[new_df.id_new.isnull()].index
new_df.loc[ix, 'id_new'] = new_df.loc[ix, 'id_old']
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.