
How to find Levenshtein distance between 1 million article titles, where every title is compared to every other title?

I have a large pandas DataFrame with 1 million rows, and I want the Levenshtein distance between every pair of values in one column. I tried cross-merging the column with itself to generate the Cartesian product and then applying the Levenshtein distance function to the result, but this is too computationally expensive: it would require a DataFrame of 1 trillion rows, and I'm working on a personal computer.

import pandas as pd
# assumption: distance() comes from the Levenshtein package; any edit-distance function would do
from Levenshtein import distance

# DataFrame with 1M rows
df = pd.read_csv('titles_dates_links.csv')

df1 = pd.DataFrame(df['title'])
df2 = pd.DataFrame(df['title'])

# df3 is just too big for me to work with: 1 trillion rows
df3 = df1.merge(df2, how='cross')

# something like this is the function I want to apply
df3['distance'] = df3.apply(lambda x: distance(x.title_x, x.title_y), axis=1)

I was thinking that a 1M x 1M matrix with each element holding a pair of titles ("title 1", "title 2") would be cheaper, but I'm having a hard time getting that data structure right. I also don't know whether this is the right solution, since ultimately I'm just interested in calculating the distance between every possible combination of titles.

I've been trying to use the pivot functions in pandas, but these require the complete dataset to exist in the first place, and I can't generate the table I would pivot from because it is too large with the approaches I've tried.

Using product from itertools should work for your case as it generates everything lazily.

from itertools import product
titles = df['title'].tolist()
result = product(titles, titles)

From there you can iterate over the lazy iterator and apply your Levenshtein distance function to each pair :)
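As a rough sketch of that iteration: Levenshtein distance is symmetric, so itertools.combinations can stand in for product and visit each unordered pair once, roughly n(n-1)/2 (about 5 x 10^11 for 1 million titles) instead of 10^12. That is still a lot of work, so the sketch below only keeps pairs under a threshold and skips pairs whose length difference already exceeds it. The names levenshtein, close_pairs, max_distance and sample_titles are made up for illustration, and the pure-Python distance function is only there to keep the example self-contained; a compiled implementation would be much faster in practice.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def close_pairs(titles, max_distance):
    """Lazily yield (i, j, d) for each unordered pair with d <= max_distance."""
    for (i, a), (j, b) in combinations(enumerate(titles), 2):
        # The length difference is a lower bound on the edit distance,
        # so pairs that fail this check can be skipped without computing it.
        if abs(len(a) - len(b)) > max_distance:
            continue
        d = levenshtein(a, b)
        if d <= max_distance:
            yield i, j, d


# Hypothetical stand-in for df['title'].tolist()
sample_titles = ["kitten", "sitting", "mitten", "a much longer unrelated title"]
matches = list(close_pairs(sample_titles, max_distance=3))
```

Because close_pairs is a generator, nothing is materialised until you consume it, so results can be written to disk in batches rather than held in a trillion-row DataFrame.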
