How to find the Levenshtein distance between 1 million article titles, where each title is compared with every other title?

Question

I have a large pandas DataFrame consisting of 1 million rows, and I want to get the Levenshtein distance between every pair of entries in one column of the DataFrame. I tried merging the column with itself to generate the Cartesian product and then applying the Levenshtein distance function to the result, but this is too computationally expensive: it would require a DataFrame of 1 trillion rows, and I'm working from a personal computer.

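import pandas as pd
# assumption: the question does not say where distance() comes from;
# the Levenshtein package provides a function with this name
from Levenshtein import distance

# DataFrame with 1m rows
df = pd.read_csv('titles_dates_links.csv')

df1 = pd.DataFrame(df['title'])
df2 = pd.DataFrame(df['title'])

# df3 is just too big for me to work with, 1 trillion rows
df3 = df1.merge(df2, how='cross')

# something like this is the function I want to apply
df3['distance'] = df3.apply(lambda x: distance(x.title_x, x.title_y), axis=1)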

I was thinking that a 1m x 1m matrix with each element as a pair of titles ("title 1", "title 2") would be cheaper, but I'm having a hard time getting that data structure right. Furthermore, I don't know if this is the right solution, since ultimately I'm just interested in calculating the distance between every possible combination of titles.

I've been trying to use the pivot functions in pandas, but these require the complete dataset to exist in the first place. The issue is that I can't generate the table I would pivot from, since it's too large with the approaches I've been trying.

Answer 1

Using product from itertools should work for your case, as it generates the pairs lazily instead of holding them all in memory.

from itertools import product
titles = df['title'].tolist()
result = product(titles, titles)

From there you can just iterate over the lazy iterator and apply your Levenshtein distance function to each pair :)
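As a rough sketch of what that iteration could look like, the snippet below streams pairs and keeps only the close ones. It makes a few assumptions that are not in the answer above. It swaps product for itertools.combinations, because Levenshtein distance is symmetric and each unordered pair only needs to be visited once. The levenshtein and close_pairs helpers, the max_distance threshold, and the sample titles are all hypothetical. The pure-Python levenshtein stands in for whatever distance function you already use.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance that keeps only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def close_pairs(titles, max_distance):
    """Lazily yield (title_a, title_b, distance) for pairs within max_distance."""
    for title_a, title_b in combinations(titles, 2):
        d = levenshtein(title_a, title_b)
        if d <= max_distance:
            yield title_a, title_b, d


# Hypothetical sample data; in the question this would be df['title'].tolist()
sample_titles = ["kitten", "sitting", "mitten", "fitting"]
for title_a, title_b, d in close_pairs(sample_titles, max_distance=2):
    print(title_a, title_b, d)
```

Laziness only solves the memory problem. With 1 million titles there are still 1,000,000 × 999,999 / 2, roughly 5 × 10^11, unordered pairs to score, so a plain loop like this will run for a very long time. Writing matches out as they are found, rather than collecting them in a list, keeps memory use flat.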

Solution 1
2 2023-01-23 16:42:02
