How to find Levenshtein distance between 1 million article titles, where every title is compared to every other title?

I have a large pandas DataFrame with 1 million rows, and I want the Levenshtein distance between every pair of values in one of its columns. I tried merging the column with itself to generate the Cartesian product and then applying the Levenshtein distance function to the result, but this is too computationally expensive: it would require a DataFrame of 1 trillion rows, and I'm working on a personal computer.

import pandas as pd
from Levenshtein import distance  # assumption: the post does not say where distance() comes from

# DataFrame with 1M rows
df = pd.read_csv('titles_dates_links.csv')

df1 = pd.DataFrame(df['title'])
df2 = pd.DataFrame(df['title'])

# df3 is just too big for me to work with: 1 trillion rows
df3 = df1.merge(df2, how='cross')

# something like this is the function I want to apply
df3['distance'] = df3.apply(lambda x: distance(x.title_x, x.title_y), axis=1)

I was thinking that a 1M x 1M matrix with each element holding a pair of titles ("title 1", "title 2") would be cheaper, but I'm having a hard time getting that data structure right. I also don't know if this is the right solution, since ultimately I'm just interested in calculating the distance between every possible combination of titles.

I've been trying to use the pivot functions in pandas, but these require the complete dataset to exist in the first place. The issue is that I can't generate the table I would pivot from, since it's too large with the approaches I've tried.

Using product from itertools should work for your case, as it generates the pairs lazily.

from itertools import product
titles = df['title'].tolist()
result = product(titles, titles)

And from there you can just iterate over the lazy iterator and apply your Levenshtein distance function :)
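As a rough sketch of what that iteration might look like, the snippet below streams pairs and distances through a generator so that nothing the size of the full product is ever stored. Two things here are assumptions rather than part of the answer above. First, it swaps `product` for `itertools.combinations`: since Levenshtein distance is symmetric, visiting each unordered pair once gives about 5×10^11 pairs for 1 million titles instead of 10^12. Second, `levenshtein`, `pairwise_distances`, and `sample_titles` are hypothetical names, and the pure-Python distance function is included only to keep the example self-contained.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def pairwise_distances(titles):
    """Lazily yield (title_a, title_b, distance) for each unordered pair."""
    for a, b in combinations(titles, 2):
        yield a, b, levenshtein(a, b)


# Hypothetical stand-in for df['title'].tolist()
sample_titles = ["kitten", "sitting", "mitten"]

for title_a, title_b, dist in pairwise_distances(sample_titles):
    print(title_a, title_b, dist)
```

Laziness only addresses memory: the number of comparisons is still quadratic in the number of titles, so for the full dataset a compiled distance function (such as the `distance` used in the question) and writing results out in batches would likely be needed.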
