Sorting strings by the most frequent terms in the Elasticsearch inverted index

Question

I'm new to Elasticsearch and I wanted to know whether this is possible:

I have a bunch of address strings that I want to sort by the terms that repeat most often across them.

For example:

1. Shop no 1 ABC Lane City1 - Zipcode1
2. Shop no 2 EFG Lane City1 - Zipcode2
3. Shop no 1 XYZ Lane City2 - Zipcode3
4. Shop no 3 ABC Lane City1 - Zipcode1

What I really need is to group them together by the terms they have in common.

So the sorted output for the earlier example should be:

    1. Shop no 1 ABC Lane City1 - Zipcode1
    4. Shop no 3 ABC Lane City1 - Zipcode1 # Because 1 and 4 have the most words in common.
    2. Shop no 2 EFG Lane City1 - Zipcode2 # Second most words in common with 1 and 4.
    3. Shop no 1 XYZ Lane City2 - Zipcode3 # Not many terms in common with the others.

I have no idea how to go about it. I know I could fire each string as a query to get the results closest to that query. But I have a hundred thousand such rows, and that doesn't seem to be an efficient option at all.
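To make the per-string approach concrete, here is a minimal sketch that only builds the request bodies for a `match` query, one per address. The field name `address` and the helper `build_match_query` are assumptions for illustration, and nothing is sent to a cluster.

```python
def build_match_query(field, text):
    """Return a request body for an Elasticsearch `match` query on one field."""
    return {"query": {"match": {field: text}}}


addresses = [
    "Shop no 1 ABC Lane City1 - Zipcode1",
    "Shop no 2 EFG Lane City1 - Zipcode2",
]

# One request body per address: with 100,000 rows this means 100,000 searches.
# "address" is a hypothetical field name.
bodies = [build_match_query("address", text) for text in addresses]
print(bodies[0])
```

Each body would have to be sent as a separate search, which is why this approach scales poorly with the number of rows.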

If I could just run a matchall() and sort with a term filter on the terms that recur most often across the strings, that would be really helpful.

Can documents be sorted by how many similar words they share in the inverted index?

Here's a sample pastebin showing what my data looks like: Sample Addresses

Answer 1

Solution

I used https://stackoverflow.com/a/15174569/61903 (credits to @vpekar), which calculates the cosine similarity of two strings, as the base similarity algorithm. First I put all the strings into a list. Then I set an index i to 0 and loop for as long as i is within the length of the list. Within that loop I iterate a position p from i+1 to length(list) and find the maximum cosine value between list[i] and list[p]. Both text strings of the best pair are put into an out list so that they are not taken into account in later similarity calculations. They are also put into the result list along with the cosine value; the data structure for this is VectorResult.

Afterwards the list is sorted by the cosine value. We now have unique string pairs in descending order of cosine, that is, similarity value. HTH.
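For reference, the measure computed by get_cosine is the standard cosine similarity of two term-count vectors a and b, where a_t is the number of times term t occurs in the first string. The worked figure assumes the same `\w+` tokenization as the code and uses addresses 1 and 4 from the question, which have 7 distinct tokens each and share 6 of them.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\sum_{t} a_t \, b_t}
         {\sqrt{\sum_{t} a_t^{2}} \; \sqrt{\sum_{t} b_t^{2}}}
\qquad\text{e.g.}\qquad
\frac{6}{\sqrt{7}\,\sqrt{7}} = \frac{6}{7} \approx 0.857
```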

import re
import math
import timeit

from collections import Counter

WORD = re.compile(r'\w+')


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in vec1.keys()])
    sum2 = sum([vec2[x] ** 2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)


class VectorResult(object):
    """A pair of strings and their cosine similarity; compared by cosine only."""

    def __init__(self, cosine, text_1, text_2):
        self.cosine = cosine
        self.text_1 = text_1
        self.text_2 = text_2

    def __eq__(self, other):
        return self.cosine == other.cosine

    def __le__(self, other):
        return self.cosine <= other.cosine

    def __ge__(self, other):
        return self.cosine >= other.cosine

    def __lt__(self, other):
        return self.cosine < other.cosine

    def __gt__(self, other):
        return self.cosine > other.cosine


def main():
    start = timeit.default_timer()
    with open('data.txt', 'r') as f:
        texts = f.readlines()

    cosmap = []  # best VectorResult found for each unpaired string
    out = []     # strings that already belong to a pair
    for i in range(len(texts)):
        max_cosine = 0.0
        current = None
        # Compare texts[i] with every later string and keep the best match.
        for p in range(i + 1, len(texts)):
            # Skip strings that were paired in an earlier iteration.
            if texts[i] in out or texts[p] in out:
                continue
            vector1 = text_to_vector(texts[i])
            vector2 = text_to_vector(texts[p])
            cosine = get_cosine(vector1, vector2)
            if cosine > max_cosine:
                current = VectorResult(cosine, texts[i], texts[p])
                max_cosine = cosine
        if current:
            out.extend([current.text_1, current.text_2])
            cosmap.append(current)

    cosmap = sorted(cosmap)

    for item in reversed(cosmap):
        print(item.cosine, item.text_1, item.text_2)

    end = timeit.default_timer()

    print("Similarity Sorting of {} strings lasted {} s.".format(len(texts), end - start))

if __name__ == '__main__':
    main()

Results

I used your sample addresses at http://pastebin.com/hySkZ4Pn as test data:

1.0000000000000002 NO 15& 16 1ST FLOOR,2ND MAIN ROAD,KHB COLONY,GANDINAGAR YELAHANKA
 NO 15& 16 1ST FLOOR,2ND MAIN ROAD,KHB COLONY,GANDINAGAR YELAHANKA

1.0 # 51/3 AGRAHARA YELAHANKA
 #51/3 AGRAHARA YELAHANKA

0.9999999999999999 # C M C ROAD,YALAHANKA
 # C M C ROAD,YALAHANKA

0.8728715609439696 # 1002/B B B ROAD,YELAHANKA
 0,B B ROAD,YELAHANKA

0.8432740427115678 # LAKSHMI COMPLEX C M C ROAD,YALAHANKA
 # SRI LAKSHMAN COMPLEX C M C ROAD,YALAHANKA

0.8333333333333335 # 85/1 B B M P OFFICE ROAD,KOGILU YELAHANKA
 #85/1 B B M P OFFICE NEAR KOGILU YALAHANKA

0.8249579113843053 # 689 3RD A CROSS SHESHADRIPURAM CALLEGE OPP YELAHANKA
 # 715 3RD CROSS A SECTUR SHESHADRIPURAM CALLEGE OPP YELAHANKA

0.8249579113843053 # 10 RAMAIAIA COMPLEX B B ROAD,YALAHANKA
 # JAMATI COMPLEX B B ROAD,YALAHANKA

[ SNIPPED ]

Similarity Sorting of 702 strings lasted 8.955146235887025 s.
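To relate this back to the four example addresses in the question, here is a condensed, self-contained sketch of the same greedy pairing. It is not the code that produced the results above: the names `cosine` and `pair_by_similarity` are assumptions for illustration, and it tracks paired strings by index rather than by value.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def cosine(a, b):
    """Cosine similarity of two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_by_similarity(texts):
    """Greedily pair each unpaired string with its most similar later string."""
    vectors = [Counter(WORD.findall(t)) for t in texts]
    used = set()  # indices that already belong to a pair
    pairs = []
    for i in range(len(texts)):
        if i in used:
            continue
        best, best_p = 0.0, None
        for p in range(i + 1, len(texts)):
            if p in used:
                continue
            score = cosine(vectors[i], vectors[p])
            if score > best:
                best, best_p = score, p
        if best_p is not None:
            used.update((i, best_p))
            pairs.append((best, texts[i], texts[best_p]))
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


addresses = [
    "Shop no 1 ABC Lane City1 - Zipcode1",
    "Shop no 2 EFG Lane City1 - Zipcode2",
    "Shop no 1 XYZ Lane City2 - Zipcode3",
    "Shop no 3 ABC Lane City1 - Zipcode1",
]

pairs = pair_by_similarity(addresses)
# Flatten the pairs into a single ordered list of addresses.
ordered = [text for _, first, second in pairs for text in (first, second)]
for score, first, second in pairs:
    print(round(score, 3), "|", first, "|", second)
```

Flattening the sorted pairs puts addresses 1 and 4 first, followed by 2 and 3, which is the order asked for in the question. Like the full script, this compares every string with every later string, so the work grows quadratically with the number of rows.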

Answer 2

Cosine similarity is definitely the way to go.

Igor Motov created an Elasticsearch native script to compute this similarity value for a field across many documents.

You can take a look here.

You would use this script inside script_score or for script-based sorting.

2 answers

Answer 1: score 3, accepted, 2016-02-01 00:24:59

Answer 2: score 1, 2016-02-02 05:38:53