Sorting strings in order of high-frequency terms from the Elasticsearch inverted index

I'm new to Elasticsearch and I wanted to know whether this is possible:

I have a bunch of address strings that I want to sort by the terms that recur most often across the strings.

For example:

1. Shop no 1 ABC Lane City1 - Zipcode1
2. Shop no 2 EFG Lane City1 - Zipcode2
3. Shop no 1 XYZ Lane City2 - Zipcode3
4. Shop no 3 ABC Lane City1 - Zipcode1

What I really need is to group them by the most common terms in the strings.

So the sorted output for the earlier example should be:

    1. Shop no 1 ABC Lane City1 - Zipcode1
    4. Shop no 3 ABC Lane City1 - Zipcode1 # Because 1 and 4 have the most words in common.
    2. Shop no 2 EFG Lane City1 - Zipcode2 # Second most words in common with 1 and 4.
    3. Shop no 1 XYZ Lane City2 - Zipcode3 # Not many terms in common with the others.

I have no idea how to go about it. I know I could fire each string as a query to get the results closest to that query. But I have a hundred thousand such rows, and that doesn't seem efficient at all.

If I could just do a matchall() and sort with a term filter on the terms that recur most across the strings, that would be really helpful.

Can there be a sort on the documents that contain the most of the common words in the inverted index?

Here's a sample pastebin of how my data looks: Sample Addresses
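To make the requested ordering concrete, here is a rough sketch in plain Python, outside Elasticsearch. It builds a tiny inverted index over the four example addresses and sorts each string by the summed document frequency of its distinct terms. That scoring rule and the helper names `build_inverted_index` and `sort_by_common_terms` are assumptions made for illustration, not something Elasticsearch provides out of the box.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")

addresses = [
    "Shop no 1 ABC Lane City1 - Zipcode1",
    "Shop no 2 EFG Lane City1 - Zipcode2",
    "Shop no 1 XYZ Lane City2 - Zipcode3",
    "Shop no 3 ABC Lane City1 - Zipcode1",
]


def build_inverted_index(texts):
    """Map each term to the set of positions of the strings that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for term in WORD.findall(text):
            index[term].add(doc_id)
    return index


def sort_by_common_terms(texts):
    """Sort strings so that those made of widely shared terms come first."""
    index = build_inverted_index(texts)

    def score(text):
        # Sum, over the distinct terms of the string, the number of strings
        # that contain the term (its document frequency).
        return sum(len(index[term]) for term in set(WORD.findall(text)))

    return sorted(texts, key=score, reverse=True)


for line in sort_by_common_terms(addresses):
    print(line)
```

On the four example strings this prints them in the order 1, 4, 2, 3, which is the order asked for above. Whether the same rule behaves sensibly on a hundred thousand real addresses is a separate question.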

Solution

I used https://stackoverflow.com/a/15174569/61903 (credit to @vpekar) to calculate the cosine similarity of two strings as the base similarity measure. The approach: put all the strings into a list, then loop an index i over the list. Inside that loop, iterate a position p from i+1 to the end of the list and find the p that gives the maximum cosine value between list[i] and list[p]. Both strings of that best pair are put into an out list so that they are not taken into account in later similarity calculations, and the pair is appended to the result list together with its cosine value; the data structure for this is VectorResult.

Afterwards the result list is sorted by the cosine value. We now have unique string pairs in descending order of cosine, i.e. similarity. HTH.
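For reference, this is the standard cosine similarity over term counts, which is what get_cosine in the code computes. The symbols are introduced here for illustration: a_t and b_t are the number of times term t occurs in the first and second string.

```latex
\cos(\mathbf{a}, \mathbf{b}) =
  \frac{\sum_{t} a_t \, b_t}
       {\sqrt{\sum_{t} a_t^{2}} \; \sqrt{\sum_{t} b_t^{2}}}
```

Only terms present in both strings contribute to the numerator, and the code returns 0.0 when the denominator is zero.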

import re
import math
import timeit

from collections import Counter

WORD = re.compile(r'\w+')


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in vec1.keys()])
    sum2 = sum([vec2[x] ** 2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)


class VectorResult(object):
    def __init__(self, cosine, text_1, text_2):
        self.cosine = cosine
        self.text_1 = text_1
        self.text_2 = text_2

    # Comparisons look at the cosine value only, so results can be sorted.
    def __eq__(self, other):
        return self.cosine == other.cosine

    def __le__(self, other):
        return self.cosine <= other.cosine

    def __ge__(self, other):
        return self.cosine >= other.cosine

    def __lt__(self, other):
        return self.cosine < other.cosine

    def __gt__(self, other):
        return self.cosine > other.cosine

def main():
    start = timeit.default_timer()
    texts = []
    with open('data.txt', 'r') as f:
        texts = f.readlines()

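    cosmap = []  # best pair found for each string, as VectorResult objects
    i = 0
    out = []  # strings that are already part of a pair
    while i < len(texts):
        max_cosine = 0.0
        current = None
        # Compare texts[i] with every later string and keep the best match.
        for p in range(i + 1, len(texts)):
            # Skip strings that were already paired in an earlier iteration.
            if texts[i] in out or texts[p] in out:
                continue
            vector1 = text_to_vector(texts[i])
            vector2 = text_to_vector(texts[p])
            cosine = get_cosine(vector1, vector2)
            if cosine > max_cosine:
                current = VectorResult(cosine, texts[i], texts[p])
                max_cosine = cosine
        if current:
            # Mark both strings as used and record the pair.
            out.extend([current.text_1, current.text_2])
            cosmap.append(current)
        i += 1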

    cosmap = sorted(cosmap)

    for item in reversed(cosmap):
        print(item.cosine, item.text_1, item.text_2)

    end = timeit.default_timer()

    print("Similarity Sorting of {} strings lasted {} s.".format(len(texts), end - start))

if __name__ == '__main__':
    main()
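As a quick sanity check of the similarity measure, here is a small self-contained sketch that applies the same cosine calculation to three of the example addresses from the question. The compact `cosine` helper is a rewrite for illustration and is not part of the script above.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def cosine(text_1, text_2):
    """Cosine similarity of the term-count vectors of two strings."""
    vec1 = Counter(WORD.findall(text_1))
    vec2 = Counter(WORD.findall(text_2))
    # Only terms that occur in both strings contribute to the numerator.
    numerator = sum(vec1[t] * vec2[t] for t in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(count * count for count in vec1.values()))
    norm2 = math.sqrt(sum(count * count for count in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


a = "Shop no 1 ABC Lane City1 - Zipcode1"
b = "Shop no 3 ABC Lane City1 - Zipcode1"
c = "Shop no 1 XYZ Lane City2 - Zipcode3"

print(cosine(a, b))  # 6 of 7 terms shared, so roughly 6/7
print(cosine(a, c))  # 4 of 7 terms shared, so roughly 4/7
```

Strings 1 and 4 from the question score higher with each other than string 1 does with string 3, which matches the grouping the question asks for.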

Results

I used your sample addresses at http://pastebin.com/hySkZ4Pn as test data:

1.0000000000000002 NO 15& 16 1ST FLOOR,2ND MAIN ROAD,KHB COLONY,GANDINAGAR YELAHANKA
 NO 15& 16 1ST FLOOR,2ND MAIN ROAD,KHB COLONY,GANDINAGAR YELAHANKA

1.0 # 51/3 AGRAHARA YELAHANKA
 #51/3 AGRAHARA YELAHANKA

0.9999999999999999 # C M C ROAD,YALAHANKA
 # C M C ROAD,YALAHANKA

0.8728715609439696 # 1002/B B B ROAD,YELAHANKA
 0,B B ROAD,YELAHANKA

0.8432740427115678 # LAKSHMI COMPLEX C M C ROAD,YALAHANKA
 # SRI LAKSHMAN COMPLEX C M C ROAD,YALAHANKA

0.8333333333333335 # 85/1 B B M P OFFICE ROAD,KOGILU YELAHANKA
 #85/1 B B M P OFFICE NEAR KOGILU YALAHANKA

0.8249579113843053 # 689 3RD A CROSS SHESHADRIPURAM CALLEGE OPP YELAHANKA
 # 715 3RD CROSS A SECTUR SHESHADRIPURAM CALLEGE OPP YELAHANKA

0.8249579113843053 # 10 RAMAIAIA COMPLEX B B ROAD,YALAHANKA
 # JAMATI COMPLEX B B ROAD,YALAHANKA

[ SNIPPED ]

Similarity Sorting of 702 strings lasted 8.955146235887025 s.

Cosine similarity is definitely the way to go.

Igor Motov created an Elasticsearch native script to compute this similarity value for a field across many documents.

You can take a look here.

You would use this script inside script_score or for script-based sorting.
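A hedged sketch of what those two request bodies might look like, built as plain Python dictionaries. The `function_score` / `script_score` query and the `_script` sort are Elasticsearch features, but everything inside `script_ref` is a hypothetical placeholder: the script id, the `address` field and the parameter names are not taken from the script mentioned above, and the way a script is referenced has changed between Elasticsearch versions.

```python
import json


def build_script_score_query(script_ref):
    """Score every document with the value returned by a script."""
    return {
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "script_score": {"script": script_ref},
                # Use the script value as the score instead of combining it
                # with the (constant) match_all score.
                "boost_mode": "replace",
            }
        }
    }


def build_script_sort_query(script_ref):
    """Sort every document by the numeric value returned by a script."""
    return {
        "query": {"match_all": {}},
        "sort": {
            "_script": {"type": "number", "script": script_ref, "order": "desc"}
        },
    }


# Hypothetical placeholder: adapt the id and params to the script actually
# installed and to the Elasticsearch version in use.
script_ref = {
    "id": "cosine_similarity_placeholder",
    "params": {"field": "address", "terms": ["ABC", "Lane", "City1"]},
}

print(json.dumps(build_script_score_query(script_ref), indent=2))
print(json.dumps(build_script_sort_query(script_ref), indent=2))
```

Either body would be sent to the index's _search endpoint. The first changes the relevance score, the second leaves the score alone and only changes the order.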

