
Python: How do I quantify the difference between two sets of words?

I have a use case with the following search text:

search_text = '12 Jim Smith'

I have a list of tuples that I need to search for the above search_text, returning the matching tuples:

In [1742]: l
Out[1742]: 
[(1234, 'Jim Beam'),
 (13, 'Mark Smith'),
 (12, 'Jim Jones'),
 (23, 'Adam Smith'),
 (15, 'Mark Taylor'),
 (123, 'Mark Adam')]

I need to return the matching tuples from l, each with a score that reflects how relevant it is to the search, something like this:

Expected output:

In [1698]: ans
Out[1698]: 
{
 (12, 'Jim Jones'):  66, # matches ~66% of the search_text. `12` and `Jim` both match
 (1234, 'Jim Beam'): 55, # matches ~55% of the search_text. `12` from `1234` and `Jim` both match
 (13, 'Mark Smith'): 33, # matches ~33% of the search_text. `Smith` matches
 (23, 'Adam Smith'): 33, # matches ~33% of the search_text. `Smith` matches
 (123, 'Mark Adam'): 20  # matches ~20% of the search_text. Just `12` matches from `123`
} 

Note: The values in the dict above are arbitrary. They only indicate the approximate percentage of the match.

Below is my attempt:

In [1745]: l1 = [(str(i[0]), i[1]) for i in l]
In [1746]: m = set()

In [1747]: for i in search_text.split():
      ...:     if i.isnumeric():
      ...:         for item in l1:
      ...:             #print(item[0])
      ...:             if i in item[0]:
      ...:                 print("i is = ", i)
      ...:                 print("numeric part matches", item)
      ...:                 m.add(item)
      ...:     else:
      ...:         for item in l1:
      ...:             if i in item[1]:
      ...:                 print("text matches", item)
      ...:                 m.add(item)
      ...: 

In [1748]: m
Out[1748]: 
{('12', 'Jim Jones'),
 ('123', 'Mark Adam'),
 ('1234', 'Jim Beam'),
 ('13', 'Mark Smith'),
 ('23', 'Adam Smith')}

This returns the tuples that match, but I am not sure how to get a percentage for the relevance of each match.

Here's a way I would do it:

search_text = "12 Jim Smith"

results = {}

for item in l:
    percentage = 0
    percentage_increment = 100/(1 + len(item[1].split(" ")))
    for elm in set(search_text.split(" ")):
        if elm.isnumeric():
            if elm in str(item[0]):
                percentage += percentage_increment * len(elm) / len(str(item[0]))
        else:
            if elm in item[1]:
                percentage += percentage_increment
    results[item] = percentage

You will need to revisit the percentage calculation if your list can have a different shape from the one in your question. You can leave out the set if you can ensure that search_text doesn't contain the same name more than once.

I'm doing this on the assumption that, e.g., 13 and 1234 don't match at all.
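The loop above can also be wrapped in a function that sorts the result and leaves out tuples that don't match at all. This is only a sketch, not part of the original answer: the name score_items, the rounding to two decimals and the filtering of zero scores are assumptions made for illustration.

```python
def score_items(search_text, items):
    """Score each (number, name) tuple against search_text, as a percentage."""
    tokens = set(search_text.split())  # set() ignores repeated search words
    results = {}
    for number, name in items:
        # One slot for the number plus one slot per word in the name.
        increment = 100 / (1 + len(name.split()))
        score = 0.0
        for token in tokens:
            if token.isnumeric():
                if token in str(number):
                    # Partial credit: share of the number's digits that matched.
                    score += increment * len(token) / len(str(number))
            elif token in name:
                score += increment
        if score > 0:  # assumption: tuples without any match are dropped
            results[(number, name)] = round(score, 2)
    # Highest score first; dicts keep insertion order in Python 3.7+.
    return dict(sorted(results.items(), key=lambda kv: kv[1], reverse=True))


l = [(1234, 'Jim Beam'), (13, 'Mark Smith'), (12, 'Jim Jones'),
     (23, 'Adam Smith'), (15, 'Mark Taylor'), (123, 'Mark Adam')]

ranked = score_items('12 Jim Smith', l)
for item, score in ranked.items():
    print(item, score)
```

With the sample list this ranks (12, 'Jim Jones') first at 66.67, then (1234, 'Jim Beam') at 50.0, the two Smith entries at 33.33 each and (123, 'Mark Adam') at 22.22. Note that the `in` checks are case-sensitive substring tests, so 'smith' would not match 'Smith'.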

Here is another solution:

class TextMatch:
    def getAppended(self, items, delimiter=""):
        # Merge all elements of a tuple into one string to search in.
        return "".join(str(item) + delimiter for item in items)

    def sortTuple(self, tup):
        # Not essential: sort so that the item with the highest score comes first.
        return sorted(tup, key=lambda pair: pair[0], reverse=True)

    def search(self, search_text, list_to_search):
        # Main search algorithm.
        search_text_tokens = search_text.split(' ')  # split the search text into words
        res_list = []
        for item in list_to_search:  # use the argument, not the global list l
            appended = self.getAppended(item)
            match_nums = 0
            for token in search_text_tokens:
                # Check whether the token occurs anywhere in the merged tuple text.
                if token in appended:
                    match_nums += len(token)  # add the number of matched characters
            # Percentage of the tuple's characters covered by matching tokens.
            res_list.append((match_nums * 100 / len(appended), item))
        return self.sortTuple(res_list)
        

search_text = '12 Jim Smith'

l = [(1234, 'Jim Beam'),  (13, 'Mark Smith'),  (12, 'Jim Jones'), (23, 'Adam Smith'), (15, 'Mark Taylor'), (123, 'Mark Adam')]

matcher = TextMatch()

print("Result is :")
for item in matcher.search(search_text, l):
    print(item)

Output:

Result is :
(45.45454545454545, (12, 'Jim Jones'))
(41.666666666666664, (1234, 'Jim Beam'))
(41.666666666666664, (13, 'Mark Smith'))
(41.666666666666664, (23, 'Adam Smith'))
(16.666666666666668, (123, 'Mark Adam'))
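(0.0, (15, 'Mark Taylor'))

# l1 (the tuples with the number converted to str) and search_text are as defined in the question.
m = []
for item in l1:
    a, n = 0, 0  # a: tokens that matched, n: tokens checked
    for i in search_text.split():
        if i.isnumeric():
            if i in item[0]:  # numeric token: look in the number
                a += 1
        elif i in item[1]:  # word token: look in the name
            a += 1
        n += 1
    m.append((item, a / n))  # fraction of search tokens that matched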

m.sort(key=lambda item: -item[1])
for item in m:
    print(item)

Isn't this how search engines usually work? Compute a match score for each item, then sort.
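The "score each item, then sort" idea might be packaged as a small function like the one below. It is a sketch under assumptions rather than the answerer's own code: the name rank_matches, the conversion to a whole-number percentage and the omission of tuples with no match are all choices made here for illustration.

```python
def rank_matches(search_text, items):
    """Return (item, percent) pairs, best match first."""
    tokens = search_text.split()
    ranked = []
    for number, name in items:
        hits = 0
        for token in tokens:
            # Numeric tokens are looked up in the number, words in the name.
            haystack = str(number) if token.isnumeric() else name
            if token in haystack:
                hits += 1
        if hits:  # assumption: tuples with no match are left out
            ranked.append(((number, name), round(100 * hits / len(tokens))))
    # sorted() is stable, so ties keep their original list order.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


l = [(1234, 'Jim Beam'), (13, 'Mark Smith'), (12, 'Jim Jones'),
     (23, 'Adam Smith'), (15, 'Mark Taylor'), (123, 'Mark Adam')]

result = rank_matches('12 Jim Smith', l)
for item, percent in result:
    print(item, percent)
```

Because every matching token counts equally here, `12` inside `1234` earns full credit, so (1234, 'Jim Beam') ties with (12, 'Jim Jones') at 67. If an exact number should rank higher, partial credit by matched length is one way to break the tie.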
