Python: How do I quantify the difference between two sets of words?

I have a use case:

search_text = '12 Jim Smith'

I have a list of tuples, in which I need to search for the search_text above and return the matching tuples:

In [1742]: l
Out[1742]: 
[(1234, 'Jim Beam'),
 (13, 'Mark Smith'),
 (12, 'Jim Jones'),
 (23, 'Adam Smith'),
 (15, 'Mark Taylor'),
 (123, 'Mark Adam')]

I need to return the matching tuples from l, each with a score showing how relevant it is to the search, like this:

Expected output:

In [1698]: ans
Out[1698]: 
{
 (12, 'Jim Jones'):  66, # matches ~66% of the search_text. `12` and `Jim` both match
 (1234, 'Jim Beam'): 55, # matches ~55% of the search_text. `12` from `1234` and `Jim` both match
 (13, 'Mark Smith'): 33, # matches ~33% of the search_text. `Smith` matches
 (23, 'Adam Smith'): 33, # matches ~33% of the search_text. `Smith` matches
 (123, 'Mark Adam'): 20  # matches ~20% of the search_text. Just `12` matches from `123`
} 

Note: the values in the dict above are arbitrary. They only illustrate the percentage matched.

Here is my attempt:

In [1745]: l1 = [(str(i[0]), i[1]) for i in l]
In [1746]: m = set()

In [1747]: for i in search_text.split():
      ...:     if i.isnumeric():
      ...:         for item in l1:
      ...:             #print(item[0])
      ...:             if i in item[0]:
      ...:                 print("i is = ", i)
      ...:                 print("numeric part matches", item)
      ...:                 m.add(item)
      ...:     else:
      ...:         for item in l1:
      ...:             if i in item[1]:
      ...:                 print("text matches", item)
      ...:                 m.add(item)
      ...: 

In [1748]: m
Out[1748]: 
{('12', 'Jim Jones'),
 ('123', 'Mark Adam'),
 ('1234', 'Jim Beam'),
 ('13', 'Mark Smith'),
 ('23', 'Adam Smith')}

This returns the matching tuples, but I'm not sure how to get a percentage for the match relevance.

Here is one way I would do it:

search_text = "12 Jim Smith"  # capital S as in the question; the `in` checks below are case-sensitive

results = {}

for item in l:
    percentage = 0
    percentage_increment = 100/(1 + len(item[1].split(" ")))
    for elm in set(search_text.split(" ")):
        if elm.isnumeric():
            if elm in str(item[0]):
                percentage += percentage_increment * len(elm) / len(str(item[0]))
        else:
            if elm in item[1]:
                percentage += percentage_increment
    results[item] = percentage

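You will need to review the percentage calculation, depending on whether your list can contain entries in forms other than those shown in the question. If you can be sure that search_text never contains the same name more than once, you can omit the set.

I did this based on the assumption that, for example, 131234 does not match at all.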
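To make the arithmetic easier to check, here is a minimal sketch of the same scoring wrapped in a function. The names `score_items` and `items` are hypothetical and not part of the answer above; the logic is meant to mirror it: each tuple gets one equal share for its number plus one share per word in its name, and a numeric token earns partial credit in proportion to the digits it covers.

```python
def score_items(search_text, items):
    """Score each (number, name) tuple against search_text, as a percentage."""
    tokens = set(search_text.split())
    results = {}
    for number, name in items:
        # One equal share for the number plus one share per word in the name.
        share = 100 / (1 + len(name.split()))
        score = 0.0
        for token in tokens:
            if token.isnumeric():
                if token in str(number):
                    # Partial credit: matched digits over total digits.
                    score += share * len(token) / len(str(number))
            elif token in name:
                # Case-sensitive substring match against the name.
                score += share
        results[(number, name)] = score
    return results


items = [(1234, 'Jim Beam'), (13, 'Mark Smith'), (12, 'Jim Jones'),
         (23, 'Adam Smith'), (15, 'Mark Taylor'), (123, 'Mark Adam')]
scores = score_items('12 Jim Smith', items)
for item, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(item, round(score, 1))
```

With the list from the question, (12, 'Jim Jones') comes out at about 66.7 (two of three shares) and (1234, 'Jim Beam') at 50.0, because `12` covers only half of the digits in `1234`.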

Here is another solution:

class TextMatch:
    def getAppended(self, items, delimeter=""):#Function to merge all the elements of the tuples to a string to search from
        str_res = ""
        for item in items:
            str_res = str_res + str(item) + delimeter
        return str_res

    def sortTuple(self, tup):#not very important just to sort the result so that the item with maximum match is at the top
        lst = len(tup)  
        for i in range(0, lst):  
            for j in range(0, lst-i-1):
                if (tup[j][0] < tup[j + 1][0]):  
                    temp = tup[j]  
                    tup[j]= tup[j + 1]  
                    tup[j + 1]= temp  
        return tup  

    def search(self, search_text, list_to_search):#main search algorithm
        search_text_tokens = search_text.split(' ')#split the search term into tokens or words
        res_list = []
        for item in list_to_search:#iterate over the argument rather than the global l
            match_nums = 0
            for token in search_text_tokens:#checks for the occurrence of the tokens in each element of the tuple
                if token in self.getAppended(item):
                    match_nums = match_nums + len(token)#add the length of matched characters
            res_list.append( (match_nums*100/len(self.getAppended(item)) , item) ) #find percentage of characters matching for that particular case
        return self.sortTuple(res_list)
        

search_text = '12 Jim Smith'

l = [(1234, 'Jim Beam'),  (13, 'Mark Smith'),  (12, 'Jim Jones'), (23, 'Adam Smith'), (15, 'Mark Taylor'), (123, 'Mark Adam')]

matcher = TextMatch()

print("Result is :")
for item in matcher.search(search_text, l):
    print(item)

Output:

Result is :
(45.45454545454545, (12, 'Jim Jones'))
(41.666666666666664, (1234, 'Jim Beam'))
(41.666666666666664, (13, 'Mark Smith'))
(41.666666666666664, (23, 'Adam Smith'))
(16.666666666666668, (123, 'Mark Adam'))
(0.0, (15, 'Mark Taylor'))

m = []
for item in l1:
    a, n = 0, 0
    for i in search_text.split():
        if i.isnumeric():
            if i in item[0]:
                a += 1
            n += 1
        else:
            if i in item[1]:
                a += 1
            n += 1
    m.append((item, 1.0 * a / n))

m.sort(key=lambda item: -item[1])
for item in m:
    print(item)

Isn't this how search engines usually work? Match each term, then sort.
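The snippet above relies on `l1` from the question. As a rough, self-contained sketch of the same "match each term, then sort" idea, the function below scores each tuple by the fraction of search tokens it matches. The names `rank_matches` and `items` are hypothetical, and treating every token as equally important is an assumption of this sketch.

```python
def rank_matches(search_text, items):
    """Return (item, percent) pairs, best match first."""
    tokens = search_text.split()
    ranked = []
    for number, name in items:
        matched = 0
        for token in tokens:
            # Numeric tokens are looked up in the number, all others in the name.
            haystack = str(number) if token.isnumeric() else name
            if token in haystack:
                matched += 1
        # Percentage of search tokens found in this tuple.
        percent = 100 * matched / len(tokens) if tokens else 0.0
        ranked.append(((number, name), percent))
    # list.sort is stable, so ties keep their original order.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


items = [(1234, 'Jim Beam'), (13, 'Mark Smith'), (12, 'Jim Jones'),
         (23, 'Adam Smith'), (15, 'Mark Taylor'), (123, 'Mark Adam')]
ranked = rank_matches('12 Jim Smith', items)
for item, percent in ranked:
    print(item, round(percent, 1))
```

Because this counts whole tokens only, (1234, 'Jim Beam') and (12, 'Jim Jones') tie at two tokens out of three. If an exact number should rank above a partial one, the digit-proportion weighting from the first answer could be combined with it.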

