Python: How do I quantify the difference between two sets of words?
I have a use case:
search_text = '12 Jim Smith'
I have a list of tuples in which I need to search for the search_text above and return the matching tuples:
In [1742]: l
Out[1742]:
[(1234, 'Jim Beam'),
 (13, 'Mark Smith'),
 (12, 'Jim Jones'),
 (23, 'Adam Smith'),
 (15, 'Mark Taylor'),
 (123, 'Mark Adam')]
I need to return the matching tuples from l, ranked by how relevant each one is to the search, with a score like this:
Expected output:
In [1698]: ans
Out[1698]:
{
    (12, 'Jim Jones'): 66,   # matches ~66% of the search_text. `12` and `Jim` both match
    (1234, 'Jim Beam'): 55,  # matches ~55% of the search_text. `12` from `1234` and `Jim` both match
    (13, 'Mark Smith'): 33,  # matches ~33% of the search_text. `Smith` matches
    (23, 'Adam Smith'): 33,  # matches ~33% of the search_text. `Smith` matches
    (123, 'Mark Adam'): 20   # matches ~20%. Just `12` matches from `123`
}
Note: the values in the dict above are arbitrary; they only illustrate the match percentage.
Here is my attempt:
In [1745]: l1 = [(str(i[0]), i[1]) for i in l]
In [1746]: m = set()
In [1747]: for i in search_text.split():
      ...:     if i.isnumeric():
      ...:         for item in l1:
      ...:             #print(item[0])
      ...:             if i in item[0]:
      ...:                 print("i is = ", i)
      ...:                 print("numeric part matches", item)
      ...:                 m.add(item)
      ...:     else:
      ...:         for item in l1:
      ...:             if i in item[1]:
      ...:                 print("text matches", item)
      ...:                 m.add(item)
      ...:
In [1748]: m
Out[1748]:
{('12', 'Jim Jones'),
 ('123', 'Mark Adam'),
 ('1234', 'Jim Beam'),
 ('13', 'Mark Smith'),
 ('23', 'Adam Smith')}
This returns the matching tuples, but I'm not sure how to get the match relevance as a percentage.
Here is one way I would do it:
search_text = "12 Jim Smith"
results = {}
for item in l:
    percentage = 0
    # One equal share for the number, plus one for each word in the name.
    percentage_increment = 100/(1 + len(item[1].split(" ")))
    for elm in set(search_text.split(" ")):
        if elm.isnumeric():
            if elm in str(item[0]):
                # Partial credit, scaled by how much of the number is covered.
                percentage += percentage_increment * len(elm) / len(str(item[0]))
        else:
            if elm in item[1]:
                percentage += percentage_increment
    results[item] = percentage
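You will need to re-check the percentage calculation if your list can contain entries in forms other than those shown in the question. If you can guarantee that search_text never contains the same name more than once, you can omit the set.
I did this based on the assumption that, for example, 13 and 1234 do not match at all.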
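To make the percentage arithmetic above easier to check, the same loop can be wrapped in a function and run against the list from the question. This is only a sketch: the name `score_item` is made up for illustration, and it assumes matching stays case-sensitive, exactly as in the loop above.

```python
def score_item(search_text, item):
    """Score one (number, name) tuple against search_text, as a percentage."""
    number, name = str(item[0]), item[1]
    # One equal share for the number, plus one for each word in the name.
    increment = 100 / (1 + len(name.split(" ")))
    percentage = 0.0
    for token in set(search_text.split(" ")):
        if token.isnumeric():
            if token in number:
                # Partial credit: '12' covers 2 of the 4 digits in '1234'.
                percentage += increment * len(token) / len(number)
        elif token in name:
            percentage += increment
    return percentage


l = [(1234, 'Jim Beam'), (13, 'Mark Smith'), (12, 'Jim Jones'),
     (23, 'Adam Smith'), (15, 'Mark Taylor'), (123, 'Mark Adam')]

results = {item: score_item('12 Jim Smith', item) for item in l}
for item, pct in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(item, round(pct, 2))
```

With the question's data, (12, 'Jim Jones') comes out at about 66.67 and (1234, 'Jim Beam') at 50.0, because `12` covers only half of `1234`. Names with more words get a smaller share per word.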
Here is another solution:
class TextMatch:
    def getAppended(self, items, delimeter=""):
        # Merge all the elements of a tuple into one string to search in.
        str_res = ""
        for item in items:
            str_res = str_res + str(item) + delimeter
        return str_res

    def sortTuple(self, tup):
        # Not essential: sorts the results so the best match comes first.
        lst = len(tup)
        for i in range(0, lst):
            for j in range(0, lst-i-1):
                if (tup[j][0] < tup[j + 1][0]):
                    temp = tup[j]
                    tup[j] = tup[j + 1]
                    tup[j + 1] = temp
        return tup

    def search(self, search_text, list_to_search):
        # Main search algorithm.
        search_text_tokens = search_text.split(' ')  # split the search term into tokens (words)
        res_list = []
        for item in list_to_search:
            match_nums = 0
            for token in search_text_tokens:  # check whether each token occurs in the merged tuple
                if token in self.getAppended(item):
                    match_nums = match_nums + len(token)  # add the number of matched characters
            # percentage of the item's characters that were matched
            res_list.append((match_nums*100/len(self.getAppended(item)), item))
        return self.sortTuple(res_list)

search_text = '12 Jim Smith'
l = [(1234, 'Jim Beam'), (13, 'Mark Smith'), (12, 'Jim Jones'), (23, 'Adam Smith'), (15, 'Mark Taylor'), (123, 'Mark Adam')]
matcher = TextMatch()
print("Result is :")
for item in matcher.search(search_text, l):
    print(item)
Output:
Result is :
(45.45454545454545, (12, 'Jim Jones'))
(41.666666666666664, (1234, 'Jim Beam'))
(41.666666666666664, (13, 'Mark Smith'))
(41.666666666666664, (23, 'Adam Smith'))
(16.666666666666668, (123, 'Mark Adam'))
(0.0, (15, 'Mark Taylor'))
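The `sortTuple` method is a hand-written bubble sort. As a hedged aside, the built-in `sorted` should give the same ordering here, because Python's sort is stable and keeps tied items in their original order even with `reverse=True`. The list below copies the scores from the output above, rounded to two decimals, in the order the search loop would produce them before sorting.

```python
scored = [
    (41.67, (1234, 'Jim Beam')),
    (41.67, (13, 'Mark Smith')),
    (45.45, (12, 'Jim Jones')),
    (41.67, (23, 'Adam Smith')),
    (0.0, (15, 'Mark Taylor')),
    (16.67, (123, 'Mark Adam')),
]

# Sort by score only, highest first; ties keep their original relative order.
ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
for pair in ranked:
    print(pair)
```

Sorting on the score alone (rather than on the whole tuple) matters: it avoids comparing the items themselves when two scores are equal.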
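# Another option: score each item by the fraction of search tokens it matches.
# l1 and search_text are the ones defined in the question.
m = []
for item in l1:
    a, n = 0, 0  # a = tokens matched, n = tokens checked
    for i in search_text.split():
        if i.isnumeric():
            if i in item[0]:
                a += 1
            n += 1
        else:
            if i in item[1]:
                a += 1
            n += 1
    m.append((item, 1.0 * a / n))
m.sort(key=lambda item: -item[1])  # best match first
for item in m:
    print(item)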
Isn't this how search engines usually work? Score each item against the query, then sort.
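Since the question title asks about the difference between two sets of words, the match-and-sort idea can also be written with Python set operations. The sketch below uses Jaccard similarity (shared tokens divided by all distinct tokens), which is a swapped-in measure and not something any answer above proposed; the function name `jaccard` is made up for illustration. It assumes only whole-token matches count, so `12` does not match `1234` here.

```python
def jaccard(search_text, item):
    """Return |A & B| / |A | B| for the token sets of the query and the item."""
    query_tokens = set(search_text.split())
    item_tokens = {str(item[0])} | set(item[1].split())
    union = query_tokens | item_tokens
    return len(query_tokens & item_tokens) / len(union) if union else 0.0


l = [(1234, 'Jim Beam'), (13, 'Mark Smith'), (12, 'Jim Jones'),
     (23, 'Adam Smith'), (15, 'Mark Taylor'), (123, 'Mark Adam')]
search_text = '12 Jim Smith'

# Score every item, then sort with the best match first.
ranked = sorted(l, key=lambda item: jaccard(search_text, item), reverse=True)
for item in ranked:
    print(item, round(jaccard(search_text, item), 2))
```

Under this measure (12, 'Jim Jones') scores 0.5 (two shared tokens out of four distinct ones), while (123, 'Mark Adam') scores 0.0 because no whole token is shared. Whether partial numeric matches should count is a design choice the question leaves open.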