Find all possible matches (in any order/sequence) of an input string to a list of tuples in Python

I want to match an input string against a list of tuples and find the top N closest matches in that list. The list has around 2000 items. The problem is that I used fuzzywuzzy's process.extract method, but it returns a huge number of tuples with the same confidence score. The quality of the matches is also poor. What I would like to do is get all the matches based on the words in my input (word order is not important).

Example: 
input string: 'fruit apple'
    
List of tuples = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

From here I want to find all strings in the list that contain both of the words 'fruit' and 'apple', in any order.

Expected output:
[('apple fruit', 91), ('the fruit is an apple', 34)]

I know that with fuzzywuzzy this is one line of code, but when the list of tuples to check against is very large, fuzzywuzzy assigns the same confidence score to unrelated items.

Attaching the code I have tried so far, for reference:

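import csv
import re

import nltk
from fuzzywuzzy import process
from nltk.corpus import stopwords


def preprocessing(fruit_string):
    stop_words = stopwords.words('english')
    # Drop slash-separated abbreviations such as 'n/a'
    fruit_string = re.sub(r'[a-z][/][a-z][/]*[a-z]{0,1}', '', fruit_string)
    # Drop everything except letters, digits and whitespace
    fruit_string = re.sub(r'[^A-Za-z0-9\s]+', '', fruit_string)
    return ' '.join(each_word for each_word in fruit_string.split() if each_word not in stop_words and len(each_word) > 2)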
    

#All possible fruit combination list
nrows=[]
with open("D:/fruits.csv", 'r') as csvfile: 
    csvreader = csv.reader(csvfile)
    fields = next(csvreader)
    for row in csvreader: 
        nrows.append(row)
        
flat_list = [item for items in nrows for item in items]        



def get_matching_fruits(input_raw_text):
    preprocessed_synonym = preprocessing(input_raw_text)
    text = nltk.word_tokenize(preprocessed_synonym)
    pos_tagged = nltk.pos_tag(text)
    nn = filter(lambda x:x[1]=='NN',pos_tagged)
    list_nn = list(nn)
    nnp = filter(lambda x:x[1]=='NNP',pos_tagged)
    list_nnp = list(nnp)
    nns = filter(lambda x:x[1]=='NNS',pos_tagged)
    list_nns = list(nns)
    comb_nouns = list_nn + list_nnp + list_nns
    input_nouns = [i[0] for i in comb_nouns]
    input_nouns= ' '.join(input_nouns)
    ratios = process.extract(input_nouns, flat_list, limit=1000)
    result = []    
    for i in ratios:
        if input_nouns in i[0]:
            result.append(i)
    return result    

get_matching_fruits('blue shaped pear was found today')

So, in my code, I want the result list to contain all possible matches for any given input. Any help with this would be very welcome.

The simplest solution for me is this.

foo = 'fruit apple'
bar = [('apple fruit', 91), 
       ('the fruit is an apple', 34), 
       ('banana apple', 78), 
       ('guava tree', 11), 
       ('delicious apple', 88)]

matches = []
for entry in bar:
    for word in foo.split():
        # break if we meet a point where the word isn't found
        if word not in entry[0]:
            break
    # the else is met if we didn't break from the for loop
    else:
        matches.append(entry)

print(matches)
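
One caveat with the loop above: the test "word not in entry[0]" is a substring test, so 'apple' would also be found inside an entry such as 'pineapple fruit'. Below is a minimal sketch of a whole-word variant, assuming words are separated by whitespace and the comparison is case-sensitive. The helper name contains_all_words and the extra 'pineapple fruit' entry are illustrative and not part of the original post.

```python
def contains_all_words(query, text):
    """Return True if every whitespace-separated word of query is a whole word in text."""
    text_words = set(text.split())
    return all(word in text_words for word in query.split())


foo = 'fruit apple'
bar = [('apple fruit', 91),
       ('the fruit is an apple', 34),
       ('banana apple', 78),
       ('pineapple fruit', 50),   # hypothetical entry: a substring test would accept it
       ('delicious apple', 88)]

# Keep only the entries whose text contains every word of the query
matches = [entry for entry in bar if contains_all_words(foo, entry[0])]
print(matches)
```

With the substring test, 'pineapple fruit' would be kept; with the whole-word test it is dropped, which may or may not be what you want depending on your data.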

Sorry if I didn't understand the question properly, but why do you even need the NLTK library to do this? It is a simple list comprehension problem.

In [1]: tup = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

In [2]: input_string = 'fruit apple'

In [3]: input_string_set =  set(input_string.split(' '))

In [4]: input_string_set
Out[4]: {'apple', 'fruit'}

In [10]: [t for t in tup if input_string_set.issubset(set(t[0].split(' ')))]
Out[10]: [('apple fruit', 91), ('the fruit is an apple', 34)]
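
The question also asks for the top N matches. Here is a sketch that wraps the subset test in a function and ranks the surviving entries. It assumes the second element of each tuple is a score where higher means better (the question does not say what the number represents) and that lowercasing both sides is acceptable. The name top_n_matches and its parameters are illustrative, not from the original posts.

```python
import heapq


def top_n_matches(query, candidates, n=5):
    """Keep candidates whose text has every query word, then return the n highest-scored."""
    query_words = set(query.lower().split())
    hits = [
        (text, score)
        for text, score in candidates
        if query_words.issubset(text.lower().split())
    ]
    # Assumption: the second tuple element is a score, higher meaning better.
    return heapq.nlargest(n, hits, key=lambda pair: pair[1])


tup = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78),
       ('guava tree', 11), ('delicious apple', 88)]

print(top_n_matches('Fruit apple', tup, n=1))
```

For a list of about 2000 items a plain scan like this should be fast enough; if the number means something else, replace the key function accordingly.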
