
Find all possible matches (in any order/sequence) of an input string to a list of tuples in Python

I want to match an input string against a list of tuples and find the top N closest matches. The list of tuples has around 2000 items. The problem I am facing is that I have used the fuzzywuzzy process.extract method, but it returns a huge number of tuples with the same confidence score, and the quality of the matches is not good. What I would like to do is get all the matches based on my input (word order is not important).

Example: 
input string: 'fruit apple'
    
List of tuples = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

From here I want to find all strings in the list which contain both words of 'fruit apple', in any order.

Expected output:
[('apple fruit', 91), ('the fruit is an apple', 34)]

I know that with fuzzywuzzy it's one line of code, but the issue is that when the list of tuples to check against is very large, fuzzywuzzy assigns the same confidence score to unrelated items.
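
For reference, a minimal sketch of that one-liner, assuming plain strings are matched (the scores in the tuples are ignored here) and using fuzz.token_set_ratio, which compares word sets so word order does not matter:

from fuzzywuzzy import fuzz, process

choices = ['apple fruit', 'the fruit is an apple', 'banana apple',
           'guava tree', 'delicious apple']
# token_set_ratio compares word sets, so 'fruit apple' vs 'apple fruit' scores 100
top_matches = process.extract('fruit apple', choices,
                              scorer=fuzz.token_set_ratio, limit=3)
print(top_matches)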

Attaching the code I have tried so far, for reference:

import csv
import re

import nltk
from fuzzywuzzy import process
from nltk.corpus import stopwords


def preprocessing(fruit_string):
    stop_words = stopwords.words('english')
    # strip slash-joined fragments like 'a/b' or 'a/b/c', then all non-alphanumerics
    fruit_string = re.sub(r'[a-z][/][a-z][/]*[a-z]{0,1}', '', fruit_string)
    fruit_string = re.sub(r'[^A-Za-z0-9\s]+', '', fruit_string)
    # drop stop words and words shorter than 3 characters
    return ' '.join(each_word for each_word in fruit_string.split() if each_word not in stop_words and len(each_word) > 2)
    

# All possible fruit combinations, loaded from a CSV file
nrows = []
with open("D:/fruits.csv", 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    fields = next(csvreader)  # skip the header row
    for row in csvreader:
        nrows.append(row)

# flatten the rows into a single list of strings
flat_list = [item for items in nrows for item in items]



def get_matching_fruits(input_raw_text):
    preprocessed_synonym = preprocessing(input_raw_text)
    text = nltk.word_tokenize(preprocessed_synonym)
    pos_tagged = nltk.pos_tag(text)
    # keep only the nouns: singular (NN), proper (NNP) and plural (NNS)
    list_nn = [tag for tag in pos_tagged if tag[1] == 'NN']
    list_nnp = [tag for tag in pos_tagged if tag[1] == 'NNP']
    list_nns = [tag for tag in pos_tagged if tag[1] == 'NNS']
    comb_nouns = list_nn + list_nnp + list_nns
    input_nouns = ' '.join(i[0] for i in comb_nouns)
    ratios = process.extract(input_nouns, flat_list, limit=1000)
    result = []
    for i in ratios:
        # NOTE: a substring test, so this only matches when the nouns
        # appear in this exact order inside the candidate string
        if input_nouns in i[0]:
            result.append(i)
    return result

get_matching_fruits('blue shaped pear was found today')

So, in my code, I want the result list to contain all the possible matches for any given input. Any help on this will be highly welcomed.
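
A sketch of what such an order-insensitive filter could look like, reusing input_nouns and ratios from get_matching_fruits above (it would replace only the final loop):

def filter_any_order(input_nouns, ratios):
    # a candidate matches if it contains every extracted noun, in any order
    wanted = set(input_nouns.split())
    return [r for r in ratios if wanted.issubset(set(r[0].split()))]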

The simplest solution for me is this.

foo = 'fruit apple'
bar = [('apple fruit', 91), 
       ('the fruit is an apple', 34), 
       ('banana apple', 78), 
       ('guava tree', 11), 
       ('delicious apple', 88)]

matches = []
for entry in bar:
    for word in foo.split():
        # break if we meet a point where the word isn't found
        if word not in entry[0]:
            break
    # the else is met if we didn't break from the for loop
    else:
        matches.append(entry)

print(matches)
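
Note that 'word not in entry[0]' is a substring test, so for example 'apple' would also match 'pineapple'. A minimal variant, assuming whole-word matching is what you want, compares against the split words instead:

matches = []
for entry in bar:
    words = entry[0].split()
    # require every query word to appear as a whole word, in any order
    if all(word in words for word in foo.split()):
        matches.append(entry)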

Sorry if I didn't quite understand the question properly, but why do you even need the NLTK library to do this? This is a simple list comprehension problem:

In [1]: tup = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

In [2]: input_string = 'fruit apple'

In [3]: input_string_set = set(input_string.split(' '))

In [4]: input_string_set
Out[4]: {'apple', 'fruit'}

In [10]: [t for t in tup if input_string_set.issubset(set(t[0].split(' ')))]
Out[10]: [('apple fruit', 91), ('the fruit is an apple', 34)]
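
To get the top N of those, as the question asks, the matches can then be sorted by their score (a sketch in the same session; N = 2 is illustrative):

In [11]: matches = [t for t in tup if input_string_set.issubset(set(t[0].split(' ')))]

In [12]: sorted(matches, key=lambda t: t[1], reverse=True)[:2]
Out[12]: [('apple fruit', 91), ('the fruit is an apple', 34)]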

