Find all possible matches of an input string in a list of tuples in Python (in any word order)

Question

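I want to match an input string against a list of tuples and find the top N closest matches. The list of tuples has about 2000 items. The problem is that I am using the fuzzywuzzy process.extract method, but it returns a large number of tuples with the same confidence score. The match quality is also poor. What I want is to get all matches for my input, where word order does not matter.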

Example: 
input string: 'fruit apple'
    
List of tuples = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

From this list I want to find every string that contains the words "fruit" and "apple" in any order.

Expected output:
[('apple fruit', 91), ('the fruit is an apple', 34)]

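I know fuzzywuzzy does this in one line of code, but when the list of tuples to check is very large, fuzzywuzzy assigns the same confidence score to unrelated items.

The code I have tried so far is attached for reference: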

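import csv
import re

import nltk
from fuzzywuzzy import process
from nltk.corpus import stopwords

def preprocessing(fruit_string):
    # Drop stop words, slash-separated abbreviations, punctuation and words of 1-2 letters
    stop_words = stopwords.words('english')
    fruit_string = re.sub(r'[a-z][/][a-z][/]*[a-z]{0,1}', '', fruit_string)
    fruit_string = re.sub(r'[^A-Za-z0-9\s]+', '', fruit_string)
    return ' '.join(each_word for each_word in fruit_string.split() if each_word not in stop_words and len(each_word) > 2)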
    

#All possible fruit combination list
nrows=[]
with open("D:/fruits.csv", 'r') as csvfile: 
    csvreader = csv.reader(csvfile)
    fields = next(csvreader)
    for row in csvreader: 
        nrows.append(row)
        
flat_list = [item for items in nrows for item in items]        



def get_matching_fruits(input_raw_text):
    preprocessed_synonym = preprocessing(input_raw_text)
    text = nltk.word_tokenize(preprocessed_synonym)
    pos_tagged = nltk.pos_tag(text)
    nn = filter(lambda x:x[1]=='NN',pos_tagged)
    list_nn = list(nn)
    nnp = filter(lambda x:x[1]=='NNP',pos_tagged)
    list_nnp = list(nnp)
    nns = filter(lambda x:x[1]=='NNS',pos_tagged)
    list_nns = list(nns)
    comb_nouns = list_nn + list_nnp + list_nns
    input_nouns = [i[0] for i in comb_nouns]
    input_nouns= ' '.join(input_nouns)
    ratios = process.extract(input_nouns, flat_list, limit=1000)
    result = []    
    for i in ratios:
        if input_nouns in i[0]:
            result.append(i)
    return result    

get_matching_fruits('blue shaped pear was found today')

So in my code I want the result list to contain all possible matches for any given input. Any help with this would be highly appreciated.

Answer 1

To me, the simplest solution is this.

foo = 'fruit apple'
bar = [('apple fruit', 91), 
       ('the fruit is an apple', 34), 
       ('banana apple', 78), 
       ('guava tree', 11), 
       ('delicious apple', 88)]

matches = []
for entry in bar:
    for word in foo.split():
        # break if we meet a point where the word isn't found
        if word not in entry[0]:
            break
    # the else is met if we didn't break from the for loop
    else:
        matches.append(entry)

print(matches)
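
One caveat with the loop above: `word not in entry[0]` is a substring test, so 'apple' would also count as found inside 'pineapple'. The same logic can be written more compactly with all(). This is only a sketch of that idea; the function name find_matches is made up for illustration.

```python
def find_matches(query, entries):
    """Return the entries whose text contains every word of query as a substring."""
    words = query.split()
    return [entry for entry in entries
            if all(word in entry[0] for word in words)]


bar = [('apple fruit', 91),
       ('the fruit is an apple', 34),
       ('banana apple', 78),
       ('guava tree', 11),
       ('delicious apple', 88)]

print(find_matches('fruit apple', bar))
# [('apple fruit', 91), ('the fruit is an apple', 34)]

# Substring matching also accepts partial words:
print(find_matches('apple', [('pineapple juice', 1)]))
# [('pineapple juice', 1)]
```

If partial-word hits like the last one are unwanted, comparing sets of whole words avoids them.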

Answer 2

Sorry if I have only partly understood the question, but why do you even need the NLTK library for this? It is a simple list comprehension problem.

In [1]: tup = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

In [2]: input_string = 'fruit apple'

In [3]: input_string_set =  set(input_string.split(' '))

In [4]: input_string_set
Out[4]: {'apple', 'fruit'}

In [10]: [t for t in tup if input_string_set.issubset(set(t[0].split(' ')))]
Out[10]: [('apple fruit', 91), ('the fruit is an apple', 34)]

In [11]:
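
Since the question also asks for the top N matches, the set-based test can be combined with a sort on the second tuple element. The following is a minimal sketch under some assumptions that are not in the original posts: the names tokenize and top_matches are hypothetical, the second element of each tuple is treated as a score to rank by, and matching is made case-insensitive and punctuation-tolerant.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r'[a-z0-9]+', text.lower()))


def top_matches(query, entries, n=None):
    """Return entries containing every query word, highest score first.

    Assumes entry[1] is a numeric score; n=None returns all matches.
    """
    wanted = tokenize(query)
    hits = [entry for entry in entries if wanted <= tokenize(entry[0])]
    hits.sort(key=lambda entry: entry[1], reverse=True)
    return hits[:n]


tup = [('apple fruit', 91), ('the fruit is an apple', 34),
       ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

print(top_matches('Fruit Apple', tup))
# [('apple fruit', 91), ('the fruit is an apple', 34)]
print(top_matches('fruit apple', tup, n=1))
# [('apple fruit', 91)]
```

Because whole words are compared, 'apple' does not match 'pineapple' here, unlike a plain substring check.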

Answer 1
1 accepted 2020-08-13 16:57:35

Answer 2
1 2020-08-13 16:59:09