
What can I use to find common words in two lists? Python

I want to find the words that appear in both of two lists. I have two lists of words; in text_list I have also stemmed the words.

text_list = [['i', 'am', 'interest' ,'for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

So I need this output:

same_words = ['a', 'sentence', 'interest']

You need to apply stemming to both lists. There are discrepancies otherwise: for example, interesting vs. interest, and if you apply stemming only to words_list, then sentence becomes sentenc and no longer matches the text. Therefore, apply the stemmer to both lists and then find the common elements:

from nltk.stem import PorterStemmer

text_list = [['i', 'am', 'interest','for', 'this', 'subject'], ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

ps = PorterStemmer()

# Stem both lists so that e.g. 'interesting' and 'interest' compare equal
words_list = [ps.stem(w) for w in words_list]
text_list = [list(map(ps.stem, i)) for i in text_list]

# Collect the common elements per sentence, then flatten
answer = []
for i in text_list:
    answer.append(list(set(words_list).intersection(i)))

output = sum(answer, [])
print(output)

>>> ['interest', 'a', 'sentenc']
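Note that the output contains the stemmed forms (sentenc, not sentence). If you would rather report the matches with the original spellings from words_list, you can keep a mapping from each stem back to the word it came from. A minimal sketch of that idea (the stem_to_word name is my own):

```python
from nltk.stem import PorterStemmer

text_list = [['i', 'am', 'interest', 'for', 'this', 'subject'],
             ['this', 'is', 'a', 'second', 'sentence']]
words_list = ['a', 'word', 'sentence', 'interesting']

ps = PorterStemmer()

# Map each stem back to the original word it came from
stem_to_word = {ps.stem(w): w for w in words_list}

# Stem every word in the text, then report the original spelling
# of each words_list entry whose stem occurs in the text
stemmed_text = {ps.stem(w) for sentence in text_list for w in sentence}
matches = [word for stem, word in stem_to_word.items() if stem in stemmed_text]
print(matches)  # ['a', 'sentence', 'interesting']
```

This reports the spellings as they appear in words_list; swap the dict around if you want the spellings from the text instead.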

There is a package called fuzzywuzzy which lets you match the strings from one list against the strings from another list approximately.

First of all, you will need to flatten your nested list into a set of unique strings.

from itertools import chain
newset = set(chain(*text_list))

{'sentence', 'i', 'interest', 'am', 'is', 'for', 'a', 'second', 'subject', 'this'}

Next, import the fuzz module from the fuzzywuzzy package.

from fuzzywuzzy import fuzz

result = [max([(fuzz.token_set_ratio(i,j),j) for j in newset]) for i in words_list]

[(100, 'a'), (57, 'for'), (100, 'sentence'), (84, 'interest')]

Here, fuzz.token_set_ratio matches every element of words_list against all the elements of newset and returns the percentage of character overlap between the two strings; max then keeps the best-scoring match for each word. You can remove the max to see the full list of scores. (Some of the letters of for also occur in word, which is why it appears in this tuple list with a 57% match. You can later use a loop with a percentage tolerance to drop matches that score below that tolerance.)
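The tolerance filter mentioned above can be sketched as follows, reusing the result list computed earlier (the cutoff of 70 is an arbitrary choice; tune it to your data):

```python
# The (score, match) tuples produced by the fuzz.token_set_ratio step above
result = [(100, 'a'), (57, 'for'), (100, 'sentence'), (84, 'interest')]

threshold = 70  # arbitrary tolerance: keep only sufficiently close matches
filtered = [(score, word) for score, word in result if score >= threshold]
print(filtered)  # [(100, 'a'), (100, 'sentence'), (84, 'interest')]
```

This drops the spurious (57, 'for') entry while keeping the genuine matches.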

Finally, use zip and map to split the scores and the matched words into your desired output.

similarity_score, fuzzy_match = map(list,zip(*result))

fuzzy_match
Out[40]: ['a', 'for', 'sentence', 'interest']

Extra

If your input is not plain ASCII, you can pass an extra argument to fuzz.token_set_ratio:

a = ['У', 'вас', 'є', 'чашка', 'кави?']

b = ['ви']

[max([(fuzz.token_set_ratio(i, j, force_ascii=False), j) for j in a]) for i in b]
Out[9]: [(67, 'кави?')]
